summaryrefslogtreecommitdiffstats
path: root/include (follow)
Commit message (Collapse)AuthorAgeFilesLines
* sched, arch: Create asm/preempt.hPeter Zijlstra2013-09-252-48/+55
| | | | | | | | | | In order to prepare to per-arch implementations of preempt_count move the required bits into an asm-generic header and use this for all archs. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-h5j0c1r3e3fk015m30h8f1zx@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Add NEED_RESCHED to the preempt_countPeter Zijlstra2013-09-252-7/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to combine the preemption and need_resched test we need to fold the need_resched information into the preempt_count value. Since the NEED_RESCHED flag is set across CPUs this needs to be an atomic operation, however we very much want to avoid making preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED infrastructure in place but at 3 sites test it and fold its value into preempt_count; namely: - resched_task() when setting TIF_NEED_RESCHED on the current task - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a remote task it follows it up with a reschedule IPI and we can modify the cpu local preempt_count from there. - cpu_idle_loop() for when resched_task() found tsk_is_polling(). We use an inverted bitmask to indicate need_resched so that a 0 means both need_resched and !atomic. Also remove the barrier() in preempt_enable() between preempt_enable_no_resched() and preempt_check_resched() to avoid having to reload the preemption value and allow the compiler to use the flags of the previuos decrement. I couldn't come up with any sane reason for this barrier() to be there as preempt_enable_no_resched() already has a barrier() before doing the decrement. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Introduce preempt_count accessor functionsPeter Zijlstra2013-09-251-6/+19
| | | | | | | | | | | | | | | | Replace the single preempt_count() 'function' that's an lvalue with two proper functions: preempt_count() - returns the preempt_count value as rvalue preempt_count_set() - Allows setting the preempt-count value Also provide preempt_count_ptr() as a convenience wrapper to implement all modifying operations. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org [ Fixed build failure. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched, idle: Fix the idle polling state logicPeter Zijlstra2013-09-252-7/+73
| | | | | | | | | | | | | | | | | | | | | | | | | Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") regressed several workloads and caused excessive reschedule interrupts. The patch in question failed to notice that the x86 code had an inverted sense of the polling state versus the new generic code (x86: default polling, generic: default !polling). Fix the two prominent x86 mwait based idle drivers and introduce a few new generic polling helpers (fixing the wrong smp_mb__after_clear_bit usage). Also switch the idle routines to using tif_need_resched() which is an immediate TIF_NEED_RESCHED test as opposed to need_resched which will end up being slightly different. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: lenb@kernel.org Cc: tglx@linutronix.de Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Remove {set,clear}_need_reschedPeter Zijlstra2013-09-251-2/+13
| | | | | | | | | | | | | | | | | | Preemption semantics are going to change which mandate a change. All DRM usage sites are already broken and will not be affected (much) by this change. DRM people are aware and will remove the last few stragglers. For now, leave an empty stub that generates a warning, once all users are gone we can remove this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: airlied@linux.ie Cc: daniel.vetter@ffwll.ch Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/n/tip-qfc1el2zvhxiyut4ai99ij4n@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/balancing: Periodically decay max cost of idle balanceJason Low2013-09-202-0/+6
| | | | | | | | | | | This patch builds on patch 2 and periodically decays that max value to do idle balancing per sched domain by approximately 1% per second. Also decay the rq's max_idle_balance_cost value. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/balancing: Consider max cost of idle balance per sched domainJason Low2013-09-202-0/+4
| | | | | | | | | | | | | | | | | | | In this patch, we keep track of the max cost we spend doing idle load balancing for each sched domain. If the avg time the CPU remains idle is less then the time we have already spent on idle balancing + the max cost of idle balancing in the sched domain, then we don't continue to attempt the balance. We also keep a per rq variable, max_idle_balance_cost, which keeps track of the max time spent on newidle load balances throughout all its domains so that we can determine the avg_idle's max value. By using the max, we avoid overrunning the average. This further reduces the chance we attempt balancing when the CPU is not idle for longer than the cost to balance. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2013-09-181-1/+1
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two small fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix UAPI export of PERF_EVENT_IOC_ID perf/x86/intel: Fix Silvermont offcore masks
| * perf: Fix UAPI export of PERF_EVENT_IOC_IDVince Weaver2013-09-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Without the following patch I have problems compiling code using the new PERF_EVENT_IOC_ID ioctl(). It looks like u64 was used instead of __u64 Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1309171450380.11444@vincent-weaver-1.um.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-09-181-0/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Gleb Natapov. * 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: set "blocked by NMI" flag if EPT violation happens during IRET from NMI kvm: free resources after canceling async_pf KVM: nEPT: reset PDPTR register cache on nested vmentry emulation KVM: mmu: allow page tables to be in read-only slots KVM: x86 emulator: emulate RETF imm
| * | KVM: mmu: allow page tables to be in read-only slotsPaolo Bonzini2013-09-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Page tables in a read-only memory slot will currently cause a triple fault because the page walker uses gfn_to_hva and it fails on such a slot. OVMF uses such a page table; however, real hardware seems to be fine with that as long as the accessed/dirty bits are set. Save whether the slot is readonly, and later check it when updating the accessed and dirty bits. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-181-0/+4
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: "Fixes for CVE-2013-2897, CVE-2013-2895, CVE-2013-2897, CVE-2013-2894, CVE-2013-2893, CVE-2013-2891, CVE-2013-2890, CVE-2013-2889. All the bugs are triggerable only by specially crafted evil-on-purpose HW devices. Fixes by Kees Cook and Benjamin Tissoires" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: HID: lenovo-tpkbd: fix leak if tpkbd_probe_tp fails HID: multitouch: validate indexes details HID: logitech-dj: validate output report details HID: validate feature and input report details HID: lenovo-tpkbd: validate output report details HID: LG: validate HID output report details HID: steelseries: validate output report details HID: sony: validate HID output report details HID: zeroplus: validate output report details HID: provide a helper for validating hid reports
| * | HID: provide a helper for validating hid reportsKees Cook2013-09-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many drivers need to validate the characteristics of their HID report during initialization to avoid misusing the reports. This adds a common helper to perform validation of the report exisitng, the field existing, and the expected number of values within the field. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | Merge branch 'timers/core' of ↵Linus Torvalds2013-09-161-16/+0
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer code update from Thomas Gleixner: - armada SoC clocksource overhaul with a trivial merge conflict - Minor improvements to various SoC clocksource drivers * 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: armada-370-xp: Add detailed clock requirements in devicetree binding clocksource: armada-370-xp: Get reference fixed-clock by name clocksource: armada-370-xp: Replace WARN_ON with BUG_ON clocksource: armada-370-xp: Fix device-tree binding clocksource: armada-370-xp: Introduce new compatibles clocksource: armada-370-xp: Use CLOCKSOURCE_OF_DECLARE clocksource: armada-370-xp: Simplify TIMER_CTRL register access clocksource: armada-370-xp: Use BIT() ARM: timer-sp: Set dynamic irq affinity ARM: nomadik: add dynamic irq flag to the timer clocksource: sh_cmt: 32-bit control register support clocksource: em_sti: Convert to devm_* managed helpers
| * | | clocksource: armada-370-xp: Use CLOCKSOURCE_OF_DECLAREEzequiel Garcia2013-09-021-18/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is almost cosmetic: we achieve a bit of consistency with other clocksource drivers by using the CLOCKSOURCE_OF_DECLARE macro for the boilerplate code. Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
* | | | Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2013-09-151-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull misc SCSI driver updates from James Bottomley: "This patch set is a set of driver updates (megaraid_sas, fnic, lpfc, ufs, hpsa) we also have a couple of bug fixes (sd out of bounds and ibmvfc error handling) and the first round of esas2r checker fixes and finally the much anticipated big endian additions for megaraid_sas" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (47 commits) [SCSI] fnic: fnic Driver Tuneables Exposed through CLI [SCSI] fnic: Kernel panic while running sh/nosh with max lun cfg [SCSI] fnic: Hitting BUG_ON(io_req->abts_done) in fnic_rport_exch_reset [SCSI] fnic: Remove QUEUE_FULL handling code [SCSI] fnic: On system with >1.1TB RAM, VIC fails multipath after boot up [SCSI] fnic: FC stat param seconds_since_last_reset not getting updated [SCSI] sd: Fix potential out-of-bounds access [SCSI] lpfc 8.3.42: Update lpfc version to driver version 8.3.42 [SCSI] lpfc 8.3.42: Fixed issue of task management commands having a fixed timeout [SCSI] lpfc 8.3.42: Fixed inconsistent spin lock usage. [SCSI] lpfc 8.3.42: Fix driver's abort loop functionality to skip IOs already getting aborted [SCSI] lpfc 8.3.42: Fixed failure to allocate SCSI buffer on PPC64 platform for SLI4 devices [SCSI] lpfc 8.3.42: Fix WARN_ON when driver unloads [SCSI] lpfc 8.3.42: Avoided making pci bar ioremap call during dual-chute WQ/RQ pci bar selection [SCSI] lpfc 8.3.42: Fixed driver iocbq structure's iocb_flag field running out of space [SCSI] lpfc 8.3.42: Fix crash on driver load due to cpu affinity logic [SCSI] lpfc 8.3.42: Fixed logging format of setting driver sysfs attributes hard to interpret [SCSI] lpfc 8.3.42: Fixed back to back RSCNs discovery failure. [SCSI] lpfc 8.3.42: Fixed race condition between BSG I/O dispatch and timeout handling [SCSI] lpfc 8.3.42: Fixed function mode field defined too small for not recognizing dual-chute mode ...
| * | | | [SCSI] hpsa: add HP Smart Array Gen9 PCI ID'sMike Miller2013-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the PCI ID's for HP Smart Array Gen9 controllers. Please consider this patch for inclusion. Signed-off-by: Mike Miller <mike.miller@hp.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
* | | | | Merge branch 'slab/next' of ↵Linus Torvalds2013-09-154-279/+124
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB update from Pekka Enberg: "Nothing terribly exciting here apart from Christoph's kmalloc unification patches that brings sl[aou]b implementations closer to each other" * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slab: Use correct GFP_DMA constant slub: remove verify_mem_not_deleted() mm/sl[aou]b: Move kmallocXXX functions to common code mm, slab_common: add 'unlikely' to size check of kmalloc_slab() mm/slub.c: beautify code for removing redundancy 'break' statement. slub: Remove unnecessary page NULL check slub: don't use cpu partial pages on UP mm/slub: beautify code for 80 column limitation and tab alignment mm/slub: remove 'per_cpu' which is useless variable
| * | | | | slab: Use correct GFP_DMA constantChristoph Lameter2013-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Thu, 5 Sep 2013, kbuild test robot wrote: > >> include/linux/slab.h:433:53: sparse: restricted gfp_t degrades to integer > 429 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) > 430 { > 431 #ifndef CONFIG_SLOB > 432 if (__builtin_constant_p(size) && > > 433 size <= KMALLOC_MAX_CACHE_SIZE && !(flags & SLAB_CACHE_DMA)) { > 434 int i = kmalloc_index(size); > 435 flags is of type gfp_t and not a slab internal flag. Therefore use GFP_DMA. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| * | | | | slub: remove verify_mem_not_deleted()Christoph Lameter2013-09-041-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I do not see any user for this code in the tree. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| * | | | | mm/sl[aou]b: Move kmallocXXX functions to common codeChristoph Lameter2013-09-044-266/+124
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kmalloc* functions of all slab allcoators are similar now so lets move them into slab.h. This requires some function naming changes in slob. As a results of this patch there is a common set of functions for all allocators. Also means that kmalloc_large() is now available in general to perform large order allocations that go directly via the page allocator. kmalloc_large() can be substituted if kmalloc() throws warnings because of too large allocations. kmalloc_large() has exactly the same semantics as kmalloc but can only used for allocations > PAGE_SIZE. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
* | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-151-0/+1
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input update from Dmitry Torokhov: "The only change is David Hermann's new EVIOCREVOKE evdev ioctl that allows safely passing file descriptors to input devices to session processes and later being able to stop delivery of events through these fds so that inactive sessions will no longer receive user input that does not belong to them" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: evdev - add EVIOCREVOKE ioctl
| * | | | | | Input: evdev - add EVIOCREVOKE ioctlDavid Herrmann2013-09-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have multiple sessions on a system, we normally don't want background sessions to read input events. Otherwise, it could capture passwords and more entered by the user on the foreground session. This is a real world problem as the recent XMir development showed: http://mjg59.dreamwidth.org/27327.html We currently rely on sessions to release input devices when being deactivated. This relies on trust across sessions. But that's not given on usual systems. We therefore need a way to control which processes have access to input devices. With VTs the kernel simply routed them through the active /dev/ttyX. This is not possible with evdev devices, though. Moreover, we want to avoid routing input-devices through some dispatcher-daemon in userspace (which would add some latency). This patch introduces EVIOCREVOKE. If called on an evdev fd, this revokes device-access irrecoverably for that *single* open-file. Hence, once you call EVIOCREVOKE on any dup()ed fd, all fds for that open-file will be rather useless now (but still valid compared to close()!). This allows us to pass fds directly to session-processes from a trusted source. The source keeps a dup()ed fd and revokes access once the session-process is no longer active. Compared to the EVIOCMUTE proposal, we can avoid the CAP_SYS_ADMIN restriction now as there is no way to revive the fd again. Hence, a user is free to call EVIOCREVOKE themself to kill the fd. Additionally, this ioctl allows multi-layer access-control (again compared to EVIOCMUTE which was limited to one layer via CAP_SYS_ADMIN). A middle layer can simply request a new open-file from the layer above and pass it to the layer below. Now each layer can call EVIOCREVOKE on the fds to revoke access for all layers below, at the expense of one fd per layer. There's already ongoing experimental user-space work which demonstrates how it can be used: http://lists.freedesktop.org/archives/systemd-devel/2013-August/012897.html Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
* | | | | | | Merge tag 'writeback-fixes' of ↵Linus Torvalds2013-09-141-0/+6
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "A trivial writeback fix" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Do not sort b_io list only because of block device inode
| * | | | | | | writeback: Do not sort b_io list only because of block device inodeJan Kara2013-07-091-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is very likely that block device inode will be part of BDI dirty list as well. However it doesn't make sence to sort inodes on the b_io list just because of this inode (as it contains buffers all over the device anyway). So save some CPU cycles which is valuable since we hold relatively contented wb->list_lock. Signed-off-by: Jan Kara <jack@suse.cz>
* | | | | | | | Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds2013-09-134-20/+12
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...
| * | | | | | | | aio: convert the ioctx list to table lookup v3Benjamin LaHaise2013-07-301-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote: > On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote: > > When using a large number of threads performing AIO operations the > > IOCTX list may get a significant number of entries which will cause > > significant overhead. For example, when running this fio script: > > > > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=512; thread; loops=100 > > > > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to > > 30% CPU time spent by lookup_ioctx: > > > > 32.51% [guest.kernel] [g] lookup_ioctx > > 9.19% [guest.kernel] [g] __lock_acquire.isra.28 > > 4.40% [guest.kernel] [g] lock_release > > 4.19% [guest.kernel] [g] sched_clock_local > > 3.86% [guest.kernel] [g] local_clock > > 3.68% [guest.kernel] [g] native_sched_clock > > 3.08% [guest.kernel] [g] sched_clock_cpu > > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11 > > 2.60% [guest.kernel] [g] memcpy > > 2.33% [guest.kernel] [g] lock_acquired > > 2.25% [guest.kernel] [g] lock_acquire > > 1.84% [guest.kernel] [g] do_io_submit > > > > This patchs converts the ioctx list to a radix tree. For a performance > > comparison the above FIO script was run on a 2 sockets 8 core > > machine. This are the results (average and %rsd of 10 runs) for the > > original list based implementation and for the radix tree based > > implementation: > > > > cores 1 2 4 8 16 32 > > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms > > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43% > > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms > > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75% > > % of radix > > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66% > > to list > > > > To consider the impact of the patch on the typical case of having > > only one ctx per process the following FIO script was run: > > > > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1 > > blocksize=1024; numjobs=1; thread; loops=100 > > > > on the same system and the results are the following: > > > > list 58892 ms > > %rsd 0.91% > > radix 59404 ms > > %rsd 0.81% > > % of radix > > relative 100.87% > > to list > > So, I was just doing some benchmarking/profiling to get ready to send > out the aio patches I've got for 3.11 - and it looks like your patch is > causing a ~1.5% throughput regression in my testing :/ ... <snip> I've got an alternate approach for fixing this wart in lookup_ioctx()... Instead of using an rbtree, just use the reserved id in the ring buffer header to index an array pointing the ioctx. It's not finished yet, and it needs to be tidied up, but is most of the way there. -ben -- "Thought is the essence of where you are now." -- kmo> And, a rework of Ben's code, but this was entirely his idea kmo> -Kent bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually free memory. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | aio: Kill ki_dtorKent Overstreet2013-07-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_aio_dtor() is dead code - and stuff that does need to do cleanup can simply do it before calling aio_complete(). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | aio: Kill ki_usersKent Overstreet2013-07-301-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kiocb refcount is only needed for cancellation - to ensure a kiocb isn't freed while a ki_cancel callback is running. But if we restrict ki_cancel callbacks to not block (which they currently don't), we can simply drop the refcount. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | aio: Kill unneeded kiocb membersKent Overstreet2013-07-301-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The old aio retry infrastucture needed to save the various arguments to to aio operations. But with the retry infrastructure gone, we can trim struct kiocb quite a bit. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | aio: Kill aio_rw_vect_retry()Kent Overstreet2013-07-301-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | aio: io_cancel() no longer returns the io_eventKent Overstreet2013-07-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally, io_event() was documented to return the io_event if cancellation succeeded - the io_event wouldn't be delivered via the ring buffer like it normally would. But this isn't what the implementation was actually doing; the only driver implementing cancellation, the usb gadget code, never returned an io_event in its cancel function. And aio_complete() was recently changed to no longer suppress event delivery if the kiocb had been cancelled. This gets rid of the unused io_event argument to kiocb_cancel() and kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if kiocb->ki_cancel() returned success. Also tweak the refcounting in kiocb_cancel() to make more sense. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | fs/aio: Add support to aio ring pages migrationGu Zheng2013-07-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the aio job will pin the ring pages, that will lead to mem migrated failed. In order to fix this problem we use an anon inode to manage the aio ring pages, and setup the migratepage callback in the anon inode's address space, so that when mem migrating the aio ring pages will be moved to other mem node safely. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | | | | | fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()Gu Zheng2013-07-161-0/+3
| | |_|/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new lib function anon_inode_getfile_private(), it creates a new file instance by hooking it up to an anonymous inode, and a dentry that describe the "class" of the file, similar to anon_inode_getfile(), but each file holds a single inode. Furthermore, anyone who wants to create a private anon file will benefit from this change. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
* | | | | | | | Remove GENERIC_HARDIRQ config optionMartin Schwidefsky2013-09-137-144/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | | | | | | | Merge branch 'for-next' of ↵Linus Torvalds2013-09-136-4/+128
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "Lots of activity again this round for I/O performance optimizations (per-cpu IDA pre-allocation for vhost + iscsi/target), and the addition of new fabric independent features to target-core (COMPARE_AND_WRITE + EXTENDED_COPY). The main highlights include: - Support for iscsi-target login multiplexing across individual network portals - Generic Per-cpu IDA logic (kent + akpm + clameter) - Conversion of vhost to use per-cpu IDA pre-allocation for descriptors, SGLs and userspace page pointer list - Conversion of iscsi-target + iser-target to use per-cpu IDA pre-allocation for descriptors - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet) emulation for virtual backend drivers - Add support for generic EXTENDED_COPY (CopyOffload) emulation for virtual backend drivers. - Add support for fast memory registration mode to iser-target (Vu) The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of particular significance, which make us the first and only open source target to support the full set of VAAI primitives. Currently Linux clients are lacking upstream support to actually utilize these primitives. However, with server side support now in place for folks like MKP + ZAB working on the client, this logic once reserved for the highest end of storage arrays, can now be run in VMs on their laptops" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits) target/iscsi: Bump versions to v4.1.0 target: Update copyright ownership/year information to 2013 iscsi-target: Bump default TCP listen backlog to 256 target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out iscsi-target; Bump default CmdSN Depth to 64 iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set iscsi-target: Add thread_set->ts_activate_sem + use common deallocate iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE target: remove unused including <linux/version.h> iser-target: introduce fast memory registration mode (FRWR) iser-target: generalize rdma memory registration and cleanup iser-target: move rdma wr processing to a shared function target: Enable global EXTENDED_COPY setup/release target: Add Third Party Copy (3PC) bit in INQUIRY response target: Enable EXTENDED_COPY setup in spc_parse_cdb target: Add support for EXTENDED_COPY copy offload emulation target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check target: Add global device list for EXTENDED_COPY target: Make helpers non static for EXTENDED_COPY command setup target: Make spc_parse_naa_6h_vendor_specific non static ...
| * | | | | | | | target/iscsi: Bump versions to v4.1.0Nicholas Bellinger2013-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| * | | | | | | | target: Add Third Party Copy (3PC) bit in INQUIRY responseNicholas Bellinger2013-09-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the Third Party Copy (3PC) bit to signal support for EXTENDED_COPY within standard inquiry response data. Also add emulate_3pc device attribute in configfs (enabled by default) to allow the exposure of this bit to be disabled, if necessary. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Roland Dreier <roland@purestorage.com> Cc: Zach Brown <zab@redhat.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add support for EXTENDED_COPY copy offload emulationNicholas Bellinger2013-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for EXTENDED_COPY emulation from SPC-3, that enables full copy offload target support within both a single virtual backend device, and across multiple virtual backend devices. It also functions independent of target fabric, and supports copy offload across multiple target fabric ports. This implemenation supports both EXTENDED_COPY PUSH and PULL models of operation, so the actual CDB may be received on either source or desination logical unit. For Target Descriptors, it currently supports the NAA IEEE Registered Extended designator (type 0xe4), which allows the reference of target ports to occur independent of fabric type using EVPD 0x83 WWNs. For Segment Descriptors, it currently supports copy from block to block (0x02) mode. It also honors any present SCSI reservations of the destination target port. Note that only Supports No List Identifier (SNLID=1) mode is supported. Also included is basic RECEIVE_COPY_RESULTS with service action type OPERATING PARAMETERS (0x03) required for SNLID=1 operation. v3 changes: - Fix incorrect return type in target_do_receive_copy_results() (Fengguang) v2 changes: - Use target_alloc_sgl() instead of transport_generic_get_mem() - Convert debug output to use pr_debug() - Convert target_xcopy_parse_target_descriptors() NAA IEEN WWN dump to use 0x%16phN format specification - Drop unnecessary xcopy_pt_cmd->xpt_passthrough_wsem, and associated usage in xcopy_pt_write_pending() and target_xcopy_issue_pt_cmd() - Add check for unsupported EXTENDED_COPY(LID4) service action bits in target_do_xcopy() Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Roland Dreier <roland@purestorage.com> Cc: Zach Brown <zab@redhat.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add global device list for EXTENDED_COPYNicholas Bellinger2013-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EXTENDED_COPY needs to be able to search a global list of devices based on NAA WWN device identifiers, so add a simple g_device_list protected by g_device_mutex. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Roland Dreier <roland@purestorage.com> Cc: Zach Brown <zab@redhat.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Make helpers non static for EXTENDED_COPY command setupNicholas Bellinger2013-09-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both target_alloc_sgl() and transport_generic_map_mem_to_cmd() are required by EXTENDED_COPY logic when setting up internally dispatched command descriptors, so go ahead and make both of these non static. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Roland Dreier <roland@purestorage.com> Cc: Zach Brown <zab@redhat.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target/tcm_qla2xxx: Add/use target_reverse_dma_direction() in ↵Nicholas Bellinger2013-09-111-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | target_core_fabric.h Reversing the dma_data_direction for pci_map_sg() friends is useful for other drivers, so move it from tcm_qla2xxx into inline code within target_core_fabric.h. Also drop internal usage of equivlient in tcm_qla2xxx fabric code. Reported-by: Christoph Hellwig <hch@lst.de> Cc: Roland Dreier <roland@purestorage.com> Cc: Giridhar Malavali <giridhar.malavali@qlogic.com> Cc: Chad Dupuis <chad.dupuis@qlogic.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add support for COMPARE_AND_WRITE emulationNicholas Bellinger2013-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for COMPARE_AND_WRITE emulation on a per block basis. This logic is used as an atomic test and set primative currently used by VMWare ESX VAAI for performing array side locking of individual VMFS extent ownership. This includes the COMPARE_AND_WRITE CDB parsing within sbc_parse_cdb(), and does the majority of the work within the compare_and_write_callback() to perform the verify instance user data comparision, and subsequent write instance user data I/O submission upon a successfull comparision. The synchronization is enforced by se_device->caw_sem, that is obtained before the initial READ I/O submission in sbc_compare_and_write(). The mutex is then released upon MISCOMPARE in compare_and_write_callback(), or upon WRITE instance user-data completion in compare_and_write_post(). The implementation currently assumes a single logical block (NoLB=1). v4 changes: - Explicitly clear cmd->transport_complete_callback for two failure cases in sbc_compare_and_write() in order to avoid double unlock of ->caw_sem in compare_and_write_callback() (Dan Carpenter) v3 changes: - Convert se_device->caw_mutex to ->caw_sem v2 changes: - Set SCF_COMPARE_AND_WRITE and cmd->execute_cmd() to sbc_compare_and_write() during setup in sbc_parse_cdb() - Use sbc_compare_and_write() for initial READ submission with DMA_FROM_DEVICE - Reset cmd->execute_cmd() to sbc_execute_rw() for write instance user-data in compare_and_write_callback() - Drop SCF_BIDI command flag usage - Set TRANSPORT_PROCESSING + transport_state flags before write instance submission, and convert to __target_execute_cmd() - Prevent sbc_get_size() from being being called twice to generate incorrect size in sbc_parse_cdb() - Enforce se_device->caw_mutex synchronization between initial READ I/O submission, and final WRITE I/O completion. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add MAXIMUM COMPARE AND WRITE LENGTH in Block Limits VPDNicholas Bellinger2013-09-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the MAXIMUM COMPARE AND WRITE LENGTH bit, currently hardcoded to a single logical block (NoLB=1) within the Block Limits VPD in spc_emulate_evpd_b0(). Also add emulate_caw device attribute in configfs (enabled by default) to allow the exposure of this bit to be disabled, if necessary. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Make __target_execute_cmd() available as externNicholas Bellinger2013-09-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Required by COMPARE_AND_WRITE for write instance user-data submission, in order to bypass target_execute_cmd() checks. Reported-by: Christoph Hellwig <hch@lst.de> Cc: Roland Dreier <roland@purestorage.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add transport_reset_sgl_orig() for COMPARE_AND_WRITENicholas Bellinger2013-09-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After COMPARE_AND_WRITE completes it's comparision, the WRITE payload SGLs head expect to be updated to point from the verify instance of user data, to the write instance of user data. So for this special case, add transport_reset_sgl_orig() usage within transport_free_pages() and add se_cmd->t_data_[sg,nents]_orig members to save the original assignments. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Allow sbc_ops->execute_rw() to accept SGLs + data_directionNicholas Bellinger2013-09-092-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | COMPARE_AND_WRITE expects to be able to send down a DMA_FROM_DEVICE to obtain the necessary READ payload for comparision against the first half of the WRITE payload containing the verify user data. Currently virtual backends expect to internally reference SGLs, SGL nents, and data_direction, so change IBLOCK, FILEIO and RD sbc_ops->execute_rw() to accept this values as function parameters. Also add default sbc_execute_rw() handler for the typical case for cmd->execute_rw() submission using cmd->t_data_sg, cmd->t_data_nents, and cmd->data_direction). v2 Changes: - Add SCF_COMPARE_AND_WRITE command flag - Use sbc_execute_rw() for normal cmd->execute_rw() submission with expected se_cmd members. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add TCM_MISCOMPARE_VERIFY sense handlingNicholas Bellinger2013-09-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds TCM_MISCOMPARE_VERIFY (ASC=0x1d, ASCQ=0x00) sense handling to transport_send_check_condition_and_sense(), which is required for a COMPARE_AND_WRITE comparision failure. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | target: Add return for se_cmd->transport_complete_callbackNicholas Bellinger2013-09-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a sense_reason_t return to ->transport_complete_callback(), and updates target_complete_ok_work() to invoke the call if necessary to transport_send_check_condition_and_sense() during the failure case. Also update xdreadwrite_callback() to use this return value. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>
| * | | | | | | | scsi: Add CDB definition for COMPARE_AND_WRITENicholas Bellinger2013-09-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Chris Mason <chris.mason@fusionio.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@daterainc.com>