Commit message [Author; Age; Files; Lines]
* irqchip/gic-v3: Prefix all pr_* messages by "GICv3: " [Julien Grall; 2016-05-03; 1; -0/+2]
| | | | | | | | Currently, most of the pr_* messages in the GICv3 driver don't have a prefix. Add one to make clear where the messages come from. Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* irqchip/gic-v2: Parse and export virtual GIC information [Julien Grall; 2016-05-03; 4; -1/+124]
| | | | | | | | | | | | | For now, the firmware tables are parsed twice: once in the GIC drivers, and once more when initializing the vGIC. This means code duplication and makes it more tedious to add support for another firmware table (like ACPI). Introduce a new structure and set of helpers to get/set the virtual GIC information. Also fill up the structure for GICv2. Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* irqchip/gic-v2: Gather ACPI specific data in a single structure [Julien Grall; 2016-05-03; 1; -4/+7]
| | | | | | | | | | | | | | The ACPI code needs to use global variables in order to collect information from the tables. For now, a single global variable is used, but more will be added in a subsequent patch. To make clear that they are ACPI specific, gather all the information in a single structure. Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* clocksource: arm_arch_timer: Extend arch_timer_kvm_info to get the virtual IRQ [Julien Grall; 2016-05-03; 2; -0/+3]
| | | | | | | | | | | | | | | Currently, the firmware table is parsed by the virtual timer code in order to retrieve the virtual timer interrupt. However, this is already done by the arch timer driver. To avoid code duplication, extend arch_timer_kvm_info to get the virtual IRQ. Note that the KVM code will be modified in a subsequent patch. Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* clocksource: arm_arch_timer: Gather KVM specific information in a structure [Julien Grall; 2016-05-03; 2; -3/+14]
| | | | | | | | | | | | | | | | | | Introduce a structure which is filled up by the arch timer driver and used by the virtual timer in KVM. The first member of this structure will be the timecounter. More members will be added later. A stub for the new helper isn't introduced because KVM requires the arch timer for both ARM64 and ARM32. The function arch_timer_get_timecounter is kept for the time being and will be dropped in a subsequent patch. Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* arm/arm64: KVM: Enforce Break-Before-Make on Stage-2 page tables [Marc Zyngier; 2016-04-29; 1; -6/+11]
| | | | | | | | | | | | | | The ARM architecture mandates that when changing a page table entry from a valid entry to another valid entry, an invalid entry is first written, the TLB is invalidated, and only then is the new entry written. The current code doesn't respect this, directly writing the new entry and only then invalidating TLBs. Let's fix it up. Cc: <stable@vger.kernel.org> Reported-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
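The ordering described in this commit message can be sketched in user-space C. This is only a toy model under stated assumptions: `toy_mmu`, `toy_tlb_invalidate()` and `toy_set_pte()` are hypothetical names, a single 64-bit word stands in for the page table entry, and bit 0 is assumed to mean "valid". It is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name below is hypothetical, not a kernel API. */
#define TOY_PTE_VALID 1ULL	/* assumption: bit 0 marks an entry as valid */

struct toy_mmu {
	uint64_t pte;			/* the one entry being modelled */
	unsigned int tlb_flushes;	/* TLB invalidations issued so far */
	bool flushed_while_invalid;	/* was the entry invalid at the last flush? */
};

static void toy_tlb_invalidate(struct toy_mmu *mmu)
{
	/* Record the state of the entry at the time of the invalidation. */
	mmu->flushed_while_invalid = !(mmu->pte & TOY_PTE_VALID);
	mmu->tlb_flushes++;
}

/* Replace the entry using the break-before-make order. */
static void toy_set_pte(struct toy_mmu *mmu, uint64_t new_pte)
{
	assert(mmu != NULL);

	if (mmu->pte & TOY_PTE_VALID) {
		mmu->pte = 0;			/* break: write an invalid entry */
		toy_tlb_invalidate(mmu);	/* drop any cached translation */
	}
	mmu->pte = new_pte;			/* make: write the new entry */
}
```

In this sketch an entry that was already invalid is written directly, since there is no cached translation to break.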
* arm64: kvm: Add support for 16K pages [Suzuki K Poulose; 2016-04-21; 2; -3/+11]
| | | | | | | | | | Now that we can handle stage-2 page tables independent of the host page table levels, wire up the 16K page support. Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: Cleanup stage2 pgd handling [Suzuki K Poulose; 2016-04-21; 4; -69/+7]
| | | | | | | | | Now that we don't have any fake page table levels for arm64, cleanup the common code to get rid of the dead code. Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm: arm64: Get rid of fake page table levels [Suzuki K Poulose; 2016-04-21; 4; -95/+172]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On arm64, the hardware supports concatenation of up to 16 tables at the entry level for stage2 translations, and we make use of that whenever possible. This could lead to a reduced number of translation levels compared to the normal (stage1) table. Also, since the IPA (40bit) is smaller than some of the supported VA_BITS (e.g., 48bit), there could be a different number of levels in stage-1 vs stage-2 tables. To reuse the kernel host page table walker for stage2 we have been using a fake software page table level, not known to the hardware. But with 16K translations, there could be up to 2 fake software levels (with 48bit VA and 40bit IPA), which complicates the code. Hence, we want to get rid of the hack. Now that we have explicit accessors for hyp vs stage2 page tables, define the stage2 walker helpers accordingly based on the actual table used by the hardware. Once we know the number of translation levels used by the hardware, it is merely a job of defining the helpers based on whether a particular level is folded or not, looking at the number of levels. Some facts before we calculate the translation levels: 1) Smallest page size supported by arm64 is 4K. 2) The minimum number of bits resolved at any page table level is (PAGE_SHIFT - 3) at intermediate levels. Both of them imply that the minimum number of bits required for a level change is 9. Since we can concatenate up to 16 tables at stage2 entry, the total number of page table levels used by the hardware for resolving N bits is the same as that for (N - 4) bits (with concatenation), as there cannot be a level in between (N, N-4) as per the above rules. Hence, we have STAGE2_PGTABLE_LEVELS = PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4) With the current IPA limit (40bit), for all supported translations and VA_BITS, we have the following condition (even for 36bit VA with 16K page size): CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS. 
So, for example, if PUD is present in stage2, it is present in the hyp(host). Hence, we fall back to the host definition if we find that a level is not folded. Otherwise we redefine it accordingly. A build time check is added to make sure the above condition holds. If this condition breaks in future, we can rearrange the host level helpers and fix our code easily. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
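The level arithmetic in this commit message can be checked with a small C sketch. The helpers `toy_pgtable_levels()` and `toy_stage2_levels()` are hypothetical, and the ceiling-division form of the level count is an assumption derived from the two facts stated in the message (a page offset of PAGE_SHIFT bits, and PAGE_SHIFT - 3 bits resolved per level); it is not copied from the kernel headers.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of levels needed to resolve `bits` address
 * bits when the page offset takes `page_shift` bits and every level
 * resolves (page_shift - 3) bits. Implemented as a ceiling division.
 */
static unsigned int toy_pgtable_levels(unsigned int bits, unsigned int page_shift)
{
	unsigned int per_level = page_shift - 3;

	assert(page_shift > 3 && bits > page_shift);
	return (bits - page_shift + per_level - 1) / per_level;
}

/*
 * Up to 16 concatenated entry-level tables absorb 4 address bits, so the
 * stage2 level count for an IPA of `ipa_bits` matches that of ipa_bits - 4.
 */
static unsigned int toy_stage2_levels(unsigned int ipa_bits, unsigned int page_shift)
{
	return toy_pgtable_levels(ipa_bits - 4, page_shift);
}
```

With a 40bit IPA this gives 3 stage2 levels for 4K pages (page shift 12) and 2 for 16K pages (page shift 14). A 48bit VA needs 4 levels for both page sizes, which reproduces the "up to 2 fake software levels" case for 16K, and a 36bit VA with 16K pages needs 2 levels, so the condition CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS still holds there.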
* kvm-arm: Cleanup kvm_* wrappers [Suzuki K Poulose; 2016-04-21; 3; -48/+1]
| | | | | | | | | | | | Now that we have switched to explicit page table routines, get rid of the obsolete kvm_* wrappers. Also, kvm_tlb_flush_vmid_by_ipa is now called only on stage2 page tables, hence get rid of the redundant check. Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: Add stage2 page table modifiers [Suzuki K Poulose; 2016-04-21; 1; -53/+44]
| | | | | | | | | | | | Now that the hyp page table is handled by different set of routines, rename the original shared routines to stage2 handlers. Also make explicit use of the stage2 page table helpers. unmap_range has been merged to existing unmap_stage2_range. Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: Add explicit hyp page table modifiers [Suzuki K Poulose; 2016-04-21; 1; -5/+99]
| | | | | | | | | | | | | | | | | | We have common routines to modify hyp and stage2 page tables based on the 'kvm' parameter. For a smoother transition to using separate routines for each, duplicate the routines and modify the copy to work on hyp. Marks the forked routines with _hyp_ and gets rid of the kvm parameter, which is no longer needed and is NULL for hyp. Also gets rid of the calls to kvm_tlb_flush_by_vmid_ipa() from the hyp versions. Uses explicit host page table accessors instead of the kvm_* page table helpers. Suggested-by: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: Use explicit stage2 helper routines [Suzuki K Poulose; 2016-04-21; 1; -24/+24]
| | | | | | | | | We have stage2 page table helpers for both arm and arm64. Switch to the stage2 helpers for routines that only deal with stage2 page table. Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: arm64: Introduce hyp page table empty checks [Suzuki K Poulose; 2016-04-21; 1; -0/+14]
| | | | | | | | | | Introduce hyp_pxx_table_empty helpers for checking whether a given table entry is empty. This will be used explicitly once we switch to explicit routines for hyp page table walk. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: arm64: Introduce stage2 page table helpers [Suzuki K Poulose; 2016-04-21; 2; -27/+88]
| | | | | | | | | | | Introduce stage2 page table helpers for arm64. With the fake page table level still in place, the stage2 table has the same number of levels as that of the host (and hyp), so they all fallback to the host version. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: arm: Introduce hyp page table empty checks [Suzuki K Poulose; 2016-04-21; 1; -1/+5]
| | | | | | | | | | | Introduce hyp_pxx_table_empty helpers for checking whether a given table entry is empty. This will be used explicitly once we switch to explicit routines for hyp page table walk. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* kvm-arm: arm32: Introduce stage2 page table helpers [Suzuki K Poulose; 2016-04-21; 2; -0/+62]
| | | | | | | | | | | | | | | Define the page table helpers for walking the stage2 pagetable for arm. Since both hyp and stage2 have the same number of levels as the host, we reuse the host helpers. The exceptions are the p.d_addr_end routines which have to deal with IPA > 32bit, hence we use the open coded version of their host helpers which supports 64bit. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* kvm-arm: Remove kvm_pud_huge() [Suzuki K Poulose; 2016-04-21; 1; -3/+1]
| | | | | | | | | Get rid of kvm_pud_huge() which falls back to pud_huge. Use pud_huge instead. Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm-arm: Replace kvm_pmd_huge with pmd_thp_or_huge [Suzuki K Poulose; 2016-04-21; 1; -9/+8]
| | | | | | | | | | | | Both arm and arm64 now provide a helper, pmd_thp_or_huge(), to check if the given pmd represents a huge page. Use that instead of our own custom check. Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* arm64: Introduce pmd_thp_or_huge [Suzuki K Poulose; 2016-04-21; 1; -0/+2]
| | | | | | | | | | | | | | | Add a helper to determine if a given pmd represents a huge page either by hugetlb or thp, as we have for arm. This will be used by KVM MMU code. Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* kvm arm: Move fake PGD handling to arch specific files [Suzuki K Poulose; 2016-04-21; 3; -42/+59]
| | | | | | | | | | Rearrange the code for fake pgd handling, which is applicable only for arm64. This will later be removed once we introduce the stage2 page table walker macros. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* arm64: Cleanup VTCR_EL2 and VTTBR field values [Suzuki K Poulose; 2016-04-21; 1; -10/+12]
| | | | | | | | | | | | | We share most of the bits for VTCR_EL2 for different page sizes, except for the TG0 value and the entry level value. This patch makes the definitions a bit cleaner to reflect this fact. Also cleans up the VTTBR_X calculation. No functional changes. Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* arm64: Reuse TCR field definitions for EL1 and EL2 [Suzuki K Poulose; 2016-04-21; 2; -40/+88]
| | | | | | | | | | | | | | | | | TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions (TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions. This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2 and VTCR_EL2 fields. This also fixes a bug where we assume TG0 in {V}TCR_EL2 is 1bit field. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
* arm64: KVM: unregister notifiers in hyp mode teardown path [Sudeep Holla; 2016-04-06; 1; -3/+10]
| | | | | | | | | | | | | | | | | | Commit 1e947bad0b63 ("arm64: KVM: Skip HYP setup when already running in HYP") re-organized the hyp init code and ended up leaving the CPU hotplug and PM notifier even if hyp mode initialization fails. Since KVM is not yet supported with ACPI, the above mentioned commit breaks CPU hotplug in ACPI boot. This patch fixes teardown_hyp_mode to properly unregister both CPU hotplug and PM notifiers in the teardown path. Fixes: 1e947bad0b63 ("arm64: KVM: Skip HYP setup when already running in HYP") Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* arm64: KVM: Warn when PARange is less than 40 bits [Marc Zyngier; 2016-04-06; 4; -10/+44]
| | | | | | | | | | | | | | | | | | | | | We always thought that 40bits of PA range would be the minimum people would actually build. Anything less is terrifyingly small. Turns out that we were both right and wrong. Nobody has ever built such a system, but the ARM Foundation Model has a PARange set to 36bits. Just because we can. Oh well. Now, the KVM API explicitly says that we offer a 40bit PA space to the VM, so we shouldn't run KVM on the Foundation Model at all. That being said, this patch offers a less aggressive alternative, and loudly warns about the configuration being unsupported. You'll still be able to run VMs (at your own risk, though). This is just a workaround until we have a proper userspace API where we report the PARange to userspace. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* KVM: arm/arm64: Handle forward time correction gracefully [Marc Zyngier; 2016-04-06; 1; -10/+39]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a host that runs NTP, corrections can have a direct impact on the background timer that we program on behalf of a vcpu. In particular, NTP performing a forward correction will result in a timer expiring sooner than expected from a guest point of view. Not a big deal, we kick the vcpu anyway. But on wake-up, the vcpu thread is going to perform a check to find out whether or not it should block. And at that point, the timer check is going to say "timer has not expired yet, go back to sleep". This results in the timer event being lost forever. There are multiple ways to handle this. One would be to record that the timer has expired and let kvm_cpu_has_pending_timer return true in that case, but that would be fairly invasive. Another is to check for the "short sleep" condition in the hrtimer callback, and restart the timer for the remaining time when the condition is detected. This patch implements the latter, with a bit of refactoring in order to avoid too much code duplication. Cc: <stable@vger.kernel.org> Reported-by: Alexander Graf <agraf@suse.de> Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
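The "short sleep" check chosen by this commit can be illustrated with a toy callback in plain C. This is a sketch under assumptions: `toy_vtimer`, `toy_ns_until_expiry()` and `toy_timer_expired()` are hypothetical names, time is a bare nanosecond counter, and the real patch works on hrtimers and kernel structures that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: all names are hypothetical, not kernel APIs. */
enum toy_restart { TOY_NORESTART, TOY_RESTART };

struct toy_vtimer {
	uint64_t guest_expiry_ns;	/* when the guest expects the timer to fire */
	uint64_t rearm_delay_ns;	/* delay used when the timer is restarted */
	bool vcpu_kicked;		/* set once the vcpu has been woken up */
};

/* Time left until expiry from the guest's point of view, 0 if expired. */
static uint64_t toy_ns_until_expiry(const struct toy_vtimer *t, uint64_t guest_now_ns)
{
	return t->guest_expiry_ns > guest_now_ns ?
	       t->guest_expiry_ns - guest_now_ns : 0;
}

/* Background timer callback: restart on a short sleep, kick otherwise. */
static enum toy_restart toy_timer_expired(struct toy_vtimer *t, uint64_t guest_now_ns)
{
	uint64_t left;

	assert(t != NULL);

	left = toy_ns_until_expiry(t, guest_now_ns);
	if (left) {
		/* Fired early for the guest: sleep for the remaining time. */
		t->rearm_delay_ns = left;
		return TOY_RESTART;
	}

	t->vcpu_kicked = true;		/* genuinely expired: wake the vcpu */
	return TOY_NORESTART;
}
```

In this model the vcpu is only woken once the guest-visible deadline has really passed, so its own "has the timer expired?" check cannot send it back to sleep with the event lost.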
* arm64: KVM: Add braces to multi-line if statement in virtual PMU code [Will Deacon; 2016-04-01; 1; -1/+2]
| | | | | | | | | | | | | | | | | | | | The kernel is written in C, not python, so we need braces around multi-line if statements. GCC 6 actually warns about this, thanks to the fantastic new "-Wmisleading-indentation" flag: | virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’: | virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation] | reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0); | ^~~ | arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not | if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) | ^~ As it turns out, this particular case is harmless (we just do some &= operations with 0), but worth fixing nonetheless. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
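A minimal sketch of the braced form follows, using a hypothetical `toy_pmu_regs` structure in place of the vcpu system registers and an assumed enable bit `TOY_PMCR_E`. It also shows why the unbraced version was harmless: `reg` starts at 0, so masking it unconditionally still yields 0 when the enable bit is clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names and bit positions are hypothetical. */
#define TOY_PMCR_E 1ULL		/* assumed "counters enabled" bit */

struct toy_pmu_regs {
	uint64_t pmcr;		/* control register */
	uint64_t ovsset;	/* overflow status */
	uint64_t cntenset;	/* counter enable */
	uint64_t intenset;	/* interrupt enable */
};

static uint64_t toy_overflow_status(const struct toy_pmu_regs *regs)
{
	uint64_t reg = 0;

	assert(regs != NULL);

	/* Braces make it explicit that all three statements are guarded. */
	if (regs->pmcr & TOY_PMCR_E) {
		reg = regs->ovsset;
		reg &= regs->cntenset;
		reg &= regs->intenset;
	}

	return reg;
}
```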
* arm64: KVM: Register CPU notifiers when the kernel runs at HYP [James Morse; 2016-03-31; 1; -19/+33]
| | | | | | | | | | | | | | | | | When the kernel is running at EL2, it doesn't need init_hyp_mode() to configure page tables for HYP. This function also registers the CPU hotplug and lower power notifiers that cause HYP to be re-initialised after the CPU has been reset. To avoid losing the register state that controls stage2 translation, move the registering of these notifiers into init_subsystems(), and add a is_kernel_in_hyp_mode() path to each callback. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Fixes: 1e947bad0b6 ("arm64: KVM: Skip HYP setup when already running in HYP") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
* arm64: kvm: 4.6-rc1: Fix VTCR_EL2 VS setting [Suzuki K Poulose; 2016-03-30; 3; -3/+10]
| | | | | | | | | | | | | | | | When we detect support for 16bit VMID in ID_AA64MMFR1, we set the VTCR_EL2_VS field to 1 to make use of 16bit vmids. But, with commit 3a3604bc5eb4 ("arm64: KVM: Switch to C-based stage2 init") this is broken and we corrupt VTCR_EL2:T0SZ instead of updating the VS field. VTCR_EL2_VS was actually defined to the field shift (19) and not the real value for VS. This patch fixes the issue. Fixes: commit 3a3604bc5eb4 ("arm64: KVM: Switch to C-based stage2 init") Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
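The shift-versus-value mix-up described in this commit message can be reproduced in a few lines of C. The macro and function names are hypothetical; the shift of 19 comes from the message, and T0SZ occupying the low six bits of the register is the only other assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: macro names are hypothetical. */
#define TOY_VTCR_VS_SHIFT	19				/* bit position of VS */
#define TOY_VTCR_VS_16BIT	(1ULL << TOY_VTCR_VS_SHIFT)	/* value selecting 16bit VMIDs */
#define TOY_VTCR_T0SZ_MASK	0x3fULL				/* T0SZ lives in bits [5:0] */

/* The shift and the field value must not be confused with each other. */
static_assert(TOY_VTCR_VS_16BIT != TOY_VTCR_VS_SHIFT, "VS value must differ from its shift");

/* Buggy form: ORs in the shift (19 == 0b10011), which lands in T0SZ. */
static uint64_t toy_enable_16bit_vmid_buggy(uint64_t vtcr)
{
	return vtcr | TOY_VTCR_VS_SHIFT;
}

/* Fixed form: ORs in the actual field value, i.e. bit 19. */
static uint64_t toy_enable_16bit_vmid(uint64_t vtcr)
{
	return vtcr | TOY_VTCR_VS_16BIT;
}
```

Starting from an example T0SZ of 24, the buggy helper turns it into 27 and leaves bit 19 clear, while the fixed helper sets bit 19 and leaves T0SZ untouched.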
* Linux 4.6-rc1 (tag: v4.6-rc1) [Linus Torvalds; 2016-03-27; 1; -2/+2]
|
* Merge branch 'for-linus' of ↵ [Linus Torvalds; 2016-03-26; 22; -519/+811]
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...
| * libceph: use KMEM_CACHE macro [Geliang Tang; 2016-03-25; 1; -8/+2]
| | | | | | | | | | | | | | Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: use kmem_cache_zalloc [Geliang Tang; 2016-03-25; 2; -2/+2]
| | | | | | | | | | | | | | Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * rbd: use KMEM_CACHE macro [Geliang Tang; 2016-03-25; 1; -8/+2]
| | | | | | | | | | | | | | Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: use lookup request to revalidate dentry [Yan, Zheng; 2016-03-25; 2; -0/+35]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If a dentry has no lease, ceph_d_revalidate() previously returned 0. This causes the VFS to invalidate the dentry and create a new dentry for later lookup. Invalidating a dentry also detaches any mount points underneath it. So a mount point inside cephfs can disappear mysteriously (even if the mount point is not modified by other hosts). The fix is to use a lookup request to revalidate a dentry without a lease. This partly solves the disappearing mount point issue (as long as the mount point is not modified by other hosts). Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: kill ceph_get_dentry_parent_inode() [Yan, Zheng; 2016-03-25; 2; -20/+5]
| | | | | | | | | | | | use vfs helper dget_parent() instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix security xattr deadlock [Yan, Zheng; 2016-03-25; 8; -11/+125]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When security is enabled, the security module can call the filesystem's getxattr/setxattr callbacks during d_instantiate(). For cephfs, d_instantiate() is usually called by the MDS dispatch thread while handling an MDS reply. If the MDS reply does not include xattrs and the corresponding caps, getxattr/setxattr needs to send a new request to the MDS and wait for the reply. This makes the MDS dispatch thread sleep, so nobody handles later MDS replies. The fix is to make sure the lookup/atomic_open reply includes xattrs and the corresponding caps, so getxattr can be handled by cached xattrs. This requires some modification to both the MDS and the request message. (The client tells the MDS what caps it wants; the MDS encodes the proper caps in the reply.) The Smack security module may call setxattr during d_instantiate(). Unlike getxattr, we can't force the MDS to issue CEPH_CAP_XATTR_EXCL to us. So just make setxattr return an error when called by the MDS dispatch thread. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: don't request vxattrs from MDS [Yan, Zheng; 2016-03-25; 1; -2/+4]
| | | | | | It's useless because the MDS reply does not carry any vxattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix mounting same fs multiple times [Yan, Zheng; 2016-03-25; 1; -18/+15]
| | | | | | | Now __ceph_open_session() only accepts a closed client. An opened client will trigger BUG_ON(). Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: remove unnecessary NULL check [Yan, Zheng; 2016-03-25; 1; -2/+2]
| | | | | | | | | | | | | | | | If page->mapping is NULL, releasepage() callback does not get called. Remove the unnecessary NULL check to make static code analysis tool happy Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: avoid updating directory inode's i_size accidentally [Yan, Zheng; 2016-03-25; 1; -0/+4]
| | | | | | | | | | | | Directory inode's i_size is used by readdir cache. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix race during filling readdir cache [Yan, Zheng; 2016-03-25; 1; -2/+7]
| | | | | | | | | | | | | | | | | | Readdir cache uses page cache to save dentry pointers. When adding dentry pointers to middle of a page, we need to make sure the page already exists. Otherwise the beginning part of the page will be invalid pointers. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * libceph: use sizeof_footer() more [Ilya Dryomov; 2016-03-25; 1; -16/+3]
| | | | | | | | | | | | | | | | | | Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * ceph: kill ceph_empty_snapc [Ilya Dryomov; 2016-03-25; 4; -34/+6]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ceph_empty_snapc->num_snaps == 0 at all times. Passing such a snapc to ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only for sizing the request message. Further, in all four cases the subsequent ceph_osdc_build_request() is passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps and making ceph_empty_snapc entirely useless. The two cases where it actually mattered were removed in commits 860560904962 ("ceph: avoid sending unnessesary FLUSHSNAP message") and 23078637e054 ("ceph: fix queuing inode to mdsdir's snaprealm"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix a wrong comparison [Anton Protopopov; 2016-03-25; 1; -1/+1]
| | | | | | | | A negative value rc was compared to the positive value ENOENT in the finish_read() function. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
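The sign mismatch can be shown with a small C helper. This is a sketch under assumptions: `toy_is_real_read_error()` is a hypothetical function, `rc` is assumed to follow the kernel convention of 0 or a negative errno, and the idea that a missing object should not count as an error is a guess at the intent, since the commit message does not quote the surrounding code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper. rc is 0 on success or a negative errno value.
 * Comparing a negative rc against the positive ENOENT (rc != ENOENT)
 * is always true, so the check has to use -ENOENT instead.
 */
static bool toy_is_real_read_error(int rc)
{
	assert(rc <= 0);

	return rc < 0 && rc != -ENOENT;
}
```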
| * ceph: replace CURRENT_TIME by current_fs_time() [Deepa Dinamani; 2016-03-25; 4; -6/+6]
| | | | | | | | | | | | | | | | | | CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: scattered page writeback [Yan, Zheng; 2016-03-25; 1; -109/+196]
| | | | | | | | | | This patch makes ceph_writepages_start() try using a single OSD request to write all dirty pages within a stripe unit. When a nonconsecutive dirty page is found, ceph_writepages_start() tries to start a new write operation in the existing OSD request. If it succeeds, it uses the new operation to write back the dirty page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * libceph: add helper that duplicates last extent operation [Yan, Zheng; 2016-03-25; 2; -0/+24]
| | | | | | | | | | This helper duplicates the last extent operation in an OSD request, then adjusts the new extent operation's offset and length. The helper is for scattered page writeback, which adds nonconsecutive dirty pages to a single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: enable large, variable-sized OSD requests [Ilya Dryomov; 2016-03-25; 3; -19/+32]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
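The flexible array member technique mentioned in this commit message can be sketched in user-space C. The structures `toy_op` and `toy_request`, the limit `TOY_MAX_OPS` and the allocator `toy_request_alloc()` are hypothetical stand-ins; the real code chooses between a slab cache and kmalloc(), which this sketch replaces with a single calloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: all names are hypothetical, not libceph APIs. */
#define TOY_MAX_OPS 16U		/* mirrors the "up to 16 ops" limit */

struct toy_op {
	uint32_t opcode;
	uint64_t offset;
	uint64_t length;
};

struct toy_request {
	unsigned int num_ops;
	struct toy_op ops[];	/* flexible array member, sized at allocation */
};

/* Allocate a zeroed request with room for exactly num_ops operations. */
static struct toy_request *toy_request_alloc(unsigned int num_ops)
{
	struct toy_request *req;

	if (num_ops == 0 || num_ops > TOY_MAX_OPS)
		return NULL;

	req = calloc(1, sizeof(*req) + num_ops * sizeof(req->ops[0]));
	if (req)
		req->num_ops = num_ops;

	return req;
}
```

The caller releases the request with free(); because the header and the operations live in one allocation, a small request costs no more memory than it needs.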
| * libceph: osdc->req_mempool should be backed by a slab pool [Ilya Dryomov; 2016-03-25; 1; -2/+2]
| | | | | | | | | | | | | | | | ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>