summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flagChristoph Hellwig2020-03-262-12/+0
| | | | | | | | | | The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem. Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: don't provide a stub for hmm_range_fault()Christoph Hellwig2020-03-261-7/+0
| | | | | | | | | | | All callers of hmm_range_fault depend on CONFIG_HMM_MIRROR, so don't bother with a stub. Link: https://lore.kernel.org/r/20200316135310.899364-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()Jason Gunthorpe2020-03-261-1/+1
| | | | | | | | | | | | | pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not valid flags. It should not unconditionally fault if faulting is not requested. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handlingJason Gunthorpe2020-03-261-7/+12
| | | | | | | | | | | | | | | | | Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses). EFAULT should only be returned after testing with hmm_pte_need_fault(). Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED, pte_special() is stubbed out to return false on unsupported architectures. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pagesJason Gunthorpe2020-03-261-17/+21
| | | | | | | | | | | | | | | | | hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR. hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not. Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte()Jason Gunthorpe2020-03-261-20/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type. In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error. Revise the implementation to be clearer, and fix some bugs: - hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch - -EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail. - For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing. Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULTJason Gunthorpe2020-03-261-11/+8
| | | | | | | | | | | | | All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid. If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR. Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: add missing pfns set to hmm_vma_walk_pmd()Jason Gunthorpe2020-03-261-1/+1
| | | | | | | | | | | | All success exit paths from the walker functions must set the pfns array. A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlockJason Gunthorpe2020-03-261-7/+7
| | | | | | | | | | | | | | | This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first. hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock. Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price <steven.price@arm.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte()Jason Gunthorpe2020-03-261-2/+6
| | | | | | | | | | | | | | Many of the direct returns of error skipped doing the pte_unmap(). All non zero exit paths must unmap the pte. The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
* Linux 5.6-rc3v5.6-rc3Linus Torvalds2020-02-241-1/+1
|
* Merge tag 'for-5.6-rc2-tag' of ↵Linus Torvalds2020-02-237-10/+44
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "These are fixes that were found during testing with help of error injection, plus some other stable material. There's a fixup to patch added to rc1 causing locking in wrong context warnings, tests found one more deadlock scenario. The patches are tagged for stable, two of them now in the queue but we'd like all three released at the same time. I'm not happy about fixes to fixes in such a fast succession during rcs, but I hope we found all the fallouts of commit 28553fa992cb ('Btrfs: fix race between shrinking truncate and fiemap')" * tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents btrfs: fix bytes_may_use underflow in prealloc error condtition btrfs: handle logged extent failure properly btrfs: do not check delayed items are empty for single transaction cleanup btrfs: reset fs_root to NULL on error in open_ctree btrfs: destroy qgroup extent records on transaction abort
| * Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eofFilipe Manana2020-02-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While logging the prealloc extents of an inode during a fast fsync we call btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while holding a read lock on a leaf of the inode's root (not the log root, the fs/subvol root), and then that function locks the file range in the inode's iotree. This can lead to a deadlock when: * the fsync is ranged * the file has prealloc extents beyond eof * writeback for a range different from the fsync range starts during the fsync * the size of the file is not sector size aligned Because when finishing an ordered extent we lock first a file range and then try to COW the fs/subvol tree to insert an extent item. The following diagram shows how the deadlock can happen. CPU 1 CPU 2 btrfs_sync_file() --> for range [0, 1MiB) --> inode has a size of 1MiB and has 1 prealloc extent beyond the i_size, starting at offset 4MiB flushes all delalloc for the range [0MiB, 1MiB) and waits for the respective ordered extents to complete --> before task at CPU 1 locks the inode, a write into file range [1MiB, 2MiB + 1KiB) is made --> i_size is updated to 2MiB + 1KiB --> writeback is started for that range, [1MiB, 2MiB + 4KiB) --> end offset rounded up to be sector size aligned btrfs_log_dentry_safe() btrfs_log_inode_parent() btrfs_log_inode() btrfs_log_changed_extents() btrfs_log_prealloc_extents() --> does a search on the inode's root --> holds a read lock on leaf X btrfs_finish_ordered_io() --> locks range [1MiB, 2MiB + 4KiB) --> end offset rounded up to be sector size aligned --> tries to cow leaf X, through insert_reserved_file_extent() --> already locked by the task at CPU 1 btrfs_truncate_inode_items() --> gets an i_size of 2MiB + 1KiB, which is not sector size aligned --> tries to lock file range [2MiB, (u64)-1) --> the start range is rounded down from 2MiB + 1K to 2MiB to be sector size aligned --> but the subrange [2MiB, 2MiB + 4KiB) is already locked by task at CPU 2 which is waiting to get a write lock on leaf X for which we are holding a read lock *** deadlock *** This results in a stack trace like the following, triggered by test case generic/561 from fstests: [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds. [ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1 [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000 [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [ 2779.990466] Call Trace: [ 2779.990491] ? __schedule+0x384/0xa30 [ 2779.990521] schedule+0x33/0xe0 [ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs] [ 2779.990632] ? remove_wait_queue+0x60/0x60 [ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs] [ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs] [ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs] [ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs] [ 2779.990987] ? mark_held_locks+0x6d/0xc0 [ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100 [ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs] [ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs] [ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs] [ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs] [ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs] [ 2779.991432] process_one_work+0x26d/0x6a0 [ 2779.991460] worker_thread+0x4f/0x3e0 [ 2779.991481] ? process_one_work+0x6a0/0x6a0 [ 2779.991489] kthread+0x103/0x140 [ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70 [ 2779.991515] ret_from_fork+0x3a/0x50 (...) [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds. [ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1 [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2780.030035] fsstress D 0 17375 17373 0x00004000 [ 2780.030038] Call Trace: [ 2780.030044] ? __schedule+0x384/0xa30 [ 2780.030052] schedule+0x33/0xe0 [ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs] [ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs] [ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0 [ 2780.030102] ? remove_wait_queue+0x60/0x60 [ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs] [ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs] [ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs] [ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs] [ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0 [ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs] [ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs] [ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0 [ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] [ 2780.030282] ? dget_parent+0xa1/0x370 [ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs] [ 2780.030339] do_fsync+0x38/0x60 [ 2780.030343] __x64_sys_fdatasync+0x13/0x20 [ 2780.030345] do_syscall_64+0x5c/0x280 [ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0 [ 2780.030361] Code: Bad RIP value. [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0 [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003 [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4 [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90 So fix this by making btrfs_truncate_inode_items() not lock the range in the inode's iotree when the target root is a log root, since it's not needed to lock the range for log roots as the protection from the inode's lock and log_mutex are all that's needed. Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extentsFilipe Manana2020-02-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In btrfs_wait_ordered_range() once we find an ordered extent that has finished with an error we exit the loop and don't wait for any other ordered extents that might be still in progress. All the users of btrfs_wait_ordered_range() expect that there are no more ordered extents in progress after that function returns. So past fixes such like the ones from the two following commits: ff612ba7849964 ("btrfs: fix panic during relocation after ENOSPC before writeback happens") 28aeeac1dd3080 ("Btrfs: fix panic when starting bg cache writeout after IO error") don't work when there are multiple ordered extents in the range. Fix that by making btrfs_wait_ordered_range() wait for all ordered extents even after it finds one that had an error. Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix bytes_may_use underflow in prealloc error condtitionJosef Bacik2020-02-191-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit the following warning while running my error injection stress testing: WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] Call Trace: btrfs_free_reserved_data_space+0x4f/0x70 [btrfs] __btrfs_prealloc_file_range+0x378/0x470 [btrfs] elfcorehdr_read+0x40/0x40 ? elfcorehdr_read+0x40/0x40 ? btrfs_commit_transaction+0xca/0xa50 [btrfs] ? dput+0xb4/0x2a0 ? btrfs_log_dentry_safe+0x55/0x70 [btrfs] ? btrfs_sync_file+0x30e/0x420 [btrfs] ? do_fsync+0x38/0x70 ? __x64_sys_fdatasync+0x13/0x20 ? do_syscall_64+0x5b/0x1b0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens if we fail to insert our reserved file extent. At this point we've already converted our reservation from ->bytes_may_use to ->bytes_reserved. However once we break we will attempt to free everything from [cur_offset, end] from ->bytes_may_use, but our extent reservation will overlap part of this. Fix this problem by adding ins.offset (our extent allocation size) to cur_offset so we remove the actual remaining part from ->bytes_may_use. I validated this fix using my inject-error.py script python inject-error.py -o should_fail_bio -t cache_save_setup -t \ __btrfs_prealloc_file_range \ -t insert_reserved_file_extent.constprop.0 \ -r "-5" ./run-fsstress.sh where run-fsstress.sh simply mounts and runs fsstress on a disk. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle logged extent failure properlyJosef Bacik2020-02-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're allocating a logged extent we attempt to insert an extent record for the file extent directly. We increase space_info->bytes_reserved, because the extent entry addition will call btrfs_update_block_group(), which will convert the ->bytes_reserved to ->bytes_used. However if we fail at any point while inserting the extent entry we will bail and leave space on ->bytes_reserved, which will trigger a WARN_ON() on umount. Fix this by pinning the space if we fail to insert, which is what happens in every other failure case that involves adding the extent entry. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: do not check delayed items are empty for single transaction cleanupJosef Bacik2020-02-191-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_assert_delayed_root_empty() will check if the delayed root is completely empty, but this is a filesystem-wide check. On cleanup we may have allowed other transactions to begin, for whatever reason, and thus the delayed root is not empty. So remove this check from cleanup_one_transation(). This however can stay in btrfs_cleanup_transaction(), because it checks only after all of the transactions have been properly cleaned up, and thus is valid. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: reset fs_root to NULL on error in open_ctreeJosef Bacik2020-02-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running my error injection script I hit a panic when we tried to clean up the fs_root when freeing the fs_root. This is because fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by setting fs_info->fs_root = NULL; if we fail to read the root. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: destroy qgroup extent records on transaction abortJeff Mahoney2020-02-194-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We clean up the delayed references when we abort a transaction but we leave the pending qgroup extent records behind, leaking memory. This patch destroys the extent records when we destroy the delayed refs and makes sure ensure they're gone before releasing the transaction. Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Rebased to latest upstream, remove to_qgroup() helper, use rbtree_postorder_for_each_entry_safe() wrapper ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2020-02-2310-108/+256
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "More miscellaneous ext4 bug fixes (all stable fodder)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix mount failure with quota configured as module jbd2: fix ocfs2 corrupt when clearing block group bits ext4: fix race between writepages and enabling EXT4_EXTENTS_FL ext4: rename s_journal_flag_rwsem to s_writepages_rwsem ext4: fix potential race between s_flex_groups online resizing and access ext4: fix potential race between s_group_info online resizing and access ext4: fix potential race between online resizing and write operations ext4: add cond_resched() to __ext4_find_entry() ext4: fix a data race in EXT4_I(inode)->i_disksize
| * | ext4: fix mount failure with quota configured as moduleJan Kara2020-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_QFMT_V2 is configured as a module, the test in ext4_feature_set_ok() fails and so mount of filesystems with quota or project features fails. Fix the test to use IS_ENABLED macro which works properly even for modules. Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | jbd2: fix ocfs2 corrupt when clearing block group bitswangyan2020-02-221-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found a NULL pointer dereference in ocfs2_block_group_clear_bits(). The running environment: kernel version: 4.19 A cluster with two nodes, 5 luns mounted on two nodes, and do some file operations like dd/fallocate/truncate/rm on every lun with storage network disconnection. The fallocate operation on dm-23-45 caused an null pointer dereference. The information of NULL pointer dereference as follows: [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45. [577992.878290] Aborting journal on device dm-23-45. ... [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46. [577992.890908] __journal_remove_journal_head: freeing b_committed_data [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30 [577992.890918] __journal_remove_journal_head: freeing b_committed_data [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30 [577992.890922] __journal_remove_journal_head: freeing b_committed_data [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30 [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30 [577992.890928] __journal_remove_journal_head: freeing b_committed_data [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30 [577992.890933] __journal_remove_journal_head: freeing b_committed_data [577992.890939] __journal_remove_journal_head: freeing b_committed_data [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [577992.890950] Mem abort info: [577992.890951] ESR = 0x96000004 [577992.890952] Exception class = DABT (current EL), IL = 32 bits [577992.890952] SET = 0, FnV = 0 [577992.890953] EA = 0, S1PTW = 0 [577992.890954] Data abort info: [577992.890955] ISV = 0, ISS = 0x00000004 [577992.890956] CM = 0, WnR = 0 [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9 [577992.890960] [0000000000000020] pgd=0000000000000000 [577992.890964] Internal error: Oops: 96000004 [#1] SMP [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd) [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1 [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO) [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2] [577992.891084] sp : ffff0000c8e2b810 [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000 [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70 [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2 [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30 [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000 [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0 [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174 [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565 [577992.891103] x11: 0000000000000038 x10: 0101010101010101 [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080 [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002 [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00 [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000 [577992.891116] Call trace: [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2] [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2] [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2] [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2] [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2] [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2] [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2] [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2] [577992.891316] vfs_fallocate+0x11c/0x250 [577992.891317] ksys_fallocate+0x54/0x88 [577992.891319] __arm64_sys_fallocate+0x28/0x38 [577992.891323] el0_svc_common+0x78/0x130 [577992.891325] el0_svc_handler+0x38/0x78 [577992.891327] el0_svc+0x8/0xc My analysis process as follows: ocfs2_fallocate __ocfs2_change_file_space ocfs2_allocate_extents ocfs2_extend_allocation ocfs2_add_inode_data ocfs2_add_clusters_in_btree ocfs2_insert_extent ocfs2_do_insert_extent ocfs2_rotate_tree_right ocfs2_extend_rotate_transaction ocfs2_extend_trans jbd2_journal_restart jbd2__journal_restart /* handle->h_transaction is NULL, * is_handle_aborted(handle) is true */ handle->h_transaction = NULL; start_this_handle return -EROFS; ocfs2_free_clusters _ocfs2_free_clusters _ocfs2_free_suballoc_bits ocfs2_block_group_clear_bits ocfs2_journal_access_gd __ocfs2_journal_access jbd2_journal_get_undo_access /* I think jbd2_write_access_granted() will * return true, because do_get_write_access() * will return -EROFS. */ if (jbd2_write_access_granted(...)) return 0; do_get_write_access /* handle->h_transaction is NULL, it will * return -EROFS here, so do_get_write_access() * was not called. */ if (is_handle_aborted(handle)) return -EROFS; /* bh2jh(group_bh) is NULL, caused NULL pointer dereference */ undo_bg = (struct ocfs2_group_desc *) bh2jh(group_bh)->b_committed_data; If handle->h_transaction == NULL, then jbd2_write_access_granted() does not really guarantee that journal_head will stay around, not even speaking of its b_committed_data. The bh2jh(group_bh) can be removed after ocfs2_journal_access_gd() and before call "bh2jh(group_bh)->b_committed_data". So, we should move is_handle_aborted() check from do_get_write_access() into jbd2_journal_get_undo_access() and jbd2_journal_get_write_access() before the call to jbd2_write_access_granted(). Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: fix race between writepages and enabling EXT4_EXTENTS_FLEric Biggers2020-02-222-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: rename s_journal_flag_rwsem to s_writepages_rwsemEric Biggers2020-02-223-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: fix potential race between s_flex_groups online resizing and accessSuraj Jitindar Singh2020-02-225-37/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | ext4: fix potential race between s_group_info online resizing and accessSuraj Jitindar Singh2020-02-212-21/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Balbir Singh <sblbir@amazon.com> Cc: stable@kernel.org
| * | ext4: fix potential race between online resizing and write operationsTheodore Ts'o2020-02-214-25/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu Reported-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | ext4: add cond_resched() to __ext4_find_entry()Shijie Luo2020-02-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: fix a data race in EXT4_I(inode)->i_disksizeQian Cai2020-02-202-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | | Merge tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linuxLinus Torvalds2020-02-2334-164/+663
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull csky updates from Guo Ren: "Sorry, I missed 5.6-rc1 merge window, but in this pull request the most are the fixes and the rests are between fixes and features. The only outside modification is the MAINTAINERS file update with our mailing list. - cache flush implementation fixes - ftrace modify panic fix - CONFIG_SMP boot problem fix - fix pt_regs saving for atomic.S - fix fixaddr_init without highmem. - fix stack protector support - fix fake Tightly-Coupled Memory code compile and use - fix some typos and coding convention" * tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linux: (23 commits) csky: Replace <linux/clk-provider.h> by <linux/of_clk.h> csky: Implement copy_thread_tls csky: Add PCI support csky: Minimize defconfig to support buildroot config.fragment csky: Add setup_initrd check code csky: Cleanup old Kconfig options arch/csky: fix some Kconfig typos csky: Fixup compile warning for three unimplemented syscalls csky: Remove unused cache implementation csky: Fixup ftrace modify panic csky: Add flush_icache_mm to defer flush icache all csky: Optimize abiv2 copy_to_user_page with VM_EXEC csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860) csky: Remove unnecessary flush_icache_* implementation csky: Support icache flush without specific instructions csky/Kconfig: Add Kconfig.platforms to support some drivers csky/smp: Fixup boot failed when CONFIG_SMP csky: Set regs->usp to kernel sp, when the exception is from kernel csky/mm: Fixup export invalid_pte_table symbol csky: Separate fixaddr_init from highmem ...
| * | | csky: Replace <linux/clk-provider.h> by <linux/of_clk.h>Geert Uytterhoeven2020-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The C-Sky platform code is not a clock provider, and just needs to call of_clk_init(). Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Implement copy_thread_tlsGuo Ren2020-02-212-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is required for clone3 which passes the TLS value through a struct rather than a register. Cc: Amanieu d'Antras <amanieu@gmail.com> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Add PCI supportMaJun2020-02-213-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the pci related code for csky arch to support basic pci virtual function, such as qemu virt-pci-9pfs. Signed-off-by: MaJun <majun258@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Minimize defconfig to support buildroot config.fragmentMa Jun2020-02-211-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some bsp (eg: buildroot) has defconfig.fragment design to add more configs into the defconfig in linux source code tree. For example, we could put different cpu configs into different defconfig.fragments, but they all use the same defconfig in Linux. Signed-off-by: Ma Jun <majun258@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Add setup_initrd check codeGuo Ren2020-02-212-3/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should give some necessary check for initrd just like other architectures and it seems that setup_initrd() could be a common code for all architectures. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Cleanup old Kconfig optionsKrzysztof Kozlowski2020-02-212-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_CLKSRC_OF is gone since commit bb0eb050a577 ("clocksource/drivers: Rename CLKSRC_OF to TIMER_OF"). The platform already selects TIMER_OF. CONFIG_HAVE_DMA_API_DEBUG is gone since commit 6e88628d03dd ("dma-debug: remove CONFIG_HAVE_DMA_API_DEBUG"). CONFIG_DEFAULT_DEADLINE is gone since commit f382fb0bcef4 ("block: remove legacy IO schedulers"). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | arch/csky: fix some Kconfig typosRandy Dunlap2020-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix wording in help text for the CPU_HAS_LDSTEX symbol. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Fixup compile warning for three unimplemented syscallsGuo Ren2020-02-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement fstat64, fstatat64, clone3 syscalls to fixup checksyscalls.sh compile warnings. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Remove unused cache implementationGuo Ren2020-02-211-15/+1
| | | | | | | | | | | | | | | | | | | | | | | | Only for coding convention, these codes are unnecessary for abiv2. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Fixup ftrace modify panicGuo Ren2020-02-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During ftrace init, linux will replace all function prologues (call_mcout) with nops, but it need flush_dcache and invalidate_icache to make it work. So flush_cache functions couldn't be nested called by ftrace framework. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Add flush_icache_mm to defer flush icache allGuo Ren2020-02-217-11/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some CPUs don't support icache.va instruction to maintain the whole smp cores' icache. Using icache.all + IPI casue a lot on performace and using defer mechanism could reduce the number of calling icache _flush_all functions. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Optimize abiv2 copy_to_user_page with VM_EXECGuo Ren2020-02-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only when vma is for VM_EXEC, we need sync dcache & icache. eg: - gdb ptrace modify user space instruction code area. Add VM_EXEC condition to reduce unnecessary cache flush. The abiv1 cpus' cache are all VIPT, so we still need to deal with dcache aliasing problem. But there is optimized way to use cache color, just like what's done in arch/csky/abiv1/inc/abi/page.h. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860)Guo Ren2020-02-212-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of flushing cache per update_mmu_cache() called, we use flush_dcache_page to reduce the frequency of flashing the cache. As abiv2 cpus are all PIPT for icache & dcache, we needn't handle dcache aliasing problem. But their icache can't snoop dcache, so we still need sync_icache_dcache in update_mmu_cache(). Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Remove unnecessary flush_icache_* implementationGuo Ren2020-02-213-37/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The abiv2 CPUs are all PIPT cache, so there is no need to implement flush_icache_page function. The function flush_icache_user_range hasn't been used, so just remove it. The function flush_cache_range is not necessary for PIPT cache when tlb mapping changed. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Support icache flush without specific instructionsGuo Ren2020-02-214-8/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some CPUs don't support icache specific instructions to flush icache lines in broadcast way. We use cpu control registers to flush local icache and use IPI to notify other cores. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky/Kconfig: Add Kconfig.platforms to support some driversGuo Ren2020-02-212-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | Such as snps,dw-apb-ictl Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky/smp: Fixup boot failed when CONFIG_SMPGuo Ren2020-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If we use a non-ipi-support interrupt controller, it will cause panic here. We should let cpu up and work with CONFIG_SMP, when we use a non-ipi intc. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky: Set regs->usp to kernel sp, when the exception is from kernelGuo Ren2020-02-213-7/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the past, we didn't care about kernel sp when saving pt_reg. But in some cases, we still need pt_reg->usp to represent the kernel stack before enter exception. For cmpxhg in atomic.S, we need save and restore usp for above. Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
| * | | csky/mm: Fixup export invalid_pte_table symbolGuo Ren2020-02-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no present bit in csky pmd hardware, so we need to prepare invalid_pte_table for empty pmd entry and the functions (pmd_none & pmd_present) in pgtable.h need invalid_pte_talbe to get result. If a module use these functions, we need export the symbol for it. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Cc: Mo Qihui <qihui.mo@verisilicon.com> Cc: Zhange Jian <zhang_jian5@dahuatech.com>
| * | | csky: Separate fixaddr_init from highmemGuo Ren2020-02-217-66/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After fixaddr_init is separated from highmem, we could use tcm without highmem selected. (610 (abiv1) don't support highmem, but it could use tcm now.) Signed-off-by: Guo Ren <guoren@linux.alibaba.com>