summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/ordered-data.h (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: fix race setting block group readonly during device replaceFilipe Manana2016-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we do a device replace, for each device extent we find from the source device, we set the corresponding block group to readonly mode to prevent writes into it from happening while we are copying the device extent from the source to the target device. However just before we set the block group to readonly mode some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process to miss copying an extent since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group it does set the left cursor to the logical end address of the block group - this is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course). Example: CPU 1 CPU 2 btrfs_dev_replace_start() btrfs_scrub_dev() scrub_enumerate_chunks() --> finds device extent belonging to block group X <transaction N starts> starts buffered write against some inode writepages is run against that inode forcing dellaloc to run btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() run_delalloc_range() cow_file_range() btrfs_reserve_extent() --> allocates an extent from block group X (which is not yet in RO mode) btrfs_add_ordered_extent() --> creates ordered extent Y flush_epd_write_bio() --> bio against the extent from block group X is submitted btrfs_inc_block_group_ro(bg X) --> sets block group X to readonly scrub_chunk(bg X) scrub_stripe(device extent from srcdev) --> keeps searching for extent items belonging to the block group using the extent tree's commit root --> it never blocks due to fs_info->scrub_pause_req as no one tries to commit transaction N --> copies all extents found from the source device into the target device --> finishes search loop bio completes ordered extent Y completes and creates delayed data reference which will add an extent item to the extent tree when run (typically at transaction commit time) --> so the task doing the scrub/device replace at CPU 1 misses this and does not copy this extent into the new/target device btrfs_dec_block_group_ro(bg X) --> turns block group X back to RW mode dev_replace->cursor_left is set to the logical end offset of block group X So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
* Merge branch 'cleanups-4.7' into for-chris-4.7-20160525David Sterba2016-05-251-1/+1
|\
| * btrfs: fix string and comment grammatical issues and typosNicholas D Steeves2016-05-251-1/+1
| | | | | | | | | | Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: don't wait for unrelated IO to finish before relocationFilipe Manana2016-05-131-2/+4
|/ | | | | | | | | | | | | | | Before the relocation process of a block group starts, it sets the block group to readonly mode, then flushes all delalloc writes and then finally it waits for all ordered extents to complete. This last step includes waiting for ordered extents destinated at extents allocated in other block groups, making us waste unecessary time. So improve this by waiting only for ordered extents that fall into the block group's range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
* Btrfs: change how we wait for pending ordered extentsJosef Bacik2015-10-221-0/+2
| | | | | | | | | | | | | | | | | | We have a mechanism to make sure we don't lose updates for ordered extents that were logged in the transaction that is currently running. We add the ordered extent to a transaction list and then the transaction waits on all the ordered extents in that list. However are substantially large file systems this list can be extremely large, and can give us soft lockups, since the ordered extents don't remove themselves from the list when they do complete. To fix this we simply add a counter to the transaction that is incremented any time we have a logged extent that needs to be completed in the current transaction. Then when the ordered extent finally completes it decrements the per transaction counter and wakes up the transaction if we are the last ones. This will eliminate the softlockup. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: avoid syncing log in the fast fsync path when not necessaryFilipe Manana2015-06-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path") added a performance regression for that causes an unnecessary sync of the log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done against a file, without no writes or any metadata updates to the inode in between them and if a transaction is committed before the second fsync is called. Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99) after a test sysbench test that measured a -62% decrease of file io requests per second for that tests' workload. The test is: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor mkfs -t btrfs /dev/sda2 mount -t btrfs /dev/sda2 /fs/sda2 cd /fs/sda2 for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \ --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \ --file-num=1024 run A test on kvm guest, running a debug kernel gave me the following results: Without 3a8b36f378060d: 16.01 reqs/sec With 3a8b36f378060d: 3.39 reqs/sec With 3a8b36f378060d and this patch: 16.04 reqs/sec Reported-by: Huang Ying <ying.huang@intel.com> Tested-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: remove csum_bytes_leftLiu Bo2015-06-031-3/+0
| | | | | | | | | | | | After commit 8407f553268a ("Btrfs: fix data corruption after fast fsync and writeback error"), during wait_ordered_extents(), we wait for ordered extent setting BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've already got checksum information, so we don't need to check (csum_bytes_left == 0) in the whole logging path. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: collect only the necessary ordered extents on ranged fsyncFilipe Manana2014-11-211-1/+3
| | | | | | | | | Instead of collecting all ordered extents from the inode's ordered tree and then wait for all of them to complete, just collect the ones that overlap the fsync range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: make sure logged extents complete in the current transaction V3Josef Bacik2014-11-211-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | Liu Bo pointed out that my previous fix would lose the generation update in the scenario I described. It is actually much worse than that, we could lose the entire extent if we lose power right after the transaction commits. Consider the following write extent 0-4k log extent in log tree commit transaction < power fail happens here ordered extent completes We would lose the 0-4k extent because it hasn't updated the actual fs tree, and the transaction commit will reset the log so it isn't replayed. If we lose power before the transaction commit we are save, otherwise we are not. Fix this by keeping track of all extents we logged in this transaction. Then when we go to commit the transaction make sure we wait for all of those ordered extents to complete before proceeding. This will make sure that if we lose power after the transaction commit we still have our data. This also fixes the problem of the improperly updated extent generation. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: disable strict file flushes for renames and truncatesChris Mason2014-08-151-5/+0
| | | | | | | | | | | | | | | | | | | | | | Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: Cleanup the "_struct" suffix in btrfs_workequeueQu Wenruo2014-03-101-2/+2
| | | | | | | | | | | | | | Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
* btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.Qu Wenruo2014-03-101-1/+1
| | | | | | | | | Replace the fs_info->endio_* workqueues with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
* btrfs: Replace fs_info->flush_workers with btrfs_workqueue.Qu Wenruo2014-03-101-1/+1
| | | | | | | | | Replace the fs_info->submit_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
* Btrfs: don't mix the ordered extents of all files together during logging ↵Miao Xie2014-03-101-1/+5
| | | | | | | | | | | | | | | | | | the inodes There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the log list including those ordered extents that were logged successfully, it would make the log committer not to wait for the completion of the ordered extents. This patch doesn't insert the ordered extents that is about to be logged into a global list, instead, we insert them into a local list. If we log the ordered extents successfully, we splice them with the global list, or we will throw them away, then do full sync. It can also reduce the lock contention and the traverse time of list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
* Btrfs: don't wait for the completion of all the ordered extentsMiao Xie2013-11-121-2/+2
| | | | | | | | | | | It is very likely that there are lots of ordered extents in the filesytem, if we wait for the completion of all of them when we want to reclaim some space for the metadata space reservation, we would be blocked for a long time. The performance would drop down suddenly for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: return an error from btrfs_wait_ordered_rangeJosef Bacik2013-11-121-1/+1
| | | | | | | | | | | I noticed that if the free space cache has an error writing out it's data it won't actually error out, it will just carry on. This is because it doesn't check the return value of btrfs_wait_ordered_range, which didn't actually return anything. So fix this in order to keep us from making free space cache look valid when it really isnt. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: kill delay_iput arg to the wait_ordered functionsJosef Bacik2013-09-211-3/+2
| | | | | | | | | | | This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: allow partial ordered extent completionJosef Bacik2013-09-011-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | We currently have this problem where you can truncate pages that have not yet been written for an ordered extent. We do this because the truncate will be coming behind to clean us up anyway so what's the harm right? Well if truncate fails for whatever reason we leave an orphan item around for the file to be cleaned up later. But if the user goes and truncates up the file and tries to read from the area that had been discarded previously they will get a csum error because we never actually wrote that data out. This patch fixes this by allowing us to either discard the ordered extent completely, by which I mean we just free up the space we had allocated and not add the file extent, or adjust the length of the file extent we write. We do this by setting the length we truncated down to in the ordered extent, and then we set the file extent length and ram bytes to this length. The total disk space stays unchanged since we may be compressed and we can't just chop off the disk space, but at least this way the file extent only points to the valid data. Then when the file extent is free'd the extent and csums will be freed normally. This patch is needed for the next series which will give us more graceful recovery of failed truncates. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: remove btrfs_sector_sum structureMiao Xie2013-07-021-20/+5
| | | | | | | | | | | | | | | | | | | | Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can find out the expected checksums by btrfs_ordered_sum's bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After removing bytenr, there is only one member in the structure, so it makes no sense to keep the structure, just remove it, and use a u32 array to store the checksum value. By this change, we don't use the while loop to get the checksums one by one. Now, we can get several checksum value at one time, it improved the performance by ~74% on my SSD (31MB/s -> 54MB/s). test command: # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: introduce per-subvolume ordered extent listMiao Xie2013-06-141-0/+2
| | | | | | | | The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: improve the performance of the csums lookupMiao Xie2013-05-061-1/+2
| | | | | | | | | | | | | | It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem by getting the csums in batch. According to the result of the following test, the execute time of __btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us). # dd if=<mnt>/file of=/dev/null bs=1M count=1024 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Merge branch 'master' of ↵Chris Mason2013-02-201-1/+13
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/disk-io.c
| * Btrfs: place ordered operations on a per transaction listJosef Bacik2013-02-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Miao made the ordered operations stuff run async, which introduced a deadlock where we could get somebody (sync) racing in and committing the transaction while a commit was already happening. The new committer would try and flush ordered operations which would hang waiting for the commit to finish because it is done asynchronously and no longer inherits the callers trans handle. To fix this we need to make the ordered operations list a per transaction list. We can get new inodes added to the ordered operation list by truncating them and then having another process writing to them, so this makes it so that anybody trying to add an ordered operation _must_ start a transaction in order to add itself to the list, which will keep new inodes from getting added to the ordered operations list after we start committing. This should fix the deadlock and also keeps us from doing a lot more work than we need to during commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: wait on ordered extents at the last possible momentJosef Bacik2013-02-201-0/+11
| | | | | | | | | | | | | | | | | | | | | | Since we don't actually copy the extent information from the source tree in the fast case we don't need to wait for ordered io to be completed in order to fsync, we just need to wait for the io to be completed. So when we're logging our file just attach all of the ordered extents to the log, and then when the log syncs just wait for IO_DONE on the ordered extents and then write the super. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2012-12-181-2/+5
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "A big set of fixes and features. In terms of line count, most of the code comes from Stefan, who added the ability to replace a single drive in place. This is different from how btrfs normally replaces drives, and is much much much faster. Josef is plowing through our synchronous write performance. This pull request does not include the DIO_OWN_WAITING patch that was discussed on the list, but it has a number of other improvements to cut down our latencies and CPU time during fsync/O_DIRECT writes. Miao Xie has a big series of fixes and is spreading out ordered operations over more CPUs. This improves performance and reduces contention. I've put in fixes for error handling around hash collisions. These are going back to individual stable kernels as I test against them. Otherwise we have a lot of fixes and cleanups, thanks everyone! raid5/6 is being rebased against the device replacement code. I'll have it posted this Friday along with a nice series of benchmarks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits) Btrfs: fix a bug of per-file nocow Btrfs: fix hash overflow handling Btrfs: don't take inode delalloc mutex if we're a free space inode Btrfs: fix autodefrag and umount lockup Btrfs: fix permissions of empty files not affected by umask Btrfs: put raid properties into global table Btrfs: fix BUG() in scrub when first superblock reading gives EIO Btrfs: do not call file_update_time in aio_write Btrfs: only unlock and relock if we have to Btrfs: use tokens where we can in the tree log Btrfs: optimize leaf_space_used Btrfs: don't memset new tokens Btrfs: only clear dirty on the buffer if it is marked as dirty Btrfs: move checks in set_page_dirty under DEBUG Btrfs: log changed inodes based on the extent map tree Btrfs: add path->really_keep_locks Btrfs: do not mark ems as prealloc if we are writing to them Btrfs: keep track of the extents original block length Btrfs: inline csums if we're fsyncing Btrfs: don't bother copying if we're only logging the inode ...
| * Btrfs: make ordered extent be flushed by multi-taskMiao Xie2012-12-111-1/+4
| | | | | | | | | | | | | | | | | | Though the process of the ordered extents is a bit different with the delalloc inode flush, but we can see it as a subset of the delalloc inode flush, so we also handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| * Btrfs: make ordered operations be handled by multi-taskMiao Xie2012-12-111-1/+1
| | | | | | | | | | | | | | | | The process of the ordered operations is similar to the delalloc inode flush, so we handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* | Fix misspellings of "whether" in comments.Adam Buchbinder2012-11-191-1/+1
|/ | | | | | | | "Whether" is misspelled in various comments across the tree; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* Btrfs: kill obsolete arguments in btrfs_wait_ordered_extentsLiu Bo2012-10-041-2/+1
| | | | | | nocow_only is now an obsolete argument. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
* Btrfs: use a slab for ordered extents allocationMiao Xie2012-10-011-0/+2
| | | | | | | | | | | | | | The ordered extent allocation is in the fast path of the IO, so use a slab to improve the speed of the allocation. "Size of the struct is 280, so this will fall into the size-512 bucket, giving 8 objects per page, while own slab will pack 14 objects into a page. Another benefit I see is to check for leaked objects when the module is removed (and the cache destroy takes place)." -- David Sterba Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
* Btrfs: fix file extent discount problem in the, snapshotMiao Xie2012-10-011-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a snapshot is created while we are writing some data into the file, the i_size of the corresponding file in the snapshot will be wrong, it will be beyond the end of the last file extent. And btrfsck will report: root 256 inode 257 errors 100 Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # dd if=/dev/zero of=tmpfile bs=4M count=1024 & # for ((i=0; i<4; i++)) > do > btrfs sub snap . $i > done This because the algorithm of disk_i_size update is wrong. Though there are some ordered extents behind the current one which we use to update disk_i_size, it doesn't mean those extents will be dealt with in the same transaction. So We shouldn't use the offset of those extents to update disk_i_size. Or we will get the wrong i_size in the snapshot. We fix this problem by recording the max real i_size. If we find there is a ordered extent which is in front of the current one and doesn't complete, we will record the end of the current one into that ordered extent. Surely, if the current extent holds the end of other extent(it must be greater than the current one because it is behind the current one), we will record the number that the current extent holds. In this way, we can exclude the ordered extents that may not be dealth with in the same transaction, and be easy to know the real disk_i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
* Btrfs: finish ordered extents in their own threadJosef Bacik2012-05-301-2/+11
| | | | | | | | | | | | | | We noticed that the ordered extent completion doesn't really rely on having a page and that it could be done independantly of ending the writeback on a page. This patch makes us not do the threaded endio stuff for normal buffered writes and direct writes so we can end page writeback as soon as possible (in irq context) and only start threads to do the ordered work when it is actually done. Compression needs to be reworked some to take advantage of this as well, but atm it has to do a find_get_page in its endio handler so it must be done in its own thread. This makes direct writes quite a bit faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* btrfs: return void in functions without error conditionsJeff Mahoney2012-03-221-12/+12
| | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com>
* btrfs: Allow to add new compression algorithmLi Zefan2010-12-221-1/+7
| | | | | | | | | | Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* Btrfs: deal with DIO bios that span more than one ordered extentChris Mason2010-11-291-0/+3
| | | | | | | | | | | | | The new DIO bio splitting code has problems when the bio spans more than one ordered extent. This will happen as the generic DIO code merges our get_blocks calls together into a bigger single bio. This fixes things by walking forward in the ordered extent code finding all the overlapping ordered extents and completing them all at once. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add basic DIO read/write supportJosef Bacik2010-05-251-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This provides basic DIO support for reading and writing. It does not do the work to recover from mismatching checksums, that will come later. A few design changes have been made from Jim's code (sorry Jim!) 1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO code in order to account for all of BTRFS's oddities, but thanks to that work it seems like the best bet is to just ignore compression and such and just opt to fallback on buffered IO. 2) Fallback on buffered IO for compressed or inline extents. Jim's code did it's own buffering to make dio with compressed extents work. Now we just fallback onto normal buffered IO. 3) Use ordered extents for the writes so that all of the lock_extent() lookup_ordered() type checks continue to work. 4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with DIO writes. I've tested this with fsx and everything works great. This patch depends on my dio and filemap.c patches to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: cache ordered extent when completing ioJosef Bacik2010-03-151-1/+2
| | | | | | | | | | | | | When finishing io we run btrfs_dec_test_ordered_pending, and then immediately run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that already, so we're searching twice when we don't have to. This patch lets us pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do complete io on that ordered extent we can just use the one we found then instead of having to do another btrfs_lookup_ordered_extent. This made my fio job with the other patch go from 24 mb/s to 29 mb/s. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: change the ordered tree to use a spinlock instead of a mutexJosef Bacik2010-03-151-2/+2
| | | | | | | | | | The ordered tree used to need a mutex, but currently all we use it for is to protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have less latency, 3445.138 ms instead of 3820.633 ms. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULLEric Paris2010-03-081-1/+1
| | | | | | | | | | | | | btrfs inialize rb trees in quite a number of places by settin rb_node = NULL; The problem with this is that 17d9ddc72fb8bba0d4f678 in the linux-next tree adds a new field to that struct which needs to be NULL for the new rbtree library code to work properly. This patch uses RB_ROOT as the intializer so all of the relevant fields will be NULL'd. Without the patch I get a panic. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add delayed iputYan, Zheng2009-12-171-1/+2
| | | | | | | | | | iput() can trigger new transactions if we are dropping the final reference, so calling it in btrfs_commit_transaction may end up deadlock. This patch adds delayed iput to avoid the issue. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix disk_i_size update corner caseYan, Zheng2009-12-171-1/+1
| | | | | | | | | | There are some cases file extents are inserted without involving ordered struct. In these cases, we update disk_i_size directly, without checking pending ordered extent and DELALLOC bit. This patch extends btrfs_ordered_update_i_size() to handle these cases. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: remove duplicates of filemap_ helpersChristoph Hellwig2009-10-011-4/+0
| | | | | | | | | | Use filemap_fdatawrite_range and filemap_fdatawait_range instead of local copies of the functions. For filemap_fdatawait_range that also means replacing the awkward old wait_on_page_writeback_range calling convention with the regular filemap byte offsets. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use PagePrivate2 to track pages in the data=ordered code.Chris Mason2009-09-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs writes go through delalloc to the data=ordered code. This makes sure that all of the data is on disk before the metadata that references it. The tracking means that we have to make sure each page in an extent is fully written before we add that extent into the on-disk btree. This was done in the past by setting the EXTENT_ORDERED bit for the range of an extent when it was added to the data=ordered code, and then clearing the EXTENT_ORDERED bit in the extent state tree as each page finished IO. One of the reasons we had to do this was because sometimes pages are magically dirtied without page_mkwrite being called. The EXTENT_ORDERED bit is checked at writepage time, and if it isn't there, our page become dirty without going through the proper path. These bit operations make for a number of rbtree searches for each page, and can cause considerable lock contention. This commit switches from the EXTENT_ORDERED bit to use PagePrivate2. As pages go into the ordered code, PagePrivate2 is set on each one. This is a cheap operation because we already have all the pages locked and ready to go. As IO finishes, the PagePrivate2 bit is cleared and the ordered accoutning is updated for each page. At writepage time, if the PagePrivate2 bit is missing, we go into the writepage fixup code to handle improperly dirtied pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add extra flushing for renames and truncatesChris Mason2009-03-311-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Renames and truncates are both common ways to replace old data with new data. The filesystem can make an effort to make sure the new data is on disk before actually replacing the old data. This is especially important for rename, which many application use as though it were atomic for both the data and the metadata involved. The current btrfs code will happily replace a file that is fully on disk with one that was just created and still has pending IO. If we crash after transaction commit but before the IO is done, we'll end up replacing a good file with a zero length file. The solution used here is to create a list of inodes that need special ordering and force them to disk before the commit is done. This is similar to the ext3 style data=ordering, except it is only done on selected files. Btrfs is able to get away with this because it does not wait on commits very often, even for fsync (which use a sub-commit). For renames, we order the file when it wasn't already on disk and when it is replacing an existing file. Larger files are sent to filemap_flush right away (before the transaction handle is opened). For truncates, we order if the file goes from non-zero size down to zero size. This is a little different, because at the time of the truncate the file has no dirty bytes to order. But, we flag the inode so that it is added to the ordered list on close (via release method). We also immediately add it to the ordered list of the current transaction so that we can try to flush down any writes the application sneaks in before commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: move data checksumming into a dedicated treeChris Mason2008-12-081-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add fallocate support v2Yan Zheng2008-10-301-1/+3
| | | | | | | | | | | | This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: update nodatacow code v2Yan Zheng2008-10-301-2/+1
| | | | | | | | | | This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Add zlib compression supportChris Mason2008-10-291-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: O_DIRECT writes via buffered writes + invaldiateChris Mason2008-10-031-1/+1
| | | | | | | | | | | | | This reworks the btrfs O_DIRECT write code a bit. It had always fallen back to buffered IO and done an invalidate, but needed to be updated for the data=ordered code. The invalidate wasn't actually removing pages because they were still inside an ordered extent. This also combines the O_DIRECT/O_SYNC paths where possible, and kicks off IO in the main btrfs_file_write loop to keep the pipe down the the disk full as we process long writes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Fix nodatacow for the new data=ordered modeYan Zheng2008-09-251-2/+4
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>