aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Fix install_process_keyring error handlingAndi Kleen2010-10-281-1/+1
| | | | | | | | | Fix an incorrect error check that returns 1 for error instead of the expected error code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* parisc: fix compile failure with kmap_atomic changesJames Bottomley2010-10-281-4/+4
| | | | | | | | | | | | Commit 3e4d3af501cc ("mm: stack based kmap_atomic()") overlooked the fact that parisc uses kmap as a coherence mechanism, so even though we have no highmem, we do need to supply our own versions of kmap (and atomic). This patch converts the parisc kmap to the form which is needed to keep it compiling (it's a simple prototype and name change). Signed-off-by: James Bottomley <James.Bottomley@suse.de> Acked-by: Kyle McMartin <kyle@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix compile brekage with !CONFIG_BLOCKIngo Molnar2010-10-281-0/+1
| | | | | | | | | | | | | | | | | | Today's git tree fails to build on !CONFIG_BLOCK, due to upstream commit 367a51a33902 ("fs: Add FITRIM ioctl"): include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’ include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’ include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’ The commit adds uint64_t type usage to fs.h, but linux/types.h is not included explicitly - it's only included implicitly via linux/blk_types.h, and there only if CONFIG_BLOCK is enabled. Add the explicit #include to fix this. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.infradead.org/battery-2.6Linus Torvalds2010-10-281-0/+1
|\ | | | | | | | | * git://git.infradead.org/battery-2.6: power_supply: Mark twl4030_charger as broken
| * power_supply: Mark twl4030_charger as brokenAnton Vorontsov2010-10-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The driver is not buildable without MFD changes. For now, let's disable the driver as it breaks build for major platforms (i.e. x86). CC [M] drivers/power/twl4030_charger.o drivers/power/twl4030_charger.c: In function 'twl4030_clear_set_boot_bci': drivers/power/twl4030_charger.c:105: error: 'TWL4030_PM_MASTER_BOOT_BCI' undeclared (first use in this function) drivers/power/twl4030_charger.c:105: error: (Each undeclared identifier is reported only once drivers/power/twl4030_charger.c:105: error: for each function it appears in.) drivers/power/twl4030_charger.c: In function 'twl4030_bci_have_vbus': drivers/power/twl4030_charger.c:137: error: 'TWL4030_PM_MASTER_STS_HW_CONDITIONS' undeclared (first use in this function) drivers/power/twl4030_charger.c: In function 'twl4030_bci_probe': drivers/power/twl4030_charger.c:477: warning: overflow in implicit constant conversion drivers/power/twl4030_charger.c:485: warning: overflow in implicit constant conversion make[2]: *** [drivers/power/twl4030_charger.o] Error 1 We can re-enable it if MFD tree will finally merge into 2.6.37. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
* | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-10-286-44/+11
|\ \ | |/ |/| | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: remove overlapping PCI IDs block: cciss: fix information leak to userland drivers/block/aoe/aoeblk.c: ratelimit a warning printk drivers/block/z2ram.c: correct printing of sector_t aoe: don't use flush_scheduled_work() drivers/block/drbd/drbd_main.c: fix error path loop: Properly clear sysfs in autoclear mode
| * cciss: remove overlapping PCI IDsMike Miller2010-10-281-36/+0
| | | | | | | | | | | | | | | | This patch removes the controller overlap between cciss and hpsa. It was decided that no overlap should exist. All new controllers will use the hpsa SCSI based driver. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * block: cciss: fix information leak to userlandVasiliy Kulikov2010-10-281-0/+1
| | | | | | | | | | | | | | | | | | Structure IOCTL_Command_struct is copied to userland with some padding fields at the end of the struct unitialized. It leads to leaking of contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * drivers/block/aoe/aoeblk.c: ratelimit a warning printkAndrew Morton2010-10-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As described in https://bugzilla.kernel.org/show_bug.cgi?id=19922 : I had an AoE device go down overnight, and while a server was trying to : write to it, it was also writing this message to its logs: : : 209 printk(KERN_INFO "aoe: device %ld.%d is not up\n", : 210 d->aoemajor, d->aoeminor); : : The message appeared many times per second, and over several hours : produced about 7.5 gigabytes of log files, filling up all free space on : the root filesystem. Cc: "Ed L. Cashin" <ecashin@coraid.com> Suggested-by: Roman Mamedov <roman@rm.pp.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * drivers/block/z2ram.c: correct printing of sector_tGeert Uytterhoeven2010-10-281-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If CONFIG_LBDAF=y, `sector_t' becomes `u64' instead of `unsigned long': drivers/block/z2ram.c: In function ¡do_z2_request¢: drivers/block/z2ram.c:83: warning: format %lu expects type `long unsigned int', but argument 2 has type `sector_t' Hence always cast it to `unsigned long long' for printing. Also do the pr_err() dance, while we're at it. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * aoe: don't use flush_scheduled_work()Tejun Heo2010-10-281-3/+1
| | | | | | | | | | | | | | | | | | | | | | flush_scheduled_work() is deprecated and scheduled to be removed. Directly cancel aoedev->work on free instead of depending on flush_scheduled_works(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * drivers/block/drbd/drbd_main.c: fix error pathNicolas Kaiser2010-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | Failure to create drbd_ee_mempool appears not to get checked. Looks like a copy-and-paste problem to me. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * loop: Properly clear sysfs in autoclear modeMilan Broz2010-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | In autoclear mode bdev is NULL but the sysfs entry should be destroyed otherwise this warning appears: WARNING: at fs/sysfs/dir.c:451 sysfs_add_one+0x82/0x95() sysfs: cannot create duplicate filename '/devices/virtual/block/loop0/loop' Fixes commit ee86273062cbb310665fe49e1f1937d2cf85b0b9 Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | Merge branch 'upstream-merge' of ↵Linus Torvalds2010-10-2733-1131/+2510
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits) ext4,jbd2: convert tracepoints to use major/minor numbers ext4: optimize orphan_list handling for ext4_setattr ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new ext4: fix compile error in ext4_fallocate() ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static ext4: rename {ext,idx}_pblock and inline small extent functions ext4: make various ext4 functions be static ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() ext4: fix kernel oops if the journal superblock has a non-zero j_errno ext4: update writeback_index based on last page scanned ext4: implement writeback livelock avoidance using page tagging ext4: tidy up a void argument in inode.c ext4: add batched_discard into ext4 feature list ext4: Add batched discard support for ext4 fs: Add FITRIM ioctl ext4: Use return value from sb_issue_discard() ext4: Check return value of sb_getblk() and friends ext4: use bio layer instead of buffer layer in mpage_da_submit_io ...
| * \ Merge branch 'next' into upstream-mergeTheodore Ts'o2010-10-2733-1131/+2510
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: fs/ext4/inode.c fs/ext4/mballoc.c include/trace/events/ext4.h
| | * | ext4,jbd2: convert tracepoints to use major/minor numbersTheodore Ts'o2010-10-272-156/+251
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately perf can't deal with anything other than direct structure accesses in the TP_printk() section. It will drop dead when it sees jbd2_dev_to_name() in the "print fmt" section of the tracepoint. Addresses-Google-Bug: 3138508 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: optimize orphan_list handling for ext4_setattrDmitry Monakhov2010-10-271-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Surprisingly chown() on ext4 is not SMP scalable operation. Due to unconditional orphan_del(NULL, inode) in ext4_setattr() result in significant performance overhead because of global orphan mutex, especially in no-journal mode (where orphan_add() is noop). It is possible to skip explicit orphan_del if possible. Results of fchown() micro-benchmark in no-journal mode while (1) { iteration++; fchown(fd, uid, gid); fchown(fd, uid + 1, gid + 1) } measured: iterations per millisecond | nr_tasks | w/o patch | with patch | | 1 | 142 | 185 | | 4 | 109 | 642 | Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: fix unbalanced mutex unlock in error path of ext4_li_request_newNicolas Kaiser2010-10-271-14/+8
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: fix compile error in ext4_fallocate()Kazuya Mio2010-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got the following compile error. CC [M] fs/ext4/extents.o fs/ext4/extents.c: In function 'ext4_fallocate': fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function) fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once fs/ext4/extents.c:3772: error: for each function it appears in.) make[2]: *** [fs/ext4/extents.o] Error 1 The patch fixes this problem. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them staticEric Sandeen2010-10-272-81/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions are only used within fs/ext4/mballoc.c, so move them so they are used after they are defined, and then make them be static. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()Theodore Ts'o2010-10-274-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a namespace leak from fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: move flush_completed_IO to fs/ext4/fsync.c and make it staticTheodore Ts'o2010-10-273-84/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a namespace leak by moving the function to the file where it is used and making it static. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: rename {ext,idx}_pblock and inline small extent functionsTheodore Ts'o2010-10-274-118/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup namespace leaks from fs/ext4 and the inline trivial functions ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the code size actually shrinks when we make these functions inline, they're so trivial. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: make various ext4 functions be staticTheodore Ts'o2010-10-277-44/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions have no need to be exported beyond file context. No functions needed to be moved for this commit; just some function declarations changed to be static and removed from header files. (A similar patch was submitted by Eric Sandeen, but I wanted to handle code movement in separate patches to make sure code changes didn't accidentally get dropped.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()Theodore Ts'o2010-10-277-34/+34
| | | | | | | | | | | | | | | | | | | | | | | | This is a cleanup to avoid namespace leaks out of fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: fix kernel oops if the journal superblock has a non-zero j_errnoTheodore Ts'o2010-10-272-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 84061e0 fixed an accounting bug only to introduce the possibility of a kernel OOPS if the journal has a non-zero j_errno field indicating that the file system had detected a fs inconsistency. After the journal replay, if the journal superblock indicates that the file system has an error, this indication is transfered to the file system and then ext4_commit_super() is called to write this to the disk. But since the percpu counters are now initialized after the journal replay, the call to ext4_commit_super() will cause a kernel oops since it needs to use the percpu counters the ext4 superblock structure. The fix is to skip setting the ext4 free block and free inode fields if the percpu counter has not been set. Thanks to Ken Sumrall for reporting and analyzing the root causes of this bug. Addresses-Google-Bug: #3054080 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: update writeback_index based on last page scannedEric Sandeen2010-10-271-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As pointed out in a prior patch, updating the mapping's writeback_index based on pages written isn't quite right; what the writeback index is really supposed to reflect is the next page which should be scanned for writeback during periodic flush. As in write_cache_pages(), write_cache_pages_da() does this scanning for us as we assemble the mpd for later writeout. If we keep track of the next page after the current scan, we can easily update writeback_index without worrying about pages written vs. pages skipped, etc. Without this, an fsync will reset writeback_index to 0 (its starting index) + however many pages it wrote, which can mess up the progress of periodic flush. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: implement writeback livelock avoidance using page taggingEric Sandeen2010-10-272-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is analogous to Jan Kara's commit, f446daaea9d4a420d16c606f755f3689dcb2d0ce mm: implement writeback livelock avoidance using page tagging but since we forked write_cache_pages, we need to reimplement it there (and in ext4_da_writepages, since range_cyclic handling was moved to there) If you start a large buffered IO to a file, and then set fsync after it, you'll find that fsync does not complete until the other IO stops. If you continue re-dirtying the file (say, putting dd with conv=notrunc in a loop), when fsync finally completes (after all IO is done), it reports via tracing that it has written many more pages than the file contains; in other words it has synced and re-synced pages in the file multiple times. This then leads to problems with our writeback_index update, since it advances it by pages written, and essentially sets writeback_index off the end of the file... With the following patch, we only sync as much as was dirty at the time of the sync. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: tidy up a void argument in inode.cEric Sandeen2010-10-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This doesn't fix anything at all, it just removes a vestige of prior use from __mpage_da_writepage() __mpage_da_writepage() had a *void argument leftover from its previous life as a callback; make it reflect the actual type. Fixing this up makes it slightly more obvious to read, and enables proper typechecking. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: add batched_discard into ext4 feature listLukas Czerner2010-10-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Should be applied on the top of "lazy inode table initialization" and "batched discard support" patch-sets. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: Add batched discard support for ext4Lukas Czerner2010-10-273-0/+188
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Walk through allocation groups and trim all free extents. It can be invoked through FITRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search for free extents in allocation groups specified by Byte range start -> start+len. When the free extent is within this range, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Since fstrim is a long operation it is good to have an ability to interrupt it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | fs: Add FITRIM ioctlLukas Czerner2010-10-272-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds an filesystem independent ioctl to allow implementation of file system batched discard support. I takes fstrim_range structure as an argument. fstrim_range is definec in the include/fs.h and its definition is as follows. struct fstrim_range { start; len; minlen; } start - first Byte to trim len - number of Bytes to trim from start minlen - minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This will be rounded up to fs block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: start = 0; len = ULLONG_MAX; minlen = 0; So it will trim the whole file system at one run. After the FITRIM is done, the number of actually discarded Bytes is stored in fstrim_range.len to give the user better insight on how much storage space has been really released for wear-leveling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: Use return value from sb_issue_discard()Lukas Czerner2010-10-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use return value from sb_issue_discard() as return value in ext4_issue_discard(). Since sb_issue_discard() may result in more serious errors than just -EOPNOTSUPP it is worth to inform user of this function about them to handle error cases properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: Check return value of sb_getblk() and friendsNamhyung Kim2010-10-272-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fail block allocation if sb_getblk() returns NULL. In that case, sb_find_get_block() also likely to fail so that it should skip calling ext4_forget(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: use bio layer instead of buffer layer in mpage_da_submit_ioTheodore Ts'o2010-10-276-109/+489
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call the block I/O layer directly instad of going through the buffer layer. This should give us much better performance and scalability, as well as lowering our CPU utilization when doing buffered writeback. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()Theodore Ts'o2010-10-271-101/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This massively simplifies the ext4_da_writepages() code path by completely removing mpage_put_bnr_bhs(), which is almost 100 lines of code iterating over a set of pages using pagevec_lookup(), and folds that functionality into mpage_da_submit_io()'s existing pagevec_lookup() loop. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: inline walk_page_buffers() into mpage_da_submit_ioTheodore Ts'o2010-10-271-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expand the call: if (walk_page_buffers(NULL, page_bufs, 0, len, NULL, ext4_bh_delay_or_unwritten)) goto redirty_page into mpage_da_submit_io(). This will allow us to merge in mpage_put_bnr_to_bhs() in the next patch. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: inline ext4_writepage() into mpage_da_submit_io()Theodore Ts'o2010-10-271-11/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a prepratory step to switching to bio_submit, inline ext4_writepage() into mpage_da_submit() and then simplify things a bit. This makes it clearer what mpage_da_submit needs to do. Also, move the ClearPageChecked(page) call into __ext4_journalled_writepage(), as a minor bit of cleanup refactoring. This also allows us to pull i_size_read() and ext4_should_journal_data() out of the loop, which should be a very minor CPU savings. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: simplify ext4_writepage()Theodore Ts'o2010-10-271-48/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The actual code in ext4_writepage() is unnecessarily convoluted. Simplify it so it is easier to understand, but otherwise logically equivalent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: call mpage_da_submit_io() from mpage_da_map_blocks()Theodore Ts'o2010-10-271-33/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eventually we need to completely reorganize the ext4 writepage callpath, but for now, we simplify things a little by calling mpage_da_submit_io() from mpage_da_map_blocks(), since all of the places where we call mpage_da_map_blocks() it is followed up by a call to mpage_da_submit_io(). We're also a wee bit better with respect to error handling, but there are still a number of issues where it's not clear what the right thing is to do with ext4 functions deep in the writeback codepath fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: use KMEM_CACHE instead of kmem_cache_createTheodore Ts'o2010-10-272-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem cache. This slab tends to be fairly static, so it shouldn't be marked as likely to have free pages that can be reclaimed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: use search_dirblock() in ext4_dx_find_entry()Theodore Ts'o2010-10-271-21/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the search_dirblock() in ext4_dx_find_entry(). It makes the code easier to read, and it takes advantage of common code. It also saves 100 bytes or so of text space. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Brad Spengler <spender@grsecurity.net>
| | * | ext4: avoid uninitialized memory references in ext3_htree_next_block()Theodore Ts'o2010-10-271-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the first block of htree directory is missing '.' or '..' but is otherwise a valid directory, and we do a lookup for '.' or '..', it's possible to dereference an uninitialized memory pointer in ext4_htree_next_block(). We avoid this by moving the special case from ext4_dx_find_entry() to ext4_find_entry(); this also means we can optimize ext4_find_entry() slightly when NFS looks up "..". Thanks to Brad Spengler for pointing a Clang warning that led me to look more closely at this code. The warning was harmless, but it was useful in pointing out code that was too ugly to live. This warning was also reported by Roman Borisov. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Brad Spengler <spender@grsecurity.net>
| | * | ext4: remove unused ext4_sb_info membersEric Sandeen2010-10-271-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not that these take up a lot of room, but the structure is long enough as it is, and there's no need to confuse people with these various undocumented & unused structure members... Signed-off-by: Eric Sandeen <sandeen@redaht.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: queue conversion after adding to inode's completed IO listEric Sandeen2010-10-271-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By queuing the io end on the unwritten workqueue before adding it to our inode's list of completed IOs, I think we run the risk of the work getting completed, and the IO freed, before we try to add it to the inode's i_completed_io_list. It should be safe to add it to the inode's list of completed IOs, and -then- queue it for completion, I think. Thanks to Dave Chinner for pointing out the race. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: don't use ext4_allocation_contexts for tracingEric Sandeen2010-10-272-91/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many tracepoints were populating an ext4_allocation_context to pass in, but this requires a slab allocation even when tracepoints are off. In fact, 4 of 5 of these allocations were only for tracing. In addition, we were only using a small fraction of the 144 bytes of this structure for this purpose. We can do away with all these alloc/frees of the ac and simply pass in the bits we care about, instead. I tested this by turning on tracing and running through xfstests on x86_64. I did not actually do anything with the trace output, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: fix oops in trace_ext4_mb_release_group_paEric Sandeen2010-10-271-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our QA reported an oops in the ext4_mb_release_group_pa tracing, and Josef Bacik pointed out that it was because we may have a non-null but uninitialized ac_inode in the allocation context. I can reproduce it when running xfstests with ext4 tracepoints on, on a CONFIG_SLAB_DEBUG kernel. We call trace_ext4_mb_release_group_pa from 2 places, ext4_mb_discard_group_preallocations and ext4_mb_discard_lg_preallocations In both cases we allocate an ac as a container just for tracing (!) and never fill in the ac_inode. There's no reason to be assigning, testing, or printing it as far as I can see, so just remove it from the tracepoint. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: fix potential infinite loop in ext4_da_writepages()Toshiyuki Okajima2010-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On linux-2.6.36-rc2, if we execute the following script, we can hang the system when the /bin/sync command is executed: ======================================================================== #!/bin/sh echo -n "HANG UP TEST: " /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null /sbin/mkfs.ext4 -Fq /tmp/img /bin/mount -o loop -t ext4 /tmp/img /mnt /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \ seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null /bin/sync /bin/umount /mnt echo "DONE" exit 0 ======================================================================== We can see the following backtrace if we get the kdump when this hangup occurs: ====================================================================== kthread() => bdi_writeback_thread() => wb_do_writeback() => wb_writeback() => writeback_inodes_wb() => writeback_sb_inodes() => writeback_single_inode() => ext4_da_writepages() ---+ ^ infinite | | loop | +-------------+ ====================================================================== The reason why this hangup happens is described as follows: 1) We write the last extent block of the file whose size is the filesystem maximum size. 2) "BH_Delay" flag is set on the buffer_head of its block. 3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX) - the member, "m_len" of struct mpage_da_data is 1 mpage_put_bnr_to_bhs() which is called via ext4_da_writepages() cannot clear "BH_Delay" flag of the buffer_head because the type of m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow. Therefore an infinite loop occurs because ext4_da_writepages() cannot write the page (which corresponds to the block) since "BH_Delay" flag isn't cleared. ---------------------------------------------------------------------- static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, struct ext4_map_blocks *map) { ... int blocks = map->m_len; ... do { // cur_logical = 4294967295 // map->m_lblk = 4294967295 // blocks = 1 // *** map->m_lblk + blocks == 0 (OVERFLOW!) *** // (cur_logical >= map->m_lblk + blocks) => true if (cur_logical >= map->m_lblk + blocks) break; ---------------------------------------------------------------------- NOTE: Mounting with the nodelalloc option will avoid this codepath, and thus, avoid this hang Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * | ext4: improve llseek error handling for overly large seek offsetsToshiyuki Okajima2010-10-273-2/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The llseek system call should return EINVAL if passed a seek offset which results in a write error. What this maximum offset should be depends on whether or not the huge_file file system feature is set, and whether or not the file is extent based or not. If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be written (write systemcall) is different from the maximum size which can be sought (lseek systemcall). For example, the following 2 cases demonstrates the differences between the maximum size which can be written, versus the seek offset allowed by the llseek system call: #1: mkfs.ext3 <dev>; mount -t ext4 <dev> #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev> Table. the max file size which we can write or seek at each filesystem feature tuning and file flag setting +============+===============================+===============================+ | \ File flag| | | | \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL | |case \| | | +------------+-------------------------------+-------------------------------+ | #1 | write: 2194719883264 | write: -------------- | | | seek: 2199023251456 | seek: -------------- | +------------+-------------------------------+-------------------------------+ | #2 | write: 4402345721856 | write: 17592186044415 | | | seek: 17592186044415 | seek: 17592186044415 | +------------+-------------------------------+-------------------------------+ The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes. (llseek of ext4_file_operations is generic_file_llseek which uses sb->s_maxbytes.) Therefore we create ext4 llseek function which uses 2 maxbytes. The new own function originates from generic_file_llseek(). If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
| | * | ext4: don't update sb journal_devnum when RO devMaciej Żenczykowski2010-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An ext4 filesystem on a read-only device, with an external journal which is at a different device number then recorded in the superblock will fail to honor the read-only setting of the device and trigger a superblock update (write). For example: - ext4 on a software raid which is in read-only mode - external journal on a read-write device which has changed device num - attempt to mount with -o journal_dev=<new_number> - hits BUG_ON(mddev->ro = 1) in md.c Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>