| From foo@baz Sun May 27 17:33:38 CEST 2018 |
| From: "shidao.ytt" <shidao.ytt@alibaba-inc.com> |
| Date: Wed, 31 Jan 2018 16:19:55 -0800 |
| Subject: mm/fadvise: discard partial page if endbyte is also EOF |
| |
| From: "shidao.ytt" <shidao.ytt@alibaba-inc.com> |
| |
| [ Upstream commit a7ab400d6fe73d0119fdc234e9982a6f80faea9f ] |
| |
| During our recent testing with fadvise(FADV_DONTNEED), we find that if |
| given offset/length is not page-aligned, the last page will not be |
| discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/), |
| we map a 10KB-sized file into memory and then try to run this tool to |
| evict the whole file mapping, but the last single page always remains |
| staying in the memory: |
| |
| $./vmtouch -e test_10K |
| Files: 1 |
| Directories: 0 |
| Evicted Pages: 3 (12K) |
| Elapsed: 2.1e-05 seconds |
| |
| $./vmtouch test_10K |
| Files: 1 |
| Directories: 0 |
| Resident Pages: 1/3 4K/12K 33.3% |
| Elapsed: 5.5e-05 seconds |
| |
| However when we test with an older kernel, say 3.10, this problem is |
| gone. So we wonder if this is a regression: |
| |
| $./vmtouch -e test_10K |
| Files: 1 |
| Directories: 0 |
| Evicted Pages: 3 (12K) |
| Elapsed: 8.2e-05 seconds |
| |
| $./vmtouch test_10K |
| Files: 1 |
| Directories: 0 |
| Resident Pages: 0/3 0/12K 0% <-- partial page also discarded |
| Elapsed: 5e-05 seconds |
| |
| After digging a little bit into this problem, we find it seems not a |
| regression. Not discarding partial page is likely to be on purpose |
| according to commit 441c228f817f ("mm: fadvise: document the |
| fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel |
| Gorman. He explained why partial pages should be preserved instead of |
| being discarded when using fadvise(FADV_DONTNEED). |
| |
| However, the interesting part is that the actual code did NOT work as |
| the same as it was described, the partial page was still discarded |
| anyway, due to a calculation mistake of `end_index' passed to |
| invalidate_mapping_pages(). This mistake has not been fixed until |
| recently, that's why we fail to reproduce our problem in old kernels. |
| The fix is done in commit 18aba41cbf ("mm/fadvise.c: do not discard |
| partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin. |
| |
| Back to the original testing, our problem becomes that there is a |
| special case that, if the page-unaligned `endbyte' is also the end of |
| file, it is not necessary at all to preserve the last partial page, as |
| we all know no one else will use the rest of it. It should be safe |
| enough if we just discard the whole page. So we add an EOF check in |
| this patch. |
| |
| We also find a poosbile real world issue in mainline kernel. Assume |
| such scenario: A userspace backup application want to backup a huge |
| amount of small files (<4k) at once, the developer might (I guess) want |
| to use fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED |
| won't really happen since the only page mapped is a partial page, and |
| kernel will preserve it. Our patch also fixes this problem, since we |
| know the endbyte is EOF, so we discard it. |
| |
| Here is a simple reproducer to reproduce and verify each scenario we |
| described above: |
| |
| test_fadvise.c |
| ============================== |
| #include <sys/mman.h> |
| #include <sys/stat.h> |
| #include <fcntl.h> |
| #include <stdlib.h> |
| #include <string.h> |
| #include <stdio.h> |
| #include <unistd.h> |
| |
| int main(int argc, char **argv) |
| { |
| int i, fd, ret, len; |
| struct stat buf; |
| void *addr; |
| unsigned char *vec; |
| char *strbuf; |
| ssize_t pagesize = getpagesize(); |
| ssize_t filesize; |
| |
| fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR); |
| if (fd < 0) |
| return -1; |
| filesize = strtoul(argv[2], NULL, 10); |
| |
| strbuf = malloc(filesize); |
| memset(strbuf, 42, filesize); |
| write(fd, strbuf, filesize); |
| free(strbuf); |
| fsync(fd); |
| |
| len = (filesize + pagesize - 1) / pagesize; |
| printf("length of pages: %d\n", len); |
| |
| addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0); |
| if (addr == MAP_FAILED) |
| return -1; |
| |
| ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED); |
| if (ret < 0) |
| return -1; |
| |
| vec = malloc(len); |
| ret = mincore(addr, filesize, (void *)vec); |
| if (ret < 0) |
| return -1; |
| |
| for (i = 0; i < len; i++) |
| printf("pages[%d]: %x\n", i, vec[i] & 0x1); |
| |
| free(vec); |
| close(fd); |
| |
| return 0; |
| } |
| ============================== |
| |
| Test 1: running on kernel with commit 18aba41cbf reverted: |
| |
| [root@caspar ~]# uname -r |
| 4.15.0-rc6.revert+ |
| [root@caspar ~]# ./test_fadvise file1 1024 |
| length of pages: 1 |
| pages[0]: 0 # <-- partial page discarded |
| [root@caspar ~]# ./test_fadvise file2 8192 |
| length of pages: 2 |
| pages[0]: 0 |
| pages[1]: 0 |
| [root@caspar ~]# ./test_fadvise file3 10240 |
| length of pages: 3 |
| pages[0]: 0 |
| pages[1]: 0 |
| pages[2]: 0 # <-- partial page discarded |
| |
| Test 2: running on mainline kernel: |
| |
| [root@caspar ~]# uname -r |
| 4.15.0-rc6+ |
| [root@caspar ~]# ./test_fadvise test1 1024 |
| length of pages: 1 |
| pages[0]: 1 # <-- partial and the only page not discarded |
| [root@caspar ~]# ./test_fadvise test2 8192 |
| length of pages: 2 |
| pages[0]: 0 |
| pages[1]: 0 |
| [root@caspar ~]# ./test_fadvise test3 10240 |
| length of pages: 3 |
| pages[0]: 0 |
| pages[1]: 0 |
| pages[2]: 1 # <-- partial page not discarded |
| |
| Test 3: running on kernel with this patch: |
| |
| [root@caspar ~]# uname -r |
| 4.15.0-rc6.patched+ |
| [root@caspar ~]# ./test_fadvise test1 1024 |
| length of pages: 1 |
| pages[0]: 0 # <-- partial page and EOF, discarded |
| [root@caspar ~]# ./test_fadvise test2 8192 |
| length of pages: 2 |
| pages[0]: 0 |
| pages[1]: 0 |
| [root@caspar ~]# ./test_fadvise test3 10240 |
| length of pages: 3 |
| pages[0]: 0 |
| pages[1]: 0 |
| pages[2]: 0 # <-- partial page and EOF, discarded |
| |
| [akpm@linux-foundation.org: tweak code comment] |
| Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com |
| Signed-off-by: shidao.ytt <shidao.ytt@alibaba-inc.com> |
| Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com> |
| Reviewed-by: Oliver Yang <zhiche.yy@alibaba-inc.com> |
| Cc: Mel Gorman <mgorman@techsingularity.net> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
| Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| --- |
| mm/fadvise.c | 10 +++++++++- |
| 1 file changed, 9 insertions(+), 1 deletion(-) |
| |
| --- a/mm/fadvise.c |
| +++ b/mm/fadvise.c |
| @@ -126,7 +126,15 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, l |
| */ |
| start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT; |
| end_index = (endbyte >> PAGE_SHIFT); |
| - if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK) { |
| + /* |
| + * The page at end_index will be inclusively discarded according |
| + * by invalidate_mapping_pages(), so subtracting 1 from |
| + * end_index means we will skip the last page. But if endbyte |
| + * is page aligned or is at the end of file, we should not skip |
| + * that page - discarding the last page is safe enough. |
| + */ |
| + if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK && |
| + endbyte != inode->i_size - 1) { |
| /* First page is tricky as 0 - 1 = -1, but pgoff_t |
| * is unsigned, so the end_index >= start_index |
| * check below would be true and we'll discard the whole |