| From: Mike Snitzer <snitzer@redhat.com> |
| Date: Tue, 26 Jun 2018 12:04:23 -0400 |
| Subject: dm thin: handle running out of data space vs concurrent discard |
| |
| commit a685557fbbc3122ed11e8ad3fa63a11ebc5de8c3 upstream. |
| |
| Discards issued to a DM thin device can complete to userspace (via |
| fstrim) _before_ the metadata changes associated with the discards is |
| reflected in the thinp superblock (e.g. free blocks). As such, if a |
| user constructs a test that loops repeatedly over these steps, block |
| allocation can fail due to discards not having completed yet: |
| 1) fill thin device via filesystem file |
| 2) remove file |
| 3) fstrim |
| |
| From initial report, here: |
| https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html |
| |
| "The root cause of this issue is that dm-thin will first remove |
| mapping and increase corresponding blocks' reference count to prevent |
| them from being reused before DISCARD bios get processed by the |
| underlying layers. However. increasing blocks' reference count could |
| also increase the nr_allocated_this_transaction in struct sm_disk |
| which makes smd->old_ll.nr_allocated + |
| smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks. |
| In this case, alloc_data_block() will never commit metadata to reset |
| the begin pointer of struct sm_disk, because sm_disk_get_nr_free() |
| always return an underflow value." |
| |
| While there is room for improvement to the space-map accounting that |
| thinp is making use of: the reality is this test is inherently racey and |
| will result in the previous iteration's fstrim's discard(s) completing |
| vs concurrent block allocation, via dd, in the next iteration of the |
| loop. |
| |
| No amount of space map accounting improvements will be able to allow |
| user's to use a block before a discard of that block has completed. |
| |
| So the best we can really do is allow DM thinp to gracefully handle such |
| aggressive use of all the pool's data by degrading the pool into |
| out-of-data-space (OODS) mode. We _should_ get that behaviour already |
| (if space map accounting didn't falsely cause alloc_data_block() to |
| believe free space was available).. but short of that we handle the |
| current reality that dm_pool_alloc_data_block() can return -ENOSPC. |
| |
| Reported-by: Dennis Yang <dennisyang@qnap.com> |
| Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
| Signed-off-by: Ben Hutchings <ben@decadent.org.uk> |
| --- |
| drivers/md/dm-thin.c | 11 +++++++++-- |
| 1 file changed, 9 insertions(+), 2 deletions(-) |
| |
| --- a/drivers/md/dm-thin.c |
| +++ b/drivers/md/dm-thin.c |
| @@ -938,6 +938,8 @@ static void schedule_zero(struct thin_c |
| |
| static void set_pool_mode(struct pool *pool, enum pool_mode new_mode); |
| |
| +static void requeue_bios(struct pool *pool); |
| + |
| static void check_for_space(struct pool *pool) |
| { |
| int r; |
| @@ -950,8 +952,10 @@ static void check_for_space(struct pool |
| if (r) |
| return; |
| |
| - if (nr_free) |
| + if (nr_free) { |
| set_pool_mode(pool, PM_WRITE); |
| + requeue_bios(pool); |
| + } |
| } |
| |
| /* |
| @@ -1028,7 +1032,10 @@ static int alloc_data_block(struct thin_ |
| |
| r = dm_pool_alloc_data_block(pool->pmd, result); |
| if (r) { |
| - metadata_operation_failed(pool, "dm_pool_alloc_data_block", r); |
| + if (r == -ENOSPC) |
| + set_pool_mode(pool, PM_OUT_OF_DATA_SPACE); |
| + else |
| + metadata_operation_failed(pool, "dm_pool_alloc_data_block", r); |
| return r; |
| } |
| |