design/xfs-smr-structure.asciidoc - pub/scm/fs/xfs/xfs-documentation - Git at Google

 = SMR Layout Optimisation for XFS (v0.2, March 2015)
 Dave Chinner, <dchinner@redhat.com>

 == Overview

 This document describes a relatively simple way of modifying XFS using existing
 on disk structures to be able to use host-managed SMR drives.

 This assumes a userspace ZBC implementation such as libzbc will do all the heavy
 lifting work of laying out the structure of the filesystem, and that it will
 perform things like zone write pointer checking/resetting before the filesystem
 is mounted.

 == Concepts

 SMR is architected to have a set of sequentially written zones which don't allow
 out of order writes, nor do they allow overwrites of data already written in the
 zone. Zones are typically in the order of 256MB, though may actually be of
 variable size as physical geometry of the drives differ from inner to outer
 edges.

 SMR drives also typically have an outer section that is CMR technology - it
 allows random writes and overwrites to any area within those zones. Drive
 managed SMR devices use this region for internal metadata
 journalling  for block remapping tables and as a staging area for data writes
 before being written out in sequential fashion ito zones after block remapping
 has been performed.

 Recent research has shown that 6TB seagate drives have a 20-25GB CMR zone,
 which is more than enough for our purposes. Information from other vendors
 indicate that some drives will have much more CMR, hence if we design for the
 known sizes in the Seagate drives we will be fine for other drives just coming
 onto the market right now.

 For host managed/aware drives, we are going to assume that we can use this area
 directly for filesystem metadata - for our own mapping tables and things like
 the journal, inodes, directories and free space tracking. We are also going to
 assume that we can find these regions easily in the ZBC information, and that
 they are going to be contiguous rather than spread all over the drive.

 XFS already has a data-only device call the "real time" device, whose free space
 information is tracked externally in bitmaps attached to inodes that exist in
 the "data" device. All filesystem metadata exists in the "data" device, except
 maybe the journal which can also be in an external device.

 A key constraint we need to work within here is that RAID on SMR drives is a
 long way off. The main use case is for bulk storage of data in the back end of
 distributed object stores (i.e. cat pictures on the intertubes) and hence a
 filesystem per drive is the typical configuration we'll be chasing here.
 Similarly, partitioning of SMR drives makes no sense for host aware drives,
 so we are going to constrain the architecture to a single drive for now.

 == Journal modifications

 Because the XFS journal is a sequentially written circular log, we can actually
 use SMR zones for it - it does not need to be in the metadata region. This
 requires a small amount of additional complexity - we can't wrap the log as we
 currnetly do, we'll need to split the log across two zones so that we can push
 the tail into the same zone as the head, then reset the now unused zone
 and then when the log wraps it can simply start again form the beginning of the
 erased zone.

 Like a normal spinning disk, we'll want to place the log in a pair of zones near
 the middle of the drive so that we minimise the worst case seek cost of a log
 write to half of a full disk seek. There may be advantage to putting it right
 next to the metadata zone, but typically metadata writes are not correlated with
 log writes.

 Hence the only real functionality we need to add to the log is the tail pushing
 modifications to move the tail into the same zone as the head, as well as being
 able to trigger and block on zone write pointer reset operations.

 The log doesn't actually need to track the zone write pointer, though log
 recovery will need to limit the recovery head to the current write pointer of
 the lead zone.  Modifications here are limited to the function that finds the
 head of the log, and can actually be used to speed up the search algorithm.

 However, given the size of the CMR zones, we can host the journal in an
 unmodified manner inside the CMR zone and not have to worry about zone
 awareness. This is by far the simplest solution to the problem.

 == Data zones

 What we need is a mechanism for tracking the location of zones (i.e. start LBA),
 free space/write pointers within each zone, and some way of keeping track of
 that information across mounts. If we assign a real time bitmap/summary inode
 pair to each zone, we have a method of tracking free space in the zone. We can
 use the existing bitmap allocator with a small tweak (sequentially ascending,
 packed extent allocation only) to ensure that newly written blocks are allocated
 in a sane manner.

 We're going to need userspace to be able to see the contents of these inodes;
 read only access will be needed to analyse the contents of the zone, so we're
 going to need a special directory to expose this information. It would be useful
 to have a ".zones" directory hanging off the root directory that contains all
 the zone allocation inodes so userspace can simply open them.

 THis biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if
 the devices keep getting larger at the expected rate, we're going to have to
 deal with zone counts in the hundreds of thousands. Hence a single flat
 directory containing all these inodes is not going to scale, nor will we be able
 to keep them all in memory at once.

 As a result, we are going to need to group the zones for locality and efficiency
 purposes, likely as "zone groups" of, say, up to 1TB in size. Luckily, by
 keeping the zone information in inodes the information can be demand paged and
 so we don't need to pin thousands of inodes and bitmaps in memory. Zone groups
 also have other benefits...

 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
 data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.

 === Zone Cleaner

 The purpose of the cleaner is to find zones that are mostly stale space and
 consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.

 The real complexity here is finding the owner of the data that needs to be move,
 but we are in the process of solving that with the reverse mapping btree and
 parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.

 The key word here is "reorganise". We have a tool that already reorganises file
 layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
 there we can use the reverse mapping to find the inodes that own the extents
 those zones.  And from there, we can run the existing defrag code to rewrite the
 data in the file, thereby marking all the old blocks stale. This will make
 almost stale zones entirely stale, and hence then be able to be reset.

 Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.

 If we arrange zones into zoen groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
 a zone group. This ZG also allows the cleaner to maintain some level of locality
 to the data that it is re-arranging.

 === Reverse mapping btrees

 One of the complexities is that the current reverse map btree is a per
 allocation group construct. This means that, as per the current design and
 implementation, it will not work with the inode based bitmap allocator. This,
 however, is not actually a major problem thanks to the generic btree library
 that XFS uses.

 That is, the generic btree library in XFS is used to implement the block mapping
 btree held in the data fork of the inode. Hence we can use the same btree
 implementation as the per-AG rmap btree, but simply add a couple of functions,
 set a couple of flags and host it in the inode data fork of a third per-zone
 inode to track the zone's owner information.

 == Mkfs

 Mkfs is going to have to integrate with the userspace zbc libraries to query the
 layout of zones from the underlying disk and then do some magic to lay out al
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
 it will need ot be packaged by distros.

 If mkfs cannot find ensough random write space for the amount of metadata we
 need to track all the space in the sequential write zones and a decent amount of
 internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
 vendors are going to need to provide sufficient space in these regions for us
 to be able to make use of it, otherwise we'll simply not be able to do what we
 need to do.

 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
 place and initialise the metadata device as well.

 == Repair

 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
 reference checking, but otherwise all the infrastructure we need ifor using
 bitmaps for verifying used space should already be there.

 THere be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
 jus tnot going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......

 == Quantification of Random Write Zone Capacity

 A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
 bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
 for free space bitmaps. We'll want to support at least 1 million inodes per TB,
 so that's another 512MB per TB, plus another 256MB per TB for directory
 structures. There's other bits and pieces of metadata as well (attribute space,
 internal freespace btrees, reverse map btrees, etc.

 So, at minimum we will probably need at least 2GB of random write space per TB
 of SMR zone data space. Plus a couple of GB for the journal if we want the easy
 option. For those drive vendors out there that are listening and want good
 performance, replace the CMR region with a SSD....

 == Kernel implementation

 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
 iteration of them is going to be similar to allocation groups. To get use going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).

 We can do simple things first - simply rotoring allocation across zones will get
 us moving very quickly, and then we can refine it once we have more than just a
 proof of concept prototype.

 Optimising data allocation for SMR is going to be tricky, and I hope to be able
 to leave that to drive vendor engineers....

 Ideally, we won't need a zbc interface in the kernel, except to erase zones.
 I'd like to see an interface that doesn't even require that. For example, we
 issue a discard (TRIM) on an entire  zone and that erases it and resets the write
 pointer. This way we need no new infrastructure at the filesystem layer to
 implement SMR awareness. In effect, the kernel isn't even aware that it's an SMR
 drive underneath it.

 == Problem cases

 There are a few elephants in the room.

 === Concurrent writes

 What happens when an application does concurrent writes into a file (either by
 threads or AIO), and allocation happens in the opposite order to the IO being
 dispatched. i.e., with a zone write pointer at block X, this happens:

 ----
 Task A			Task B
 write N			write N + 1
 allocate X
 			allocate X + 1
 submit_bio		submit_bio
 <blocks in Io stack>	IO to block X+1 dispatched.
 ----

 And so even though we allocated the IO in incoming order, the dispatch order was
 different.

 I don't see how the filesystem can prevent this from occurring, except to
 completely serialise IO to zone. i.e. while we have a block allocation and no
 write completion, no other allocations to that zone can take place. If that's
 the case, this is going to cause massive fragmentation and/or severe IO latency
 problems for any application that has this sort of IO engine.

 There is a block layer solution to this in the works - the block layer will
 track the write pointer in each zone and if it gets writes out of order it will
 requeue the IO at the tail of the queue, hence allowing the IO that has been
 delayed to be issued before the out of order write.

 === Crash recovery

 Write pointer location is undefined after power failure. It could be at an old
 location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.

 Hence before a filesystem runs journal recovery, all it's zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.

 Hence it's not clear if we want to do this in userspace as that has it's own
 problems e.g. we'd need to  have xfs.fsck detect that it's a smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.

 This needs more thought, because I have a nagging suspiscion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require
 reverse mapping lookups to solve.

 === Preallocation Issues

 Because we can only do sequential writes, we can only allocate space that
 exactly matches the write being performed. That means we *cannot preallocate
 extents*. The reason for this is that preallocation will physically separate the
 data write location from the zone write pointer. e.g. if we use preallocation to
 allocate space we are about to do random writes into to prevent fragmentation.
 We cannot do this on ZBC drives, we have to allocate specifically for the IO we
 are going to perform.

 As a result, we lose  almost all the existing mechanisms we use for preventing
 fragmentation. Speculative EOF preallocation with delayed allocation cannot be
 used, fallocate cannot be used to preallocate physical extents, and extent size
 hints cannot be used because they do "allocate around" writes.

 We're trying to do better without much investment in time and resources here, so
 the compromise is that we are going to have to rely on xfs_fsr to clean up
 fragmentation after the fact. Luckily, the other functions we need from xfs_fsr
 (zone cleaning) also act to defragment free space so we don't have to care about
 trading contiguous filesystem for free space fragmentation and that downward
 spiral.

 I suspect the best we will be able to do with fallocate based preallocation is
 to mark the region as delayed allocation.

 === Allocation Alignment

 With zone based write pointers, we lose all capability of write alignment to the
 underlying storage - our only choice to write is the current set of write
 pointers we have access to. There are several methods we could use to work
 around this problem (e.g. put a slab-like allocator on top of the zones) but
 that requires completely redesigning the allocators for SMR. Again, this may be a
 step too far....

 === RAID on SMR....

 How does RAID work with SMR, and exactly what does that look like to
 the filesystem?

 How does libzbc work with RAID given it is implemented through the scsi ioctl
 interface?

 How does RAID repair parity errors in place? Or does the RAID layer now need
 a remapping layer so the LBA or rewritten stripes remain the same? Indeed, how
 do we handle partial stripe writes which will require multiple parity block
 writes?

 What does the geometry look like (stripe unit, width) and what does the write
 pointer look like? How does RAID track all the necessary write pointers and keep
 them in sync? What about RAID1 with it's dirty region logging to minimise resync
 time and overhead?
	= SMR Layout Optimisation for XFS (v0.2, March 2015)
	Dave Chinner, <dchinner@redhat.com>

	== Overview

	This document describes a relatively simple way of modifying XFS using existing
	on disk structures to be able to use host-managed SMR drives.

	This assumes a userspace ZBC implementation such as libzbc will do all the heavy
	lifting work of laying out the structure of the filesystem, and that it will
	perform things like zone write pointer checking/resetting before the filesystem
	is mounted.

	== Concepts

	SMR is architected to have a set of sequentially written zones which don't allow
	out of order writes, nor do they allow overwrites of data already written in the
	zone. Zones are typically in the order of 256MB, though may actually be of
	variable size as physical geometry of the drives differ from inner to outer
	edges.

	SMR drives also typically have an outer section that is CMR technology - it
	allows random writes and overwrites to any area within those zones. Drive
	managed SMR devices use this region for internal metadata
	journalling for block remapping tables and as a staging area for data writes
	before being written out in sequential fashion ito zones after block remapping
	has been performed.

	Recent research has shown that 6TB seagate drives have a 20-25GB CMR zone,
	which is more than enough for our purposes. Information from other vendors
	indicate that some drives will have much more CMR, hence if we design for the
	known sizes in the Seagate drives we will be fine for other drives just coming
	onto the market right now.

	For host managed/aware drives, we are going to assume that we can use this area
	directly for filesystem metadata - for our own mapping tables and things like
	the journal, inodes, directories and free space tracking. We are also going to
	assume that we can find these regions easily in the ZBC information, and that
	they are going to be contiguous rather than spread all over the drive.

	XFS already has a data-only device call the "real time" device, whose free space
	information is tracked externally in bitmaps attached to inodes that exist in
	the "data" device. All filesystem metadata exists in the "data" device, except
	maybe the journal which can also be in an external device.

	A key constraint we need to work within here is that RAID on SMR drives is a
	long way off. The main use case is for bulk storage of data in the back end of
	distributed object stores (i.e. cat pictures on the intertubes) and hence a
	filesystem per drive is the typical configuration we'll be chasing here.
	Similarly, partitioning of SMR drives makes no sense for host aware drives,
	so we are going to constrain the architecture to a single drive for now.

	== Journal modifications

	Because the XFS journal is a sequentially written circular log, we can actually
	use SMR zones for it - it does not need to be in the metadata region. This
	requires a small amount of additional complexity - we can't wrap the log as we
	currnetly do, we'll need to split the log across two zones so that we can push
	the tail into the same zone as the head, then reset the now unused zone
	and then when the log wraps it can simply start again form the beginning of the
	erased zone.

	Like a normal spinning disk, we'll want to place the log in a pair of zones near
	the middle of the drive so that we minimise the worst case seek cost of a log
	write to half of a full disk seek. There may be advantage to putting it right
	next to the metadata zone, but typically metadata writes are not correlated with
	log writes.

	Hence the only real functionality we need to add to the log is the tail pushing
	modifications to move the tail into the same zone as the head, as well as being
	able to trigger and block on zone write pointer reset operations.

	The log doesn't actually need to track the zone write pointer, though log
	recovery will need to limit the recovery head to the current write pointer of
	the lead zone. Modifications here are limited to the function that finds the
	head of the log, and can actually be used to speed up the search algorithm.

	However, given the size of the CMR zones, we can host the journal in an
	unmodified manner inside the CMR zone and not have to worry about zone
	awareness. This is by far the simplest solution to the problem.

	== Data zones

	What we need is a mechanism for tracking the location of zones (i.e. start LBA),
	free space/write pointers within each zone, and some way of keeping track of
	that information across mounts. If we assign a real time bitmap/summary inode
	pair to each zone, we have a method of tracking free space in the zone. We can
	use the existing bitmap allocator with a small tweak (sequentially ascending,
	packed extent allocation only) to ensure that newly written blocks are allocated
	in a sane manner.

	We're going to need userspace to be able to see the contents of these inodes;
	read only access will be needed to analyse the contents of the zone, so we're
	going to need a special directory to expose this information. It would be useful
	to have a ".zones" directory hanging off the root directory that contains all
	the zone allocation inodes so userspace can simply open them.

	THis biggest issue that has come to light here is the number of zones in a
	device. Zones are typically 256MB in size, and so we are looking at 4,000
	zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if
	the devices keep getting larger at the expected rate, we're going to have to
	deal with zone counts in the hundreds of thousands. Hence a single flat
	directory containing all these inodes is not going to scale, nor will we be able
	to keep them all in memory at once.

	As a result, we are going to need to group the zones for locality and efficiency
	purposes, likely as "zone groups" of, say, up to 1TB in size. Luckily, by
	keeping the zone information in inodes the information can be demand paged and
	so we don't need to pin thousands of inodes and bitmaps in memory. Zone groups
	also have other benefits...

	While it seems like tracking free space is trivial for the purposes of
	allocation (and it is!), the complexity comes when we start to delete or
	overwrite data. Suddenly zones no longer contain contiguous ranges of valid
	data; they have "freed" extents in the middle of them that contain stale data.
	We can't use that "stale space" until the entire zone is made up of "stale"
	extents. Hence we need a Cleaner.

	=== Zone Cleaner

	The purpose of the cleaner is to find zones that are mostly stale space and
	consolidate the remaining referenced data into a new, contiguous zone, enabling
	us to then "clean" the stale zone and make it available for writing new data
	again.

	The real complexity here is finding the owner of the data that needs to be move,
	but we are in the process of solving that with the reverse mapping btree and
	parent pointer functionality. This gives us the mechanism by which we can
	quickly re-organise files that have extents in zones that need cleaning.

	The key word here is "reorganise". We have a tool that already reorganises file
	layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
	instead of trying to minimise fixpel fragments, it finds zones that need
	cleaning by reading their summary info from the /.zones/ directory and analysing
	the free bitmap state if there is a high enough percentage of stale blocks. From
	there we can use the reverse mapping to find the inodes that own the extents
	those zones. And from there, we can run the existing defrag code to rewrite the
	data in the file, thereby marking all the old blocks stale. This will make
	almost stale zones entirely stale, and hence then be able to be reset.

	Hence we don't actually need any major new data moving functionality in the
	kernel to enable this, except maybe an event channel for the kernel to tell
	xfs_fsr it needs to do some cleaning work.

	If we arrange zones into zoen groups, we also have a method for keeping new
	allocations out of regions we are re-organising. That is, we need to be able to
	mark zone groups as "read only" so the kernel will not attempt to allocate from
	them while the cleaner is running and re-organising the data within the zones in
	a zone group. This ZG also allows the cleaner to maintain some level of locality
	to the data that it is re-arranging.

	=== Reverse mapping btrees

	One of the complexities is that the current reverse map btree is a per
	allocation group construct. This means that, as per the current design and
	implementation, it will not work with the inode based bitmap allocator. This,
	however, is not actually a major problem thanks to the generic btree library
	that XFS uses.

	That is, the generic btree library in XFS is used to implement the block mapping
	btree held in the data fork of the inode. Hence we can use the same btree
	implementation as the per-AG rmap btree, but simply add a couple of functions,
	set a couple of flags and host it in the inode data fork of a third per-zone
	inode to track the zone's owner information.

	== Mkfs

	Mkfs is going to have to integrate with the userspace zbc libraries to query the
	layout of zones from the underlying disk and then do some magic to lay out al
	the necessary metadata correctly. I don't see there being any significant
	challenge to doing this, but we will need a stable libzbc API to work with and
	it will need ot be packaged by distros.

	If mkfs cannot find ensough random write space for the amount of metadata we
	need to track all the space in the sequential write zones and a decent amount of
	internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
	vendors are going to need to provide sufficient space in these regions for us
	to be able to make use of it, otherwise we'll simply not be able to do what we
	need to do.

	mkfs will need to initialise all the zone allocation inodes, reset all the zone
	write pointers, create the /.zones directory, place the log in an appropriate
	place and initialise the metadata device as well.

	== Repair

	Because we've limited the metadata to a section of the drive that can be
	overwritten, we don't have to make significant changes to xfs_repair. It will
	need to be taught about the multiple zone allocation bitmaps for it's space
	reference checking, but otherwise all the infrastructure we need ifor using
	bitmaps for verifying used space should already be there.

	THere be dragons waiting for us if we don't have random write zones for
	metadata. If that happens, we cannot repair metadata in place and we will have
	to redesign xfs_repair from the ground up to support such functionality. That's
	jus tnot going to happen, so we'll need drives with a significant amount of
	random write space for all our metadata......

	== Quantification of Random Write Zone Capacity

	A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
	bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
	for free space bitmaps. We'll want to support at least 1 million inodes per TB,
	so that's another 512MB per TB, plus another 256MB per TB for directory
	structures. There's other bits and pieces of metadata as well (attribute space,
	internal freespace btrees, reverse map btrees, etc.

	So, at minimum we will probably need at least 2GB of random write space per TB
	of SMR zone data space. Plus a couple of GB for the journal if we want the easy
	option. For those drive vendors out there that are listening and want good
	performance, replace the CMR region with a SSD....

	== Kernel implementation

	The allocator will need to learn about multiple allocation zones based on
	bitmaps. They aren't really allocation groups, but the initialisation and
	iteration of them is going to be similar to allocation groups. To get use going
	we can do some simple mapping between inode AG and data AZ mapping so that we
	keep some form of locality to related data (e.g. grouping of data by parent
	directory).

	We can do simple things first - simply rotoring allocation across zones will get
	us moving very quickly, and then we can refine it once we have more than just a
	proof of concept prototype.

	Optimising data allocation for SMR is going to be tricky, and I hope to be able
	to leave that to drive vendor engineers....

	Ideally, we won't need a zbc interface in the kernel, except to erase zones.
	I'd like to see an interface that doesn't even require that. For example, we
	issue a discard (TRIM) on an entire zone and that erases it and resets the write
	pointer. This way we need no new infrastructure at the filesystem layer to
	implement SMR awareness. In effect, the kernel isn't even aware that it's an SMR
	drive underneath it.

	== Problem cases

	There are a few elephants in the room.

	=== Concurrent writes

	What happens when an application does concurrent writes into a file (either by
	threads or AIO), and allocation happens in the opposite order to the IO being
	dispatched. i.e., with a zone write pointer at block X, this happens:

	----
	Task A Task B
	write N write N + 1
	allocate X
	allocate X + 1
	submit_bio submit_bio
	<blocks in Io stack> IO to block X+1 dispatched.
	----

	And so even though we allocated the IO in incoming order, the dispatch order was
	different.

	I don't see how the filesystem can prevent this from occurring, except to
	completely serialise IO to zone. i.e. while we have a block allocation and no
	write completion, no other allocations to that zone can take place. If that's
	the case, this is going to cause massive fragmentation and/or severe IO latency
	problems for any application that has this sort of IO engine.

	There is a block layer solution to this in the works - the block layer will
	track the write pointer in each zone and if it gets writes out of order it will
	requeue the IO at the tail of the queue, hence allowing the IO that has been
	delayed to be issued before the out of order write.

	=== Crash recovery

	Write pointer location is undefined after power failure. It could be at an old
	location, the current location or anywhere in between. The only guarantee that
	we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
	least be in a position at or past the location of the fsync.

	Hence before a filesystem runs journal recovery, all it's zone allocation write
	pointers need to be set to what the drive thinks they are, and all of the zone
	allocation beyond the write pointer need to be cleared. We could do this during
	log recovery in kernel, but that means we need full ZBC awareness in log
	recovery to iterate and query all the zones.

	Hence it's not clear if we want to do this in userspace as that has it's own
	problems e.g. we'd need to have xfs.fsck detect that it's a smr filesystem and
	perform that recovery, or write a mount.xfs helper that does it prior to
	mounting the filesystem. Either way, we need to synchronise the on-disk
	filesystem state to the internal disk zone state before doing anything else.

	This needs more thought, because I have a nagging suspiscion that we need to do
	this write pointer resynchronisation after log recovery has completed so we
	can determine if we've got to now go and free extents that the filesystem has
	allocated and are referenced by some inode out there. This, again, will require
	reverse mapping lookups to solve.

	=== Preallocation Issues

	Because we can only do sequential writes, we can only allocate space that
	exactly matches the write being performed. That means we *cannot preallocate
	extents*. The reason for this is that preallocation will physically separate the
	data write location from the zone write pointer. e.g. if we use preallocation to
	allocate space we are about to do random writes into to prevent fragmentation.
	We cannot do this on ZBC drives, we have to allocate specifically for the IO we
	are going to perform.

	As a result, we lose almost all the existing mechanisms we use for preventing
	fragmentation. Speculative EOF preallocation with delayed allocation cannot be
	used, fallocate cannot be used to preallocate physical extents, and extent size
	hints cannot be used because they do "allocate around" writes.

	We're trying to do better without much investment in time and resources here, so
	the compromise is that we are going to have to rely on xfs_fsr to clean up
	fragmentation after the fact. Luckily, the other functions we need from xfs_fsr
	(zone cleaning) also act to defragment free space so we don't have to care about
	trading contiguous filesystem for free space fragmentation and that downward
	spiral.

	I suspect the best we will be able to do with fallocate based preallocation is
	to mark the region as delayed allocation.

	=== Allocation Alignment

	With zone based write pointers, we lose all capability of write alignment to the
	underlying storage - our only choice to write is the current set of write
	pointers we have access to. There are several methods we could use to work
	around this problem (e.g. put a slab-like allocator on top of the zones) but
	that requires completely redesigning the allocators for SMR. Again, this may be a
	step too far....

	=== RAID on SMR....

	How does RAID work with SMR, and exactly what does that look like to
	the filesystem?

	How does libzbc work with RAID given it is implemented through the scsi ioctl
	interface?

	How does RAID repair parity errors in place? Or does the RAID layer now need
	a remapping layer so the LBA or rewritten stripes remain the same? Indeed, how
	do we handle partial stripe writes which will require multiple parity block
	writes?

	What does the geometry look like (stripe unit, width) and what does the write
	pointer look like? How does RAID track all the necessary write pointers and keep
	them in sync? What about RAID1 with it's dirty region logging to minimise resync
	time and overhead?