| #LyX 1.6.7 created this file. For more info see http://www.lyx.org/ |
| \lyxformat 345 |
| \begin_document |
| \begin_header |
| \textclass article |
| \use_default_options true |
| \language english |
| \inputencoding auto |
| \font_roman default |
| \font_sans default |
| \font_typewriter default |
| \font_default_family default |
| \font_sc false |
| \font_osf false |
| \font_sf_scale 100 |
| \font_tt_scale 100 |
| |
| \graphics default |
| \paperfontsize default |
| \use_hyperref false |
| \papersize default |
| \use_geometry false |
| \use_amsmath 1 |
| \use_esint 1 |
| \cite_engine basic |
| \use_bibtopic false |
| \paperorientation portrait |
| \secnumdepth 3 |
| \tocdepth 3 |
| \paragraph_separation indent |
| \defskip medskip |
| \quotes_language english |
| \papercolumns 1 |
| \papersides 1 |
| \paperpagestyle default |
| \tracking_changes false |
| \output_changes false |
| \author "" |
| \author "" |
| \end_header |
| |
| \begin_body |
| |
| \begin_layout Title |
| Btrfs Subvolume Quota Groups |
| \end_layout |
| |
| \begin_layout Author |
| Arne Jansen, Strato AG |
| \end_layout |
| |
| \begin_layout Date |
| October 2011 |
| \end_layout |
| |
| \begin_layout Section |
| Subvolume Quota Concepts |
| \end_layout |
| |
| \begin_layout Standard |
| The concept of quota has a long standing tradition in the unix world. |
| Ever since computers allow multiple users to work simultaneously in one |
| filesystem, there's the need to prevent one user from using up the full |
| space. |
| Every user should get his fair share of the available resources. |
| \end_layout |
| |
| \begin_layout Standard |
| In case of files the solution is quite straightforward. |
| Each file has an 'owner' recorded along with it, and it has a size. |
| Traditional quota just restricts the total size of all files that are owned |
| by a user. |
| The concept is quite flexible: if a user hits his quota limit, the administrato |
| r can raise it on the fly. |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand the traditional approach has only a poor solution to restrict |
| directories. |
| At installation time the harddisk can be partitioned so that every directory |
| (e.g. |
| /usr, /var, ...) that needs a limit gets its own partition. |
| The obvious problem is, that those limits can't be changed without a reinstall. |
| The btrfs subvolume feature builds a bridge. |
| Subvolumes correspond in many ways to partitions as every subvolume looks |
| like its own filesystem. |
| With subvolume quota, it is now possible to restrict each subvolume like |
| a partition, but keep the flexibility of quota. |
| The space for each subvolume can be expanded or restricted on the fly. |
| \end_layout |
| |
| \begin_layout Standard |
| As subvolumes are the basis for snapshots, interesting questions arise as |
| to how to account used space in the presence of snapshots. |
| If you have a file shared between a subvolume and a snapshot, whom to account |
| the file to? The creator? Both? What if the file gets modified in the snapshot, |
| should only these changes be accounted to it? But wait, both the snapshot |
| and the subvolume belong to the same user home. |
| I just want to limit the total space used by both! But somebody else might |
| not want to charge the snapshots to the users. |
| \end_layout |
| |
| \begin_layout Standard |
| Btrfs subvolume quota solves these problems by introducing groups of subvolumes |
| and let the user put limits on them. |
| It is even possible to have groups of groups. |
| In the following we refer to them as |
| \begin_inset Quotes eld |
| \end_inset |
| |
| qgroups |
| \begin_inset Quotes erd |
| \end_inset |
| |
| . |
| Each qgroup primarily tracks two numbers, the amount of total referenced |
| space and the amount of exclusively referenced space. |
| \end_layout |
| |
| \begin_layout Standard |
| |
| \emph on |
| Referenced |
| \emph default |
| space is the amount of data that can be reached from any of the subvolumes |
| contained in the qgroup, while |
| \emph on |
| exclusive |
| \emph default |
| is the amount of data where all references to this data can be reached |
| from within this qgroup. |
| \end_layout |
| |
| \begin_layout Section |
| Subvolume Quota Groups |
| \end_layout |
| |
| \begin_layout Standard |
| The basic notion of the Subvolume Quota feature is the qouta group, short |
| qgroup. |
| Qgroups are notated as <level>/<id>, e.g. |
| the qgroup 3/2 is a qgroup of level 3. |
| For level 0 the leading |
| \begin_inset Quotes eld |
| \end_inset |
| |
| 0/ |
| \begin_inset Quotes erd |
| \end_inset |
| |
| can be omitted. |
| Qgroups of level 0 get created automatically when a subvolume/snapshot |
| gets created. |
| The ID of the qgroup corresponds to the ID of the subvolume, so 0/5 is |
| the qgroup for the root subvolume. |
| For the |
| \begin_inset Quotes eld |
| \end_inset |
| |
| btrfs qgroup |
| \begin_inset Quotes erd |
| \end_inset |
| |
| command, the path to the subvolume can also be used instead of 0/<ID>. |
| For all higher levels, the ID can be choosen freely. |
| \end_layout |
| |
| \begin_layout Standard |
| Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy |
| of qgroups. |
| Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups1.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| Sample qgroup hierarchy |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| shows an example qgroup tree. |
| At the bottom some extents are depicted showing which qgroups reference |
| which extents. |
| It is important to understand the notion of |
| \emph on |
| referenced |
| \emph default |
| versus |
| \emph on |
| exclusive |
| \emph default |
| . |
| In the example qgroup 0/2 references extents 2 and 3, while 1/2 references |
| extents 2-4. |
| 2/1 references all extents. |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to |
| 0/2, while extent 3 is neither exclusive to 0/2 nor to 0/3. |
| But because both references can be reached from 1/2, extent 3 is exclusive |
| to 1/2. |
| All extents are exclusive to 2/1. |
| So |
| \emph on |
| exclusive |
| \emph default |
| doesn't mean there's no other way to reach the extent, but it does mean |
| that if you delete all subvolumes contained in a qgroup, the extent will |
| get deleted. |
| |
| \emph on |
| Exclusive |
| \emph default |
| of a qgroup conveys the useful information how much space will be freed |
| in case all subvolumes of the qgroup get deleted. |
| \end_layout |
| |
| \begin_layout Standard |
| All data extents are accounted this way. |
| Metadata that belong to a specific subvolume (i.e. |
| its fs tree) are also accounted. |
| Checksums and extent allocation information are not accounted. |
| \end_layout |
| |
| \begin_layout Standard |
| In turn the |
| \emph on |
| referenced |
| \emph default |
| count of a qgroup can be limited. |
| All writes beyond that limit will lead to a |
| \begin_inset Quotes eld |
| \end_inset |
| |
| Quota Exceeded |
| \begin_inset Quotes erd |
| \end_inset |
| |
| error. |
| \end_layout |
| |
| \begin_layout Section |
| Inheritance |
| \end_layout |
| |
| \begin_layout Standard |
| Things get a bit more complicated when new subvolumes or snapshots are created. |
| The case of (empty) subvolumes is still quite easy. |
| If a subvolume should be part of a qgroup, it has to be added to the qgroup |
| at creation time. |
| To add it at a later time, it would be necessary to at least rescan the |
| full subvolume for a proper accounting. |
| \end_layout |
| |
| \begin_layout Standard |
| Creation of a snapshot is the hard case. |
| Obviously the snapshot will reference the exact amount of space as its |
| source, and both source and destination now have an |
| \emph on |
| exclusive |
| \emph default |
| count of 0 (4k to be precise, as the roots of the trees are not shared). |
| But what about qgroups of higher levels? If the qgroup contains both the |
| source and the destination, nothing changes. |
| If the qgroup contains only the source, it might lose some |
| \emph on |
| exclusive |
| \emph default |
| . |
| But how much? The tempting answer is, |
| \begin_inset Quotes eld |
| \end_inset |
| |
| subtract all |
| \emph on |
| exclusive |
| \emph default |
| of the source from the qgroup |
| \begin_inset Quotes erd |
| \end_inset |
| |
| , but that's wrong, or at least not enough. |
| There could have been an extent that's referenced from the source and another |
| subvolume from that qgroup. |
| This extent would have been exclusive to the qgroup, but not to the source |
| subvolume. |
| With the creation of the snapshot the qgroup would also lose this extent |
| from its exclusive set. |
| \end_layout |
| |
| \begin_layout Standard |
| So how can this problem be solved? In the instant the snapshot gets created |
| we already have to know the correct |
| \emph on |
| exclusive |
| \emph default |
| count. |
| We need to have a second qgroup that contains all the subvolumes as the |
| first qgroup, except the subvolume we want to snapshot. |
| The moment we create the snapshot, the |
| \emph on |
| exclusive |
| \emph default |
| count from the second qgroup needs to be copied to the first qgroup, as |
| it represents the correct value. |
| The second qgroup is called a tracking qgroup. |
| It is only there in case a snapshot is needed. |
| \end_layout |
| |
| \begin_layout Section |
| Use Cases |
| \end_layout |
| |
| \begin_layout Subsection |
| single-user machine |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Replacement for partitions |
| \end_layout |
| |
| \begin_layout Standard |
| The simplest use case is to use qgroups as simple replacement for partitions. |
| Btrfs takes the disk as a whole, and /, /usr, /var etc. |
| are created as subvolumes. |
| As each subvolume gets it own qgroup automatically, they can simply be |
| restricted. |
| No hierarchy is needed for that. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Track usage of snapshots |
| \end_layout |
| |
| \begin_layout Standard |
| When a snapshot is taken, a qgroup for it will automatically be created |
| with the correct values. |
| |
| \emph on |
| Referenced |
| \emph default |
| will show how much is in it, possibly shared with other subvolumes. |
| |
| \emph on |
| Exclusive |
| \emph default |
| will be the amount of space that gets freed when the subvolume is deleted. |
| \end_layout |
| |
| \begin_layout Subsection |
| Multi-user machine / Hosting |
| \end_layout |
| |
| \begin_layout Subsubsection |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "sub:restricting-homes" |
| |
| \end_inset |
| |
| Restricting homes |
| \end_layout |
| |
| \begin_layout Standard |
| When you have several users on a machine, with home dirs probably under |
| /home, you might want to restrict /home as a whole, while restricting every |
| user to an indiviual limit. |
| This is easily accomplished by creating a qgroup for /home, e.g. |
| 1/1, and assigning all user-subvols to it. |
| Restricting this qgroup will limit /home, while every user subvol can get |
| it's own (lower) limit. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "sub:accounting-snapshots-to" |
| |
| \end_inset |
| |
| Accounting snapshots to the user |
| \end_layout |
| |
| \begin_layout Standard |
| Let's say the user is allowed to create snapshots via some mechanism. |
| It would only be fair to account space used by the snapshots to the user. |
| This doesn't mean the user doubles his usage as soon as he takes a snapshot. |
| Of course files that are present in his home and the snapshot should only |
| be accounted once. |
| This can be accomplished by creating a qgroup for each user, say 1/<uid>. |
| The user home and all snapshots are assigned to this qgroup. |
| Limiting it will extend the limit to all snapshots, counting files only |
| once. |
| To limit /home as a whole, a higher level group 2/1 replacing 1/1 from |
| the previous example is needed, with all user-qgroups assigned to it. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Don't account snapshots |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand, when the snapshots get created automatically, the user |
| has no chance to control them, so the space used by them should not be |
| accounted to him. |
| This is already the case when creating snapshots in example |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "sub:restricting-homes" |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Snapshots for backup purposes |
| \end_layout |
| |
| \begin_layout Standard |
| This szenario is a mixture of the previous two. |
| The user can create snapshots, but some snapshots for backup purposes are |
| being created by the system. |
| User's snapshots should be accounted to the user, system's not. |
| The solution is similar to |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "sub:accounting-snapshots-to" |
| |
| \end_inset |
| |
| , but don't assign system snapshots to users qgroup. |
| \end_layout |
| |
| \begin_layout Section |
| Implementation |
| \end_layout |
| |
| \begin_layout Subsection |
| Update algorithm |
| \end_layout |
| |
| \begin_layout Standard |
| The update algorithm is the core of the quota implementation. |
| Whenever a reference is added or removed, the update algorithm is called. |
| \end_layout |
| |
| \begin_layout Standard |
| The algorithm is called with the address of the extent for which to add/remove |
| the reference, the root of the reference, the amount of space to add/remove, |
| and of course the operation to perform. |
| \end_layout |
| |
| \begin_layout Standard |
| A call could look like this |
| \end_layout |
| |
| \begin_layout LyX-Code |
| qgroup_record_ref(ref_root, start, num_bytes, operation); |
| \end_layout |
| |
| \begin_layout Standard |
| In fact these parameters are all contained in the delayed ref structure, |
| so just the delayed ref node is passed instead. |
| This function gets called from the central point where backrefs are added |
| to the filesystem, btrfs_add_delayed_*_ref. |
| \end_layout |
| |
| \begin_layout Standard |
| The algorithm works in multiple steps: |
| \end_layout |
| |
| \begin_layout Enumerate |
| Find all referencing roots |
| \end_layout |
| |
| \begin_layout Enumerate |
| Calculate refcnt for all qgroups |
| \end_layout |
| |
| \begin_layout Enumerate |
| Tag qgroups |
| \end_layout |
| |
| \begin_layout Enumerate |
| Update |
| \emph on |
| exclusive |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Find all referencing roots |
| \end_layout |
| |
| \begin_layout Standard |
| The first step is to find all roots that are currently referencing the extent. |
| Though btrfs is fully back-referenced, this step is not as easy as it may |
| seem, because of the lazy refcounting scheme. |
| |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename 5roots.svg |
| scale 50 |
| |
| \end_inset |
| |
| |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Extent-with-lazy" |
| |
| \end_inset |
| |
| Extent with lazy references |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| The back references that are recorded for the extent may not tell the full |
| truth. |
| In figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Extent-with-lazy" |
| |
| \end_inset |
| |
| a tree is depicted where the actual extent only has 2 back references recorded, |
| whereas there are 5 roots referencing it. |
| |
| \end_layout |
| |
| \begin_layout Standard |
| The solution is to walk up the tree and follow all back references until |
| all roots are found. |
| This looks like a classic problem for a recursive tree walk, but recursion |
| here is not possible for two reasons: |
| \end_layout |
| |
| \begin_layout Enumerate |
| The code runs in kernel space with very limited stack space. |
| With a recursion the stack may overflow. |
| \end_layout |
| |
| \begin_layout Enumerate |
| To follow a back reference, the referenced extent has to be searched. |
| This is due to the nature of the indirect back references used. |
| These back references point to a key in the tree, not to the address of |
| an extent. |
| \end_layout |
| |
| \begin_layout Standard |
| The code solves this by keeping two lists, one for all roots found and one |
| for all backrefs to follow. |
| Initially, the list of roots is empty, while the list of backrefs is filled |
| with only one item, the reference to the extent for which all backrefs |
| are to be found. |
| \end_layout |
| |
| \begin_layout Standard |
| The following pseudo-code describes how all roots are found: |
| \end_layout |
| |
| \begin_layout LyX-Code |
| foreach ref (0 ... |
| #refs in ulist) |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| find extent for ref |
| \end_layout |
| |
| \begin_layout LyX-Code |
| add all refs for extent to ulist |
| \end_layout |
| |
| \begin_layout LyX-Code |
| if (extent is root) |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| add root to ulist of roots |
| \end_layout |
| |
| \begin_layout LyX-Code |
| |
| \end_layout |
| |
| \end_deeper |
| \end_deeper |
| \begin_layout Standard |
| The lists here are called 'ulists', because they only accept new items if |
| they are not already in the list, i.e. |
| if they are unique. |
| \end_layout |
| |
| \begin_layout Standard |
| The step to add all backrefs for an extent involves finding all recorded |
| inline backrefs, all in-tree backrefs and all delayed refs for the extent |
| up to the moment the algorithms starts to run. |
| Because this code might run some time, new delayed refs for any extent |
| in the tree might be added in the meantime. |
| To avoid a race condition here, each delayed ref gets a sequence number. |
| Only delayed refs with seq < own seq are considered. |
| Also, no delayed ref with a higher seq than own seq must be run while the |
| roots are searched for. |
| \end_layout |
| |
| \begin_layout Standard |
| The code will never include the reference to add/delete. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Calculate refcnt for all qgroups |
| \end_layout |
| |
| \begin_layout Standard |
| After the list of referencing roots is known, the next 3 steps all operate |
| on the qgroup hierarchy. |
| A sample hierarchy is depicted in figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| . |
| \end_layout |
| |
| \begin_layout Standard |
| The first operation on the tree is to calculate the number of references |
| that can be reached from every given qgroup. |
| This is done by walking the tree upwards from every root found in the previous |
| step and incrementing a count on each qgroup visited, where each root can |
| only increment the count by one for every qgroup it can reach, even if |
| it can reach it by several paths. |
| The calculated count is called the refcnt. |
| \end_layout |
| |
| \begin_layout Standard |
| As in the previous step, the tree is walked iteratively with the help of |
| ulists to avoid recursion. |
| Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| depicts the state after this step is done for extent 3, where the ref from |
| 0/2 should get deleted. |
| The Figure omits the fs-trees and their roots, as qgroups of level 0 directly |
| correspond to a root. |
| |
| \end_layout |
| |
| \begin_layout Standard |
| As the refcnt is part of the qgroup struct, the algorithm would require |
| that all refcnts in all qgroups be set to zero before it can run. |
| To avoid this, a global sequence number is used to determine the refcnt. |
| Only one thread at a time can currently do refcounting on the tree (that |
| is easily changed should it impose a limit). |
| This thread grabs the next sequence number and walks up the tree. |
| If the refcnt of the visited qgroup is smaller than the seq, it is not |
| yet set and known to be 0. |
| Otherwise, it is incremented. |
| After the algorithm has run, the global sequence number is incremented |
| by the max refcnt found. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups2.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:qgroup-tree-after" |
| |
| \end_inset |
| |
| qgroup tree after refcnt augmentation for extent 3 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Tag qgroups |
| \end_layout |
| |
| \begin_layout Standard |
| The next step is to walk up the tree again, but this time starting with |
| ref_root, the root to add/remove. |
| Remember that the previous step doesn't include the ref_root. |
| Every qgroup that is being visited on the way up will be tagged in preparation |
| for the next step. |
| Additionally, under certain conditions a first adjustment is made to the |
| values of the visited qgroups. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the operation is to add a reference, this means |
| this qgroup is not yet referencing this extent, but after the operation |
| it will, so the |
| \emph on |
| referenced |
| \emph default |
| value of the qgroup is increased by num_bytes. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the operation is to remove a reference, this means |
| this qgroup is currently referencing the extent, but through the operation |
| it will lose its last reference, so the |
| \emph on |
| referenced |
| \emph default |
| value is decreased by num_bytes. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the number of roots found in the first step is |
| also zero, this means: |
| \end_layout |
| |
| \begin_layout Itemize |
| in case of addition: the added reference will be the only reference, so |
| |
| \emph on |
| exclusive |
| \emph default |
| of the qgroup is increased by num_bytes |
| \end_layout |
| |
| \begin_layout Itemize |
| in case of removal: the reference is the last to remove, which means it |
| is currently exclusive to ref_root, so |
| \emph on |
| exclusive |
| \emph default |
| of the qgroup is decreased by num_bytes |
| \end_layout |
| |
| \begin_layout Standard |
| Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| depicts the situation given the reference for 0/2 to extent 3 is to be |
| deleted. |
| |
| \emph on |
| Referenced |
| \emph default |
| of 0/2 and 1/1 will get decreased. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups3.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| qgroup tree after tagging |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Update |
| \emph on |
| exclusive |
| \end_layout |
| |
| \begin_layout Standard |
| The last step adjusts the |
| \emph on |
| exclusive |
| \emph default |
| counts of all untagged qgroups. |
| The |
| \emph on |
| exclusive |
| \emph default |
| counts of the tagged qgroups already got adjusted in the previous step. |
| All roots from step 1 are walked again, tagged qgroups are skipped. |
| If the refcnt equals the number of roots found in step one, |
| \emph on |
| exclusive |
| \emph default |
| gets increased if the ref is to be removed and decreased otherwise. |
| Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:update-of-exclusive" |
| |
| \end_inset |
| |
| shows the outcome of this step. |
| Extent 3 is now |
| \emph on |
| exclusive |
| \emph default |
| to 0/3. |
| All other |
| \emph on |
| exclusives |
| \emph default |
| are untouched. |
| Extent 3 was exclusive to 1/2 and 2/1 and still is, while it wasn't exclusive |
| to 0/2 and 1/1 and still isn't. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups4.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:update-of-exclusive" |
| |
| \end_inset |
| |
| update of |
| \emph on |
| exclusive |
| \emph default |
| on qgroup tree |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsection |
| Tracking Groups |
| \end_layout |
| |
| \begin_layout Standard |
| As seen in the introductory chapter, when taking a snapshot the values of |
| several qgroups might need to be adjusted. |
| This is easiest to see when looking at some examples. |
| Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Tracking-snapshots" |
| |
| \end_inset |
| |
| shows a simple example where tracking groups are needed. |
| |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking1.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Tracking-snapshots" |
| |
| \end_inset |
| |
| Tracking snapshots |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| The exercise is to track |
| \emph on |
| referenced |
| \emph default |
| and |
| \emph on |
| exclusive |
| \emph default |
| for all snapshots of a subvolume. |
| The gray qgroups 0/2-0/4 are all snapshot of 0/1. |
| Before 0/4 is created, 1/2 contains 0/2 and 0/3. |
| The moment 0/4 gets created, it is added to 1/2. |
| The |
| \emph on |
| exclusive |
| \emph default |
| count of 1/2 will not change, as all extents that become reachable from |
| 1/2 are also reachable from 1/1. |
| More problematic is the |
| \emph on |
| referenced |
| \emph default |
| count, as not all extents from 0/4 might be new to 1/2. |
| The solution is to add another qgroup, 1/3, that tracks 0/1 and all subvolumes |
| of it (Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:A-tracking-qgroup" |
| |
| \end_inset |
| |
| ). |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking2.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:A-tracking-qgroup" |
| |
| \end_inset |
| |
| A tracking qgroup is needed for 1/2 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| The moment the snapshot gets created, 1/3 holds the correct |
| \emph on |
| referenced |
| \emph default |
| count for all snapshots. |
| To get 1/2 back to the correct values, |
| \emph on |
| referenced |
| \emph default |
| from 1/3 has to be copied to 1/2, while |
| \emph on |
| exclusive |
| \emph default |
| of 1/2 stays untouched. |
| \end_layout |
| |
| \begin_layout Standard |
| In the next step, we want to take a snapshot of 0/2. |
| The resulting snapshot should not be part of 1/2. |
| This poses another problem: while |
| \emph on |
| referenced |
| \emph default |
| does not change, |
| \emph on |
| exclusive |
| \emph default |
| needs to be corrected. |
| For this, we need another tracking group, 1/4 (Figure |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:A-snapshot-of" |
| |
| \end_inset |
| |
| ). |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking3.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:A-snapshot-of" |
| |
| \end_inset |
| |
| A snapshot of 0/2 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| When 0/5 is created, |
| \emph on |
| exclusive |
| \emph default |
| from 1/4 needs to be copied to 1/2. |
| Snapshotting 0/2 also invalidates the |
| \emph on |
| exclusive |
| \emph default |
| of 1/3. |
| Also, another snapshot of 0/1 would invalidate 1/4. |
| So one more tracking groups is needed, containing 0/1, 0/4 and 0/3. |
| \end_layout |
| |
| \begin_layout Standard |
| It is planned that the btrfs userland utility will keep track of the needed |
| tracking groups and takes care that all the necessary copies happen. |
| For this, a format needs to be found how a user can describe what snapshots |
| he intends to take. |
| Keeping tracking groups for all possible combinations would lead to an |
| exponential number of tracking groups. |
| \end_layout |
| |
| \begin_layout Subsection |
| On-disk quota tree layout |
| \end_layout |
| |
| \begin_layout Standard |
| Qgroups add a new tree, the quota tree. |
| 4 new keys are used in this tree. |
| The overall status is recorded in a status item, each qgroup has 2 items, |
| one to record the user configured limits and one to record the current |
| |
| \emph on |
| referenced |
| \emph default |
| / |
| \emph on |
| exclusive |
| \emph default |
| counts. |
| Each parent/child-relationship between qgroups gets 2 qgroup_relation items, |
| one per direction. |
| The on-disk structure is still preliminary. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the overall state of the qgroups. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * There's only one instance of this key present, |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (0, BTRFS_QGROUP_STATUS_KEY, 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_KEY 240 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the currently used space of the qgroup. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * One key per qgroup, (0, BTRFS_QGROUP_INFO_KEY, qgroupid). |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_INFO_KEY 242 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Contains the user configured limits for the qgroup. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * One key per qgroup, (0, BTRFS_QGROUP_LIMIT_KEY, qgroupid). |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_KEY 244 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the child-parent relationship of qgroups. |
| For |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * each relation, 2 keys are present: |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (childid, BTRFS_QGROUP_RELATION_KEY, parentid) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (parentid, BTRFS_QGROUP_RELATION_KEY, childid) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_RELATION_KEY 246 |
| \end_layout |
| |
| \begin_layout Standard |
| The keys are chosen in a way that first comes the STATUS_KEY, followed by |
| all INFO_KEYs, followed by all LIMIT_KEYs. |
| After that, for each qgroup present all relations follow. |
| Only the INFO_KEYs and the STATUS_KEY get updated regularly. |
| The idea is that those keys stay close to each other, to minimize writes. |
| The RELATION_KEY is chosen in a way that by a simple enumeration all children |
| and parents for a given qgroup can be found. |
| The qgroupid is composed of a 16 bit 'level' field, followed by a 48 bit |
| 'id' field. |
| A qgroupid is represented as level/id, e.g. |
| 2/100. |
| In the case of a subvolume, the level is 0, and the 'id' is just the internal |
| tree objectid (5 or >= 256). |
| On the command line, the user will be able to use the subvolume-path as |
| the identifier. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * is subvolume quota turned on? |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_ON (1ULL << 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * SCANNING is set during the initialization phase |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_SCANNING (1ULL << 1) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Some qgroup entries are known to be out of date, |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * either because the configuration has changed in a way that |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * makes a rescan necessary, or because the fs has been mounted |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * with a non-qgroup-aware version. |
| |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Turning qouta off and on again makes it inconsistent, too. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT (1ULL << 2) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_VERSION 1 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_status_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| __le64 version; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * the generation is updated during every commit. |
| As older |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * versions of btrfs are not aware of qgroups, it will be |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * possible to detect inconsistencies by checking the |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * generation on mount time |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 generation; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* flag definitions see above */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 flags; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * only used during scanning to record the progress |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * of the scan. |
| It contains a logical address |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 scan; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout LyX-Code |
| |
| \end_layout |
| |
| \begin_layout Standard |
| Instead of hosting the scan cursor in the structure, one could also make |
| a separate key instead that is only present during scanning. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_info_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * only updated when any of the other values change |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 generation; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rfer; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rfer_cmpr; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 excl; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 excl_cmpr; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout Standard |
| For all uncompressed data the same value will be recorded for compressed |
| and uncompressed. |
| The *_cmpr values represent the amount of disk space used, the other values |
| the amount of space from a user perspective. |
| The uncompressed values are hard to get, so a first version might not support |
| them yet and just record the on-disk values instead. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* flags definition for qgroup limits */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_MAX_EXCL (1ULL << 1) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RSV_RFER (1ULL << 2) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RSV_EXCL (1ULL << 3) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RFER_CMPR (1ULL << 4) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_EXCL_CMPR (1ULL << 5) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_limit_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| __le64 flags; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 max_referenced; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 max_exclusive; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rsv_referenced; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rsv_exclusive; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout Standard |
| The flags record which of the limits are to be enforced. |
| The last two flags indicate whether the compressed or the uncompressed |
| value is to limit. |
| This structure also contains reservations, though they might be hard to |
| implement, as btrfs has no clear understanding of how much free space is |
| left. |
| A straightforward implementation might be very inaccurate and the first |
| version will probably not implement it. |
| Those values are nevertheless included here as a means for future expansion. |
| |
| \end_layout |
| |
| \begin_layout Subsection |
| Estimation |
| \end_layout |
| |
| \begin_layout Standard |
| In btrfs, each file operation is encapsulated into a transaction. |
| All necessary space for the transaction has to be reserved before any modificat |
| ion is done to the structures, as there's no way to back out in the middle. |
| That's what block reserves are used for. |
| \end_layout |
| |
| \begin_layout Standard |
| The same holds for quota: it's not possible to deny an operation in the |
| middle of it. |
| The only point where an EDQUOT (Quota exceeded) error can be generated |
| is before the operation start. |
| The easiest way would be to only deny it if one of the affected qgroups |
| is already over quota, but that would allow large operations to exceed |
| the quota by far. |
| This implementation tries to estimate the needed space for the operation |
| and reserves it at operation start. |
| If the reservation fails, the operation is denied. |
| \end_layout |
| |
| \begin_layout Standard |
| The reservation is recorded in each qgroup. |
| Also it is saved in the trans_handle, so it can be freed on end_transaction. |
| The estimation is not a worst-case estimation like the block reservation. |
| It shouldn't deny requests too early. |
| On the other hand, it might be possible that a qgroup goes slightly over |
| quota. |
| \end_layout |
| |
| \end_body |
| \end_document |