| #LyX 2.0 created this file. For more info see http://www.lyx.org/ |
| \lyxformat 413 |
| \begin_document |
| \begin_header |
| \textclass article |
| \use_default_options true |
| \maintain_unincluded_children false |
| \language english |
| \language_package default |
| \inputencoding auto |
| \fontencoding global |
| \font_roman default |
| \font_sans default |
| \font_typewriter default |
| \font_default_family default |
| \use_non_tex_fonts false |
| \font_sc false |
| \font_osf false |
| \font_sf_scale 100 |
| \font_tt_scale 100 |
| |
| \graphics default |
| \default_output_format default |
| \output_sync 0 |
| \bibtex_command default |
| \index_command default |
| \paperfontsize default |
| \use_hyperref false |
| \papersize default |
| \use_geometry false |
| \use_amsmath 1 |
| \use_esint 1 |
| \use_mhchem 1 |
| \use_mathdots 1 |
| \cite_engine basic |
| \use_bibtopic false |
| \use_indices false |
| \paperorientation portrait |
| \suppress_date false |
| \use_refstyle 0 |
| \index Index |
| \shortcut idx |
| \color #008000 |
| \end_index |
| \secnumdepth 3 |
| \tocdepth 3 |
| \paragraph_separation indent |
| \paragraph_indentation default |
| \quotes_language english |
| \papercolumns 1 |
| \papersides 1 |
| \paperpagestyle default |
| \tracking_changes false |
| \output_changes false |
| \html_math_output 0 |
| \html_css_as_file 0 |
| \html_be_strict false |
| \end_header |
| |
| \begin_body |
| |
| \begin_layout Title |
| Btrfs Subvolume Quota Groups |
| \end_layout |
| |
| \begin_layout Author |
| Arne Jansen, Strato AG |
| \end_layout |
| |
| \begin_layout Date |
| October 2011 |
| \end_layout |
| |
| \begin_layout Section |
| Subvolume Quota Concepts |
| \end_layout |
| |
| \begin_layout Standard |
| The concept of quota has a long-standing tradition in the Unix world. |
| Ever since computers allow multiple users to work simultaneously in one |
| filesystem, there is the need to prevent one user from using up the entire |
| space. |
| Every user should get his fair share of the available resources. |
| \end_layout |
| |
| \begin_layout Standard |
| In case of files, the solution is quite straightforward. |
| Each file has an |
| \begin_inset Quotes eld |
| \end_inset |
| |
| owner |
| \begin_inset Quotes erd |
| \end_inset |
| |
| recorded along with it, and it has a size. |
| Traditional quota just restricts the total size of all files that are owned |
| by a user. |
| The concept is quite flexible: if a user hits his quota limit, the administrato |
| r can raise it on the fly. |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand, the traditional approach has only a poor solution to |
| restrict directories. |
| At installation time, the harddisk can be partitioned so that every directory |
| (e. |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| g. |
| \begin_inset space \space{} |
| \end_inset |
| |
| |
| \family typewriter |
| /usr |
| \family default |
| , |
| \family typewriter |
| /var |
| \family default |
| , ...) that needs a limit gets its own partition. |
| The obvious problem is, that those limits cannot be changed without a reinstall |
| ation. |
| The btrfs subvolume feature builds a bridge. |
| Subvolumes correspond in many ways to partitions, as every subvolume looks |
| like its own filesystem. |
| With subvolume quota, it is now possible to restrict each subvolume like |
| a partition, but keep the flexibility of quota. |
| The space for each subvolume can be expanded or restricted on the fly. |
| \end_layout |
| |
| \begin_layout Standard |
| As subvolumes are the basis for snapshots, interesting questions arise as |
| to how to account used space in the presence of snapshots. |
| If you have a file shared between a subvolume and a snapshot, whom to account |
| the file to? The creator? Both? What if the file gets modified in the snapshot, |
| should only these changes be accounted to it? But wait, both the snapshot |
| and the subvolume belong to the same user home. |
| I just want to limit the total space used by both! But somebody else might |
| not want to charge the snapshots to the users. |
| \end_layout |
| |
| \begin_layout Standard |
| Btrfs subvolume quota solves these problems by introducing groups of subvolumes |
| and let the user put limits on them. |
| It is even possible to have groups of groups. |
| In the following, we refer to them as |
| \begin_inset Quotes eld |
| \end_inset |
| |
| qgroups |
| \begin_inset Quotes erd |
| \end_inset |
| |
| . |
| Each qgroup primarily tracks two numbers, the amount of total referenced |
| space and the amount of exclusively referenced space. |
| \end_layout |
| |
| \begin_layout Standard |
| |
| \emph on |
| Referenced |
| \emph default |
| space is the amount of data that can be reached from any of the subvolumes |
| contained in the qgroup, while |
| \emph on |
| exclusive |
| \emph default |
| is the amount of data where all references to this data can be reached |
| from within this qgroup. |
| \end_layout |
| |
| \begin_layout Section |
| Subvolume Quota Groups |
| \end_layout |
| |
| \begin_layout Standard |
| The basic notion of the Subvolume Quota feature is the qouta group, short |
| qgroup. |
| Qgroups are notated as <level>/<id>, e. |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| g. |
| \begin_inset space \space{} |
| \end_inset |
| |
| the qgroup 3/2 is a qgroup of level |
| \begin_inset space ~ |
| \end_inset |
| |
| 3. |
| For level |
| \begin_inset space ~ |
| \end_inset |
| |
| 0, the leading |
| \begin_inset Quotes eld |
| \end_inset |
| |
| 0/ |
| \begin_inset Quotes erd |
| \end_inset |
| |
| can be omitted. |
| Qgroups of level |
| \begin_inset space ~ |
| \end_inset |
| |
| 0 get created automatically when a subvolume/snapshot gets created. |
| The ID of the qgroup corresponds to the ID of the subvolume, so 0/5 is |
| the qgroup for the root subvolume. |
| For the |
| \begin_inset Quotes eld |
| \end_inset |
| |
| btrfs qgroup |
| \begin_inset Quotes erd |
| \end_inset |
| |
| command, the path to the subvolume can also be used instead of 0/<ID>. |
| For all higher levels, the ID can be choosen freely. |
| \end_layout |
| |
| \begin_layout Standard |
| Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy |
| of qgroups. |
| Figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups1.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| Sample qgroup hierarchy |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| shows an example qgroup tree. |
| At the bottom, some extents are depicted showing which qgroups reference |
| which extents. |
| It is important to understand the notion of |
| \emph on |
| referenced |
| \emph default |
| versus |
| \emph on |
| exclusive |
| \emph default |
| . |
| In the example, qgroup 0/2 references extents 2 and 3, while 1/2 references |
| extents 2--4. |
| 2/1 references all extents. |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand, extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 1 is exclusive to 0/1, extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 2 is exclusive to 0/2, while extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 is neither exclusive to 0/2 nor to 0/3. |
| But because both references can be reached from 1/2, extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 is exclusive to 1/2. |
| All extents are exclusive to 2/1. |
| So |
| \emph on |
| exclusive |
| \emph default |
| does not mean there is no other way to reach the extent, but it does mean |
| that if you delete all subvolumes contained in a qgroup, the extent will |
| get deleted. |
| |
| \emph on |
| Exclusive |
| \emph default |
| of a qgroup conveys the useful information how much space will be freed |
| in case all subvolumes of the qgroup get deleted. |
| \end_layout |
| |
| \begin_layout Standard |
| All data extents are accounted this way. |
| Metadata that belongs to a specific subvolume (i. |
| \begin_inset space \space{} |
| \end_inset |
| |
| e. |
| \begin_inset space \space{} |
| \end_inset |
| |
| its filesystem tree) is also accounted. |
| Checksums and extent allocation information are not accounted. |
| \end_layout |
| |
| \begin_layout Standard |
| In turn, the |
| \emph on |
| referenced |
| \emph default |
| count of a qgroup can be limited. |
| All writes beyond this limit will lead to a |
| \begin_inset Quotes eld |
| \end_inset |
| |
| Quota Exceeded |
| \begin_inset Quotes erd |
| \end_inset |
| |
| error. |
| \end_layout |
| |
| \begin_layout Section |
| Inheritance |
| \end_layout |
| |
| \begin_layout Standard |
| Things get a bit more complicated when new subvolumes or snapshots are created. |
| The case of (empty) subvolumes is still quite easy. |
| If a subvolume should be part of a qgroup, it has to be added to the qgroup |
| at creation time. |
| To add it at a later time, it would be necessary to at least rescan the |
| full subvolume for a proper accounting. |
| \end_layout |
| |
| \begin_layout Standard |
| Creation of a snapshot is the hard case. |
| Obviously, the snapshot will reference the exact amount of space as its |
| source, and both source and destination now have an |
| \emph on |
| exclusive |
| \emph default |
| count of 0 (4 |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| kB to be precise, as the roots of the trees are not shared). |
| But what about qgroups of higher levels? If the qgroup contains both the |
| source and the destination, nothing changes. |
| If the qgroup contains only the source, it might lose some |
| \emph on |
| exclusive |
| \emph default |
| . |
| But how much? The tempting answer is, |
| \begin_inset Quotes eld |
| \end_inset |
| |
| subtract all |
| \emph on |
| exclusive |
| \emph default |
| of the source from the qgroup |
| \begin_inset Quotes erd |
| \end_inset |
| |
| , but that is wrong, or at least not enough. |
| There could have been an extent that is referenced from the source and |
| another subvolume from that qgroup. |
| This extent would have been exclusive to the qgroup, but not to the source |
| subvolume. |
| With the creation of the snapshot, the qgroup would also lose this extent |
| from its exclusive set. |
| \end_layout |
| |
| \begin_layout Standard |
| So how can this problem be solved? In the instant the snapshot gets created, |
| we already have to know the correct |
| \emph on |
| exclusive |
| \emph default |
| count. |
| We need to have a second qgroup that contains all the subvolumes as the |
| first qgroup, except the subvolume we want to snapshot. |
| The moment we create the snapshot, the |
| \emph on |
| exclusive |
| \emph default |
| count from the second qgroup needs to be copied to the first qgroup, as |
| it represents the correct value. |
| The second qgroup is called a tracking qgroup. |
| It is only there in case a snapshot is needed. |
| \end_layout |
| |
| \begin_layout Section |
| Use Cases |
| \end_layout |
| |
| \begin_layout Subsection |
| Single-user machine |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Replacement for partitions |
| \end_layout |
| |
| \begin_layout Standard |
| The simplest use case is to use qgroups as simple replacement for partitions. |
| Btrfs takes the disk as a whole, and |
| \family typewriter |
| / |
| \family default |
| , |
| \family typewriter |
| /usr |
| \family default |
| , |
| \family typewriter |
| /var |
| \family default |
| etc. |
| \begin_inset space \space{} |
| \end_inset |
| |
| are created as subvolumes. |
| As each subvolume gets it own qgroup automatically, they can simply be |
| restricted. |
| No hierarchy is needed for that. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Track usage of snapshots |
| \end_layout |
| |
| \begin_layout Standard |
| When a snapshot is taken, a qgroup for it will automatically be created |
| with the correct values. |
| |
| \emph on |
| Referenced |
| \emph default |
| will show how much is in it, possibly shared with other subvolumes. |
| |
| \emph on |
| Exclusive |
| \emph default |
| will be the amount of space that gets freed when the subvolume is deleted. |
| \end_layout |
| |
| \begin_layout Subsection |
| Multi-user machine / Hosting |
| \end_layout |
| |
| \begin_layout Subsubsection |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "sub:restricting-homes" |
| |
| \end_inset |
| |
| Restricting homes |
| \end_layout |
| |
| \begin_layout Standard |
| When you have several users on a machine, with home directories probably |
| under |
| \family typewriter |
| /home |
| \family default |
| , you might want to restrict |
| \family typewriter |
| /home |
| \family default |
| as a whole, while restricting every user to an indiviual limit as well. |
| This is easily accomplished by creating a qgroup for |
| \family typewriter |
| /home |
| \family default |
| , e. |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| g. |
| \begin_inset space \space{} |
| \end_inset |
| |
| 1/1, and assigning all user subvolumes to it. |
| Restricting this qgroup will limit |
| \family typewriter |
| /home |
| \family default |
| , while every user subvolume can get its own (lower) limit. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "sub:accounting-snapshots-to" |
| |
| \end_inset |
| |
| Accounting snapshots to the user |
| \end_layout |
| |
| \begin_layout Standard |
| Let's say the user is allowed to create snapshots via some mechanism. |
| It would only be fair to account space used by the snapshots to the user. |
| This does not mean the user doubles his usage as soon as he takes a snapshot. |
| Of course, files that are present in his home and the snapshot should only |
| be accounted once. |
| This can be accomplished by creating a qgroup for each user, say 1/<uid>. |
| The user home and all snapshots are assigned to this qgroup. |
| Limiting it will extend the limit to all snapshots, counting files only |
| once. |
| To limit |
| \family typewriter |
| /home |
| \family default |
| as a whole, a higher level group 2/1 replacing 1/1 from the previous example |
| is needed, with all user qgroups assigned to it. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Do not account snapshots |
| \end_layout |
| |
| \begin_layout Standard |
| On the other hand, when the snapshots get created automatically, the user |
| has no chance to control them, so the space used by them should not be |
| accounted to him. |
| This is already the case when creating snapshots in the example from section |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "sub:restricting-homes" |
| |
| \end_inset |
| |
| . |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Snapshots for backup purposes |
| \end_layout |
| |
| \begin_layout Standard |
| This scenario is a mixture of the previous two. |
| The user can create snapshots, but some snapshots for backup purposes are |
| being created by the system. |
| The user's snapshots should be accounted to the user, not the system. |
| The solution is similar to the one from section |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "sub:accounting-snapshots-to" |
| |
| \end_inset |
| |
| , but do not assign system snapshots to user's qgroup. |
| \end_layout |
| |
| \begin_layout Section |
| Implementation |
| \end_layout |
| |
| \begin_layout Subsection |
| Update algorithm |
| \end_layout |
| |
| \begin_layout Standard |
| The update algorithm is the core of the quota implementation. |
| Whenever a reference is added or removed, the update algorithm is called. |
| \end_layout |
| |
| \begin_layout Standard |
| The algorithm is called with the address of the extent for which to add/remove |
| the reference, the root of the reference, the amount of space to add/remove, |
| and of course the operation to perform. |
| \end_layout |
| |
| \begin_layout Standard |
| A call could look like this |
| \end_layout |
| |
| \begin_layout LyX-Code |
| qgroup_record_ref(ref_root, start, num_bytes, operation); |
| \end_layout |
| |
| \begin_layout Standard |
| In fact, these parameters are all contained in the delayed ref structure, |
| so just the delayed ref node is passed instead. |
| This function gets called from the central point where backrefs are added |
| to the filesystem, |
| \family typewriter |
| btrfs_\SpecialChar \- |
| add_\SpecialChar \- |
| delayed_*_ref |
| \family default |
| . |
| \end_layout |
| |
| \begin_layout Standard |
| The algorithm works in multiple steps: |
| \end_layout |
| |
| \begin_layout Enumerate |
| Find all referencing roots |
| \end_layout |
| |
| \begin_layout Enumerate |
| Calculate refcnt for all qgroups |
| \end_layout |
| |
| \begin_layout Enumerate |
| Tag qgroups |
| \end_layout |
| |
| \begin_layout Enumerate |
| Update |
| \emph on |
| exclusive |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Find all referencing roots |
| \end_layout |
| |
| \begin_layout Standard |
| The first step is to find all roots that are currently referencing the extent. |
| Though btrfs is fully back-referenced, this step is not as easy as it may |
| seem, because of the lazy refcounting scheme. |
| |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename 5roots.svg |
| scale 50 |
| |
| \end_inset |
| |
| |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Extent-with-lazy" |
| |
| \end_inset |
| |
| Extent with lazy references |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| The back references that are recorded for the extent may not tell the full |
| truth. |
| In figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Extent-with-lazy" |
| |
| \end_inset |
| |
| , a tree is depicted where the actual extent only has two back references |
| recorded, whereas there are five roots referencing it. |
| \end_layout |
| |
| \begin_layout Standard |
| The solution is to walk up the tree and follow all back references until |
| all roots are found. |
| This looks like a classic problem for a recursive tree walk, but recursion |
| here is not possible for two reasons: |
| \end_layout |
| |
| \begin_layout Enumerate |
| The code runs in kernel space with very limited stack space. |
| With a recursion, the stack may overflow. |
| \end_layout |
| |
| \begin_layout Enumerate |
| To follow a back reference, the referenced extent has to be searched. |
| This is due to the nature of the indirect back references used. |
| These back references point to a key in the tree, not to the address of |
| an extent. |
| \end_layout |
| |
| \begin_layout Standard |
| The code solves this by keeping two lists, one for all roots found and one |
| for all backrefs to follow. |
| Initially, the list of roots is empty, while the list of backrefs is filled |
| with only one item, the reference to the extent for which all backrefs |
| are to be found. |
| \end_layout |
| |
| \begin_layout Standard |
| The following pseudo-code describes how all roots are found: |
| \end_layout |
| |
| \begin_layout LyX-Code |
| foreach ref (0 ... |
| #refs in ulist) |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| find extent for ref |
| \end_layout |
| |
| \begin_layout LyX-Code |
| add all refs for extent to ulist |
| \end_layout |
| |
| \begin_layout LyX-Code |
| if (extent is root) |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| add root to ulist of roots |
| \end_layout |
| |
| \begin_layout LyX-Code |
| |
| \end_layout |
| |
| \end_deeper |
| \end_deeper |
| \begin_layout Standard |
| The lists here are called |
| \begin_inset Quotes eld |
| \end_inset |
| |
| ulists |
| \begin_inset Quotes erd |
| \end_inset |
| |
| , because they only accept new items if they are not already in the list, |
| i. |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| e. |
| \begin_inset space \space{} |
| \end_inset |
| |
| if they are unique. |
| \end_layout |
| |
| \begin_layout Standard |
| The step to add all backrefs for an extent involves finding all recorded |
| inline backrefs, all in-tree backrefs and all delayed refs for the extent |
| up to the moment the algorithms starts to run. |
| Because this code might run some time, new delayed refs for any extent |
| in the tree might be added in the meantime. |
| To avoid a race condition here, each delayed ref gets a sequence number. |
| Only delayed refs with seq |
| \begin_inset space ~ |
| \end_inset |
| |
| < own seq are considered. |
| Also, no delayed ref with a higher seq than own seq must be run while the |
| roots are searched for. |
| \end_layout |
| |
| \begin_layout Standard |
| The code will never include the reference to add/delete. |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Calculate refcnt for all qgroups |
| \end_layout |
| |
| \begin_layout Standard |
| After the list of referencing roots is known, the next three steps all operate |
| on the qgroup hierarchy. |
| A sample hierarchy is depicted in figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Sample-qgroup-hierarchy" |
| |
| \end_inset |
| |
| . |
| \end_layout |
| |
| \begin_layout Standard |
| The first operation on the tree is to calculate the number of references |
| that can be reached from every given qgroup. |
| This is done by walking the tree upwards from every root found in the previous |
| step and incrementing a count on each qgroup visited, where each root can |
| only increment the count by one for every qgroup it can reach, even if |
| it can reach it by several paths. |
| The calculated count is called the refcnt. |
| \end_layout |
| |
| \begin_layout Standard |
| As in the previous step, the tree is walked iteratively with the help of |
| ulists to avoid recursion. |
| Figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| depicts the state after this step is done for extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3, where the ref from 0/2 should get deleted. |
| The Figure omits the fs trees and their roots, as qgroups of level |
| \begin_inset space ~ |
| \end_inset |
| |
| 0 directly correspond to a root. |
| \end_layout |
| |
| \begin_layout Standard |
| As the refcnt is part of the qgroup struct, the algorithm would require |
| that all refcnts in all qgroups be set to zero before it can run. |
| To avoid this, a global sequence number is used to determine the refcnt. |
| Only one thread at a time can currently do refcounting on the tree (this |
| is easily changable, should it impose a limit). |
| This thread grabs the next sequence number and walks up the tree. |
| If the refcnt of the visited qgroup is smaller than the seq, it is not |
| yet set and known to be |
| \begin_inset space ~ |
| \end_inset |
| |
| 0. |
| Otherwise, it is incremented. |
| After the algorithm has run, the global sequence number is incremented |
| by the max refcnt found. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups2.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:qgroup-tree-after" |
| |
| \end_inset |
| |
| qgroup tree after refcnt augmentation for extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Tag qgroups |
| \end_layout |
| |
| \begin_layout Standard |
| The next step is to walk up the tree again, but this time starting with |
| ref_root, the root to add/remove. |
| Remember that the previous step does not include the ref_root. |
| Every qgroup that is being visited on the way up will be tagged in preparation |
| for the next step. |
| Additionally, under certain conditions, a first adjustment is made to the |
| values of the visited qgroups. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the operation is to add a reference, this means |
| this qgroup is not yet referencing this extent, but after the operation, |
| it will, so the |
| \emph on |
| referenced |
| \emph default |
| value of the qgroup is increased by num_bytes. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the operation is to remove a reference, this means |
| this qgroup is currently referencing the extent, but through the operation, |
| it will lose its last reference, so the |
| \emph on |
| referenced |
| \emph default |
| value is decreased by num_bytes. |
| \end_layout |
| |
| \begin_layout Itemize |
| If the refcnt is zero and the number of roots found in the first step is |
| also zero, this means: |
| \end_layout |
| |
| \begin_layout Itemize |
| In case of addition: the added reference will be the only reference, so |
| |
| \emph on |
| exclusive |
| \emph default |
| of the qgroup is increased by num_bytes. |
| \end_layout |
| |
| \begin_layout Itemize |
| In case of removal: the reference is the last to remove, which means it |
| is currently exclusive to ref_root, so |
| \emph on |
| exclusive |
| \emph default |
| of the qgroup is decreased by num_bytes. |
| \end_layout |
| |
| \begin_layout Standard |
| Figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| depicts the situation given the reference for 0/2 to extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 is to be deleted. |
| |
| \emph on |
| Referenced |
| \emph default |
| of 0/2 and 1/1 will get decreased. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups3.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:qgroup-tree-after-1" |
| |
| \end_inset |
| |
| qgroup tree after tagging |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsubsection |
| Update |
| \emph on |
| exclusive |
| \end_layout |
| |
| \begin_layout Standard |
| The last step adjusts the |
| \emph on |
| exclusive |
| \emph default |
| counts of all untagged qgroups. |
| The |
| \emph on |
| exclusive |
| \emph default |
| counts of the tagged qgroups already got adjusted in the previous step. |
| All roots from step |
| \begin_inset space ~ |
| \end_inset |
| |
| 1 are walked again, tagged qgroups are skipped. |
| If the refcnt equals the number of roots found in step one, |
| \emph on |
| exclusive |
| \emph default |
| gets increased if the ref is to be removed and decreased otherwise. |
| Figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:update-of-exclusive" |
| |
| \end_inset |
| |
| shows the outcome of this step. |
| Extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 is now |
| \emph on |
| exclusive |
| \emph default |
| to 0/3. |
| All other |
| \emph on |
| exclusives |
| \emph default |
| are untouched. |
| Extent |
| \begin_inset space ~ |
| \end_inset |
| |
| 3 was exclusive to 1/2 and 2/1 and still is, while it was not exclusive |
| to 0/2 and 1/1 and still is not. |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename qgroups4.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:update-of-exclusive" |
| |
| \end_inset |
| |
| Update of |
| \emph on |
| exclusive |
| \emph default |
| on qgroup tree |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Subsection |
| Tracking Groups |
| \end_layout |
| |
| \begin_layout Standard |
| As seen in the introductory chapter, when taking a snapshot, the values |
| of several qgroups might need to be adjusted. |
| This is easiest to see when looking at some examples. |
| Figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:Tracking-snapshots" |
| |
| \end_inset |
| |
| shows a simple example where tracking groups are needed. |
| |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking1.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:Tracking-snapshots" |
| |
| \end_inset |
| |
| Tracking snapshots |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| The exercise is to track |
| \emph on |
| referenced |
| \emph default |
| and |
| \emph on |
| exclusive |
| \emph default |
| for all snapshots of a subvolume. |
| The gray qgroups 0/2--0/4 are all snapshot of 0/1. |
| Before 0/4 is created, 1/2 contains 0/2 and 0/3. |
| The moment 0/4 gets created, it is added to 1/2. |
| The |
| \emph on |
| exclusive |
| \emph default |
| count of 1/2 will not change, as all extents that become reachable from |
| 1/2 are also reachable from 1/1. |
| More problematic is the |
| \emph on |
| referenced |
| \emph default |
| count, as not all extents from 0/4 might be new to 1/2. |
| The solution is to add another qgroup, 1/3, that tracks 0/1 and all subvolumes |
| of it (figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:A-tracking-qgroup" |
| |
| \end_inset |
| |
| ). |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking2.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:A-tracking-qgroup" |
| |
| \end_inset |
| |
| A tracking qgroup is needed for 1/2 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| The moment the snapshot gets created, 1/3 holds the correct |
| \emph on |
| referenced |
| \emph default |
| count for all snapshots. |
| To get 1/2 back to the correct values, |
| \emph on |
| referenced |
| \emph default |
| from 1/3 has to be copied to 1/2, while |
| \emph on |
| exclusive |
| \emph default |
| of 1/2 stays untouched. |
| \end_layout |
| |
| \begin_layout Standard |
| In the next step, we want to take a snapshot of 0/2. |
| The resulting snapshot should not be part of 1/2. |
| This poses another problem: while |
| \emph on |
| referenced |
| \emph default |
| does not change, |
| \emph on |
| exclusive |
| \emph default |
| needs to be corrected. |
| For this, we need another tracking group, 1/4 (figure |
| \begin_inset space ~ |
| \end_inset |
| |
| |
| \begin_inset CommandInset ref |
| LatexCommand ref |
| reference "fig:A-snapshot-of" |
| |
| \end_inset |
| |
| ). |
| \end_layout |
| |
| \begin_layout Standard |
| \begin_inset Float figure |
| wide false |
| sideways false |
| status open |
| |
| \begin_layout Plain Layout |
| \begin_inset Graphics |
| filename tracking3.svg |
| scale 20 |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| \begin_inset Caption |
| |
| \begin_layout Plain Layout |
| \begin_inset CommandInset label |
| LatexCommand label |
| name "fig:A-snapshot-of" |
| |
| \end_inset |
| |
| A snapshot of 0/2 |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Plain Layout |
| |
| \end_layout |
| |
| \end_inset |
| |
| |
| \end_layout |
| |
| \begin_layout Standard |
| When 0/5 is created, |
| \emph on |
| exclusive |
| \emph default |
| from 1/4 needs to be copied to 1/2. |
| Snapshotting 0/2 also invalidates the |
| \emph on |
| exclusive |
| \emph default |
| of 1/3. |
| Also, another snapshot of 0/1 would invalidate 1/4. |
| So one more tracking groups is needed, containing 0/1, 0/4 and 0/3. |
| \end_layout |
| |
| \begin_layout Standard |
| It is planned that the btrfs userland utility will keep track of the needed |
| tracking groups and takes care that all the necessary copies happen. |
| For this, a format needs to be found how a user can describe what snapshots |
| he intends to take. |
| Keeping tracking groups for all possible combinations would lead to an |
| exponential number of tracking groups. |
| \end_layout |
| |
| \begin_layout Subsection |
| On-disk quota tree layout |
| \end_layout |
| |
| \begin_layout Standard |
| Qgroups add a new tree, the quota tree. |
| Four new keys are used in this tree. |
| The overall status is recorded in a status item, and each qgroup has two |
| items, one to record the user configured limits and one to record the current |
| |
| \emph on |
| referenced |
| \emph default |
| / |
| \emph on |
| exclusive |
| \emph default |
| counts. |
| Each parent/child-relationship between qgroups gets two qgroup_relation |
| items, one per direction. |
| The on-disk structure is still preliminary. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the overall state of the qgroups. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * There is only one instance of this key present, |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (0, BTRFS_QGROUP_STATUS_KEY, 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_KEY 240 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the currently used space of the qgroup. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * One key per qgroup, (0, BTRFS_QGROUP_INFO_KEY, qgroupid). |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_INFO_KEY 242 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Contains the user configured limits for the qgroup. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * One key per qgroup, (0, BTRFS_QGROUP_LIMIT_KEY, qgroupid). |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_KEY 244 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Records the child-parent relationship of qgroups. |
| For |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * each relation, 2 keys are present: |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (childid, BTRFS_QGROUP_RELATION_KEY, parentid) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * (parentid, BTRFS_QGROUP_RELATION_KEY, childid) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_RELATION_KEY 246 |
| \end_layout |
| |
| \begin_layout Standard |
| The keys are chosen in a way such that |
| \family typewriter |
| STATUS_KEY |
| \family default |
| comes first, followed by all |
| \family typewriter |
| INFO_KEY |
| \family default |
| s, followed by all |
| \family typewriter |
| LIMIT_KEY |
| \family default |
| s. |
| After that, for each qgroup present, all relations follow. |
| Only the |
| \family typewriter |
| INFO_KEY |
| \family default |
| s and the |
| \family typewriter |
| STATUS_KEY |
| \family default |
| get updated regularly. |
| The idea is that those keys stay close to each other to minimize writes. |
| The |
| \family typewriter |
| RELATION_KEY |
| \family default |
| is chosen in a way that, by a simple enumeration, all children and parents |
| for a given qgroup can be found. |
| The qgroupid is composed of a 16\SpecialChar \nobreakdash- |
| bit |
| \begin_inset Quotes eld |
| \end_inset |
| |
| level |
| \begin_inset Quotes erd |
| \end_inset |
| |
| field, followed by a 48\SpecialChar \nobreakdash- |
| bit |
| \begin_inset Quotes eld |
| \end_inset |
| |
| id |
| \begin_inset Quotes erd |
| \end_inset |
| |
| field. |
| A qgroupid is represented as level/id, e. |
| \begin_inset space \thinspace{} |
| \end_inset |
| |
| g. |
| \begin_inset space \space{} |
| \end_inset |
| |
| 2/100. |
| In the case of a subvolume, the level is 0, and the |
| \begin_inset Quotes eld |
| \end_inset |
| |
| id |
| \begin_inset Quotes erd |
| \end_inset |
| |
| is just the internal tree objectid (5 or >= 256). |
| On the command line, the user will be able to use the subvolume path as |
| the identifier. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * is subvolume quota turned on? |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_ON (1ULL << 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * SCANNING is set during the initialization phase |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_SCANNING (1ULL << 1) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Some qgroup entries are known to be out of date, |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * either because the configuration has changed in a way that |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * makes a rescan necessary, or because the fs has been mounted |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * with a non-qgroup-aware version. |
| |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * Turning qouta off and on again makes it inconsistent, too. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT (1ULL << 2) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_STATUS_VERSION 1 |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_status_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| __le64 version; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * the generation is updated during every commit. |
| As older |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * versions of btrfs are not aware of qgroups, it will be |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * possible to detect inconsistencies by checking the |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * generation on mount time |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 generation; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* flag definitions see above */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 flags; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * only used during scanning to record the progress |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * of the scan. |
| It contains a logical address |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 scan; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout LyX-Code |
| |
| \end_layout |
| |
| \begin_layout Standard |
| Instead of hosting the scan cursor in the structure, one could also make |
| a separate key instead that is only present during scanning. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_info_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| /* |
| \end_layout |
| |
| \begin_layout LyX-Code |
| * only updated when any of the other values change |
| \end_layout |
| |
| \begin_layout LyX-Code |
| */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 generation; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rfer; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rfer_cmpr; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 excl; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 excl_cmpr; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout Standard |
| For all uncompressed data, the same value will be recorded for compressed |
| and uncompressed. |
| The |
| \family typewriter |
| *_cmpr |
| \family default |
| values represent the amount of disk space used, the other values the amount |
| of space from a user perspective. |
| The uncompressed values are hard to get, so a first version might not support |
| them yet and just record the on-disk values instead. |
| \end_layout |
| |
| \begin_layout LyX-Code |
| /* flags definition for qgroup limits */ |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_MAX_EXCL (1ULL << 1) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RSV_RFER (1ULL << 2) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RSV_EXCL (1ULL << 3) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_RFER_CMPR (1ULL << 4) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| #define BTRFS_QGROUP_LIMIT_EXCL_CMPR (1ULL << 5) |
| \end_layout |
| |
| \begin_layout LyX-Code |
| struct btrfs_qgroup_limit_item { |
| \end_layout |
| |
| \begin_deeper |
| \begin_layout LyX-Code |
| __le64 flags; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 max_referenced; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 max_exclusive; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rsv_referenced; |
| \end_layout |
| |
| \begin_layout LyX-Code |
| __le64 rsv_exclusive; |
| \end_layout |
| |
| \end_deeper |
| \begin_layout LyX-Code |
| } __attribute__ ((__packed__)); |
| \end_layout |
| |
| \begin_layout Standard |
| The flags record which of the limits are to be enforced. |
| The last two flags indicate whether the compressed or the uncompressed |
| value is to limit. |
| This structure also contains reservations, though they might be hard to |
| implement, as btrfs has no clear understanding of how much free space is |
| left. |
| A straightforward implementation might be very inaccurate and the first |
| version will probably not implement it. |
| Those values are nevertheless included here as a means for future expansion. |
| \end_layout |
| |
| \begin_layout Subsection |
| Estimation |
| \end_layout |
| |
| \begin_layout Standard |
| In btrfs, each file operation is encapsulated into a transaction. |
| All necessary space for the transaction has to be reserved before any modificat |
| ion is done to the structures, as there is no way to back out in the middle. |
| That is what block reserves are used for. |
| \end_layout |
| |
| \begin_layout Standard |
| The same holds for quota: it is not possible to deny an operation in the |
| middle of it. |
| The only point where an |
| \family typewriter |
| EDQUOT |
| \family default |
| (Quota exceeded) error can be generated is before the start of the operation. |
| The easiest way would be to only deny it if one of the affected qgroups |
| is already over quota, but that would allow large operations to exceed |
| the quota by far. |
| This implementation tries to estimate the needed space for the operation |
| and reserves it at the start of the operation. |
| If the reservation fails, the operation is denied. |
| \end_layout |
| |
| \begin_layout Standard |
| The reservation is recorded in each qgroup. |
| Also, it is saved in the trans_handle, so it can be freed on end_transaction. |
| The estimation is not a worst-case estimation like the block reservation. |
| It should not deny requests too early. |
| On the other hand, it might be possible that a qgroup goes slightly over |
| quota. |
| \end_layout |
| |
| \end_body |
| \end_document |