<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="Extended_Attributes"><title>
Extended Attributes</title>
<para>Extended attributes let a user attach name:value pairs to inodes within an XFS filesystem. They can be used to store meta-information about the file.</para>
<para>The attribute names can be up to 256 bytes in length, terminated by the first 0 byte. The intent is that they be printable ASCII (or other character set) names for the attribute. The values can be up to 64KB of arbitrary binary data. Some XFS internal attributes (eg. parent pointers) use non-printable names for the attribute.</para>
<para>Access Control Lists (ACLs) and Data Migration Facility (DMF) use extended attributes to store their associated metadata with an inode.</para>
<para>XFS uses two disjoint attribute name spaces associated with every inode. They are the root and user address spaces. The root address space is accessible only to the superuser, and then only by specifying a flag argument to the function call. Other users will not see or be able to modify attributes in the root address space. The user address space is protected by the normal file permissions mechanism, so the owner of the file can decide who is able to see and/or modify the value of attributes on any particular file.</para>
<para>To view extended attributes from the command line, use the <command>getfattr</command> command. To set or delete extended attributes, use the <command>setfattr</command> command. To view or modify ACLs, use the <command>getfacl</command> and <command>setfacl</command> commands.</para>
<para>XFS attributes support three namespaces: "user", "trusted" (or "root" using IRIX terminology) and "secure".</para>
<para>The location of the attribute fork in the inode's literal area is specified by the <command>di_forkoff</command> value in the inode's core. If this value is zero, the inode does not contain any extended attributes. If it is non-zero, the byte offset into the literal area is <command>di_forkoff * 8</command>, which also determines the 2048 byte maximum size for an inode. Attributes must be allocated on a 64-bit boundary on the disk, except shortform attributes, which are tightly packed. To determine the offset into the inode itself, add 100 (0x64) to <command>di_forkoff * 8</command>.</para>
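<para>The offset arithmetic above can be sketched in C as follows. This is a minimal illustration, not code from the XFS sources: the function and macro names are hypothetical, and the 100 byte inode core size is taken from the paragraph above.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/* Size of the inode core that precedes the literal area, per the text above. */
#define XFS_INODE_CORE_BYTES 100u

/* Byte offset of the attribute fork within the literal area (0 = no fork). */
unsigned int attr_fork_literal_offset(uint8_t di_forkoff)
{
    return (unsigned int)di_forkoff * 8u;
}

/* Byte offset of the attribute fork from the start of the inode. */
unsigned int attr_fork_inode_offset(uint8_t di_forkoff)
{
    return XFS_INODE_CORE_BYTES + attr_fork_literal_offset(di_forkoff);
}
```
]]></programlisting>
<para>For a <command>di_forkoff</command> of 15, this gives 120 bytes into the literal area and 220 (0xdc) bytes from the start of the inode.</para>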
<para>The following four sections describe each of the on-disk formats.</para>
<section id="Shortform_Attributes"><title>
Shortform Attributes</title>
<para>When all of the extended attributes fit within the inode's attribute fork, the inode's <command>di_aformat</command> is set to "local" and the attributes are stored in the inode's literal area starting at offset <command>di_forkoff * 8</command>.</para>
<para>Shortform attributes use the following structures:</para>
<programlisting>
typedef struct xfs_attr_shortform {
struct xfs_attr_sf_hdr {
__be16 totsize;
__u8 count;
} hdr;
struct xfs_attr_sf_entry {
__uint8_t namelen;
__uint8_t valuelen;
__uint8_t flags;
__uint8_t nameval[1];
} list[1];
} xfs_attr_shortform_t;
typedef struct xfs_attr_sf_hdr xfs_attr_sf_hdr_t;
typedef struct xfs_attr_sf_entry xfs_attr_sf_entry_t;
</programlisting>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/64.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>64</phrase></textobject>
</mediaobject>
</para>
<itemizedlist>
<listitem>
<para><command>namelen</command> and <command>valuelen</command> specify the size of the two byte arrays containing the name and value pairs. <command>valuelen</command> is zero for extended attributes with no value.</para>
</listitem>
<listitem>
<para><command>nameval[]</command> is a single array whose size is the sum of <command>namelen</command> and <command>valuelen</command>. The names and values are not null terminated on-disk. The value immediately follows the name in the array.</para>
</listitem>
<listitem>
<para><command>flags</command> specifies the namespace for the attribute (0 = "user"):</para>
<informaltable frame="all">
<tgroup cols="2"><thead><row>
<entry>
<para>Flag</para>
</entry>
<entry>
<para>Description</para>
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<para><command>XFS_ATTR_ROOT</command></para>
</entry>
<entry>
<para>The attribute's namespace is "trusted".</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_ATTR_SECURE</command></para>
</entry>
<entry>
<para>The attribute's namespace is "secure".</para>
</entry>
</row></tbody></tgroup>
</informaltable>
</listitem>
</itemizedlist>
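<para>As a rough sketch of how the fields above fit together, the following C function walks a raw shortform attribute area and returns a view of the entry at a given index. The names <command>sf_attr_view</command>, <command>sf_attr_get</command>, <command>SF_HDR_SIZE</command> and <command>SF_ENTRY_FIXED</command> are hypothetical. The 4 byte header size (the three header bytes plus one pad byte) is an assumption inferred from the <command>totsize</command> values in the xfs_db dumps in this section.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

#define SF_HDR_SIZE    4u  /* totsize (2) + count (1) + 1 pad byte: inferred */
#define SF_ENTRY_FIXED 3u  /* namelen + valuelen + flags */

/* Pointers into the caller's buffer; nothing is copied or NUL terminated. */
struct sf_attr_view {
    const uint8_t *name;   /* namelen bytes */
    const uint8_t *value;  /* valuelen bytes, immediately after the name */
    uint8_t namelen;
    uint8_t valuelen;
    uint8_t flags;
};

/* Fill *out with entry number `index`; return 0 on success, -1 on error. */
int sf_attr_get(const uint8_t *buf, size_t len, unsigned int index,
                struct sf_attr_view *out)
{
    if (len < SF_HDR_SIZE)
        return -1;

    size_t totsize = ((size_t)buf[0] << 8) | buf[1];  /* big-endian 16-bit */
    unsigned int count = buf[2];
    if (totsize > len || index >= count)
        return -1;

    size_t off = SF_HDR_SIZE;
    for (unsigned int i = 0; i <= index; i++) {
        if (off + SF_ENTRY_FIXED > totsize)
            return -1;
        uint8_t namelen = buf[off];
        uint8_t valuelen = buf[off + 1];
        size_t entsize = SF_ENTRY_FIXED + namelen + valuelen;
        if (off + entsize > totsize)
            return -1;
        if (i == index) {
            out->namelen = namelen;
            out->valuelen = valuelen;
            out->flags = buf[off + 2];
            out->name = buf + off + SF_ENTRY_FIXED;
            out->value = out->name + namelen;
            return 0;
        }
        off += entsize;  /* entries are tightly packed, no alignment */
    }
    return -1;
}
```
]]></programlisting>
<para>Because entries vary in length and carry no offsets, reaching entry N means stepping over the N entries before it; the bounds checks against <command>totsize</command> guard against a corrupt length field.</para>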
<bridgehead>xfs_db Example:</bridgehead>
<para>A file is created and two attributes are set:</para>
<programlisting>
# setfattr -n user.empty few_attr
# setfattr -n trusted.trust -v val1 few_attr
</programlisting>
<para>Using xfs_db, we dump the inode:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 0100644
...
core.naextents = 0
core.forkoff = 15
core.aformat = 1 (local)
...
a.sfattr.hdr.totsize = 24
a.sfattr.hdr.count = 2
a.sfattr.list[0].namelen = 5
a.sfattr.list[0].valuelen = 0
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].name = "empty"
a.sfattr.list[1].namelen = 5
a.sfattr.list[1].valuelen = 4
a.sfattr.list[1].root = 1
a.sfattr.list[1].secure = 0
a.sfattr.list[1].name = "trust"
a.sfattr.list[1].value = "val1"
</programlisting>
<para>We can determine the actual inode offset to be 220 (15 x 8 + 100) or <command>0xdc</command>.</para>
<para>Examining the raw dump, the second attribute is highlighted:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/65.png" format="PNG" /></imageobject>
<textobject><phrase>code65</phrase></textobject>
</mediaobject>
<para>When another attribute is added with attr1, the format is converted to extents and <command>di_forkoff</command> remains unchanged (and all those zeros in the dump above remain unused):</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.naextents = 1
core.forkoff = 15
core.aformat = 2 (extents)
...
a.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,37534,1,0]
</programlisting>
<para>Performing the same steps with attr2, adding one attribute at a time, you can see <command>di_forkoff</command> change as attributes are added:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.naextents = 0
core.forkoff = 15
core.aformat = 1 (local)
...
a.sfattr.hdr.totsize = 17
a.sfattr.hdr.count = 1
a.sfattr.list[0].namelen = 10
a.sfattr.list[0].valuelen = 0
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].name = "empty_attr"
</programlisting>
<para>Attribute added:</para>
<programlisting>
xfs_db&gt; p
...
core.naextents = 0
core.forkoff = 15
core.aformat = 1 (local)
...
a.sfattr.hdr.totsize = 31
a.sfattr.hdr.count = 2
a.sfattr.list[0].namelen = 10
a.sfattr.list[0].valuelen = 0
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].name = "empty_attr"
a.sfattr.list[1].namelen = 7
a.sfattr.list[1].valuelen = 4
a.sfattr.list[1].root = 1
a.sfattr.list[1].secure = 0
a.sfattr.list[1].name = "trust_a"
a.sfattr.list[1].value = "val1"
</programlisting>
<para>Another attribute is added:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/66.png" format="PNG" /></imageobject>
<textobject><phrase>code66</phrase></textobject>
</mediaobject>
<para>One more is added:</para>
<programlisting>
xfs_db&gt; p
core.naextents = 0
core.forkoff = 10
core.aformat = 1 (local)
...
a.sfattr.hdr.totsize = 69
a.sfattr.hdr.count = 4
a.sfattr.list[0].namelen = 10
a.sfattr.list[0].valuelen = 0
a.sfattr.list[0].root = 0
a.sfattr.list[0].secure = 0
a.sfattr.list[0].name = "empty_attr"
a.sfattr.list[1].namelen = 7
a.sfattr.list[1].valuelen = 4
a.sfattr.list[1].root = 1
a.sfattr.list[1].secure = 0
a.sfattr.list[1].name = "trust_a"
a.sfattr.list[1].value = "val1"
a.sfattr.list[2].namelen = 6
a.sfattr.list[2].valuelen = 12
a.sfattr.list[2].root = 0
a.sfattr.list[2].secure = 0
a.sfattr.list[2].name = "second"
a.sfattr.list[2].value = "second_value"
a.sfattr.list[3].namelen = 6
a.sfattr.list[3].valuelen = 8
a.sfattr.list[3].root = 0
a.sfattr.list[3].secure = 1
a.sfattr.list[3].name = "policy"
a.sfattr.list[3].value = "contents"
</programlisting>
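<para>The <command>totsize</command> values in the dumps above can be cross-checked with a short calculation. The sketch below uses hypothetical names and rests on two assumptions that this chapter does not state outright: a 4 byte shortform header (inferred from the dumps) and a 256 byte inode.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

#define SF_HDR_BYTES         4u  /* totsize + count + one pad byte (inferred) */
#define SF_ENTRY_FIXED_BYTES 3u  /* namelen + valuelen + flags */

struct sf_len {
    unsigned int namelen;
    unsigned int valuelen;
};

/* Header plus tightly packed entries: the value expected in hdr.totsize. */
unsigned int sf_expected_totsize(const struct sf_len *ents, size_t count)
{
    unsigned int total = SF_HDR_BYTES;

    for (size_t i = 0; i < count; i++)
        total += SF_ENTRY_FIXED_BYTES + ents[i].namelen + ents[i].valuelen;
    return total;
}

/* Bytes from the start of the attribute fork to the end of the inode. */
unsigned int attr_fork_space(unsigned int inode_size, unsigned int di_forkoff)
{
    return inode_size - 100u - di_forkoff * 8u;  /* 100 = inode core size */
}
```
]]></programlisting>
<para>For the four entries shown above (name/value lengths 10/0, 7/4, 6/12 and 6/8) the result is 69, matching <command>hdr.totsize</command>. Under the 256 byte inode assumption, a <command>di_forkoff</command> of 15 leaves 36 bytes for the attribute fork, while the attr2 value of 10 leaves 76 bytes, which is enough to hold all four attributes inline.</para>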
<para>A raw dump is shown for comparison with the attr1 dump above; the header is highlighted:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/67.png" format="PNG" /></imageobject>
<textobject><phrase>code67</phrase></textobject>
</mediaobject>
<para>It can be clearly seen that attr2 allows many more attributes to be stored in an inode before they are moved to another filesystem block.</para></section>
<section id="Leaf_Attributes"><title>Leaf Attributes</title>
<para>When an inode's attribute fork space is used up with shortform attributes and more are added, the attribute format is migrated to "extents".</para>
<para>Extent based attributes use hash/index pairs to speed up an attribute lookup. The first part of the "leaf" contains an array of fixed size hash/index pairs with the flags stored as well. The remaining part of the leaf block contains the array of name/value pairs, where each element varies in length.</para>
<para>Each leaf is based on the <command>xfs_da_blkinfo_t</command> block header declared in Leaf Directories. The structure encapsulating all other structures is the <command>xfs_attr_leafblock_t</command>.</para>
<para>The structures involved are:</para>
<programlisting>
typedef struct xfs_attr_leaf_map {
__be16 base;
__be16 size;
} xfs_attr_leaf_map_t;
typedef struct xfs_attr_leaf_hdr {
xfs_da_blkinfo_t info;
__be16 count;
__be16 usedbytes;
__be16 firstused;
__u8 holes;
__u8 pad1;
xfs_attr_leaf_map_t freemap[3];
} xfs_attr_leaf_hdr_t;
typedef struct xfs_attr_leaf_entry {
__be32 hashval;
__be16 nameidx;
__u8 flags;
__u8 pad2;
} xfs_attr_leaf_entry_t;
typedef struct xfs_attr_leaf_name_local {
__be16 valuelen;
__u8 namelen;
__u8 nameval[1];
} xfs_attr_leaf_name_local_t;
typedef struct xfs_attr_leaf_name_remote {
__be32 valueblk;
__be32 valuelen;
__u8 namelen;
__u8 name[1];
} xfs_attr_leaf_name_remote_t;
typedef struct xfs_attr_leafblock {
xfs_attr_leaf_hdr_t hdr;
xfs_attr_leaf_entry_t entries[1];
xfs_attr_leaf_name_local_t namelist;
xfs_attr_leaf_name_remote_t valuelist;
} xfs_attr_leafblock_t;
</programlisting>
<para>Each leaf header uses the following magic number:</para>
<programlisting>
#define XFS_ATTR_LEAF_MAGIC        0xfbee
</programlisting>
<para>The hash/index elements in the <command>entries[]</command> array are packed from the top of the block. Name/values grow from the bottom but are not packed. The freemap contains run-length-encoded entries for the free bytes after the <command>entries[]</command> array, but only the three largest runs are stored (smaller runs are dropped). When the <command>freemap</command> does not show enough space for an allocation, the name/value area is compacted and the allocation is tried again. If there still isn't enough space, then the block is split. The name/value structures (both local and remote versions) must be 32-bit aligned.</para>
<para>For attributes with small values (ie. the value can be stored within the leaf), the <command>XFS_ATTR_LOCAL</command> flag is set for the attribute. The entry details are stored using the <command>xfs_attr_leaf_name_local_t</command> structure. For large attribute values that cannot be stored within the leaf, separate filesystem blocks are allocated to store the value. They use the <command>xfs_attr_leaf_name_remote_t</command> structure.</para>
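<para>The 32-bit alignment rule and the two name/value structures determine how much space each entry occupies in the leaf. The following sketch uses hypothetical function names. The local size follows directly from the structure above (3 fixed bytes plus name and value). The 11 byte constant in the remote case follows the Linux kernel's historical computation (9 bytes of fields plus 2 bytes of structure padding) and should be treated as an assumption taken from outside this chapter.</para>
<programlisting><![CDATA[
```c
#define ATTR_LEAF_NAME_ALIGN 4u  /* name/value structures are 32-bit aligned */

static unsigned int round_up4(unsigned int n)
{
    return (n + ATTR_LEAF_NAME_ALIGN - 1u) & ~(ATTR_LEAF_NAME_ALIGN - 1u);
}

/* valuelen (2) + namelen (1) + name + value, rounded up to 4 bytes. */
unsigned int leaf_entsize_local(unsigned int namelen, unsigned int valuelen)
{
    return round_up4(3u + namelen + valuelen);
}

/*
 * valueblk (4) + valuelen (4) + namelen (1) + name, rounded up to 4 bytes.
 * Assumption: 2 bytes of structure padding are counted too (11 fixed bytes).
 */
unsigned int leaf_entsize_remote(unsigned int namelen)
{
    return round_up4(11u + namelen);
}
```
]]></programlisting>
<para>These sizes agree with the xfs_db examples later in this section: a local entry with a 5 byte name and a 6 byte value takes 16 bytes, which is the spacing between the <command>nameidx</command> values 4044, 4060 and 4076, and a remote entry with an 8 byte name takes 20 bytes, which is the <command>hdr.usedbytes</command> value reported for the single "big_attr" attribute.</para>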
<para>
<mediaobject>
<imageobject><imagedata fileref="images/69.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>69</phrase></textobject>
</mediaobject>
</para>
<para>Both local and remote entries can be interleaved as they are only addressed by the hash/index entries. The flag is stored with the hash/index pairs so the appropriate structure can be used.</para>
<para>Since duplicate hash keys are possible, for each hash that matches during a lookup, the actual name string must be compared.</para>
<para>An “incomplete” bit is also used for attribute flags. It shows that an attribute is in the middle of being created and should not be shown to the user if we crash during the time that the bit is set. The bit is cleared when the attribute has finished being set up. This is done because some large attributes cannot be created inside a single transaction.</para>
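<para>The flags byte of a leaf entry can be decoded as sketched below. The bit positions are those used by the Linux kernel's XFS on-disk format definitions; they are not given in this chapter, so treat them as an assumption. The helper function names are hypothetical.</para>
<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values (from the Linux kernel headers, not from this chapter). */
#define XFS_ATTR_LOCAL      (1u << 0)  /* value is stored in the leaf */
#define XFS_ATTR_ROOT       (1u << 1)  /* "trusted" namespace */
#define XFS_ATTR_SECURE     (1u << 2)  /* "secure" namespace */
#define XFS_ATTR_INCOMPLETE (1u << 7)  /* attribute is still being created */

/* Namespace name for an entry; neither bit set means "user". */
const char *attr_namespace(uint8_t flags)
{
    if (flags & XFS_ATTR_ROOT)
        return "trusted";
    if (flags & XFS_ATTR_SECURE)
        return "secure";
    return "user";
}

/* True if the entry uses the local name/value structure. */
bool attr_is_local(uint8_t flags)
{
    return (flags & XFS_ATTR_LOCAL) != 0;
}

/* Entries still marked incomplete must not be shown to the user. */
bool attr_is_visible(uint8_t flags)
{
    return (flags & XFS_ATTR_INCOMPLETE) == 0;
}
```
]]></programlisting>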
<bridgehead>xfs_db Example:</bridgehead>
<para>A single 30KB extended attribute is added to an inode:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.nblocks = 9
core.nextents = 0
core.naextents = 1
core.forkoff = 15
core.aformat = 2 (extents)
...
a.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,37535,9,0]
xfs_db&gt; ablock 0
xfs_db&gt; p
hdr.info.forw = 0
hdr.info.back = 0
hdr.info.magic = 0xfbee
hdr.count = 1
hdr.usedbytes = 20
hdr.firstused = 4076
hdr.holes = 0
hdr.freemap[0-2] = [base,size] 0:[40,4036] 1:[0,0] 2:[0,0]
entries[0] = [hashval,nameidx,incomplete,root,secure,local]
0:[0xfcf89d4f,4076,0,0,0,0]
nvlist[0].valueblk = 0x1
nvlist[0].valuelen = 30692
nvlist[0].namelen = 8
nvlist[0].name = "big_attr"
</programlisting>
<para>Attribute blocks 1 to 8 (filesystem blocks 37536 to 37543) contain the raw binary value data for the attribute.</para>
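<para>The block numbers quoted above follow from simple arithmetic, sketched here with hypothetical function names. The sketch assumes a 4096 byte filesystem block and assumes that remote value blocks hold only raw value data with no per-block header, as in this example.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/*
 * Number of filesystem blocks needed to hold a remote attribute value.
 * Assumption: every byte of each block is available for value data.
 */
uint32_t remote_value_blocks(uint32_t valuelen, uint32_t blocksize)
{
    return (valuelen + blocksize - 1u) / blocksize;
}

/*
 * Filesystem block that backs attribute-fork block `ablock`, given one
 * extent record [startoff, startblock, ...] that covers it.
 */
uint64_t extent_fsblock(uint64_t startoff, uint64_t startblock,
                        uint64_t ablock)
{
    return startblock + (ablock - startoff);
}
```
]]></programlisting>
<para>A value of 30692 bytes needs 8 blocks of 4096 bytes. With the extent record <command>[0,37535,9,0]</command> and <command>valueblk</command> equal to 1, those are attribute blocks 1 to 8, which map to filesystem blocks 37536 to 37543.</para>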
<para>Index 4076 (0xfec) is the offset into the block where the name/value information is. As can be seen by the value, it's at the end of the block:</para>
<programlisting>
xfs_db&gt; type text
xfs_db&gt; p
000: 00 00 00 00 00 00 00 00 fb ee 00 00 00 01 00 14 ................
010: 0f ec 00 00 00 28 0f c4 00 00 00 00 00 00 00 00 ................
020: fc f8 9d 4f 0f ec 00 00 00 00 00 00 00 00 00 00 ...O............
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
...
fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 ................
ff0: 00 00 77 e4 08 62 69 67 5f 61 74 74 72 00 00 00 ..w..big.attr...
</programlisting>
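<para>The first bytes of the raw dump above can be decoded by hand or with a few lines of C. This sketch uses hypothetical names and reads only the fields needed to locate an entry. The byte offsets are derived from the structures in this section together with the dump, assuming a 12 byte <command>xfs_da_blkinfo_t</command> (forw, back, magic and a 2 byte pad).</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

#define LEAF_HDR_BYTES   32u  /* 12 blkinfo + 8 counters + 3 * 4 freemap */
#define LEAF_ENTRY_BYTES  8u  /* hashval + nameidx + flags + pad2 */

struct leaf_hdr_view {
    uint16_t magic, count, usedbytes, firstused;
};

struct leaf_entry_view {
    uint32_t hashval;
    uint16_t nameidx;  /* byte offset of the name/value structure */
    uint8_t  flags;
};

/* Decode the leaf header; return 0 if the magic number matches, else -1. */
int leaf_read_hdr(const uint8_t *blk, size_t len, struct leaf_hdr_view *h)
{
    if (len < LEAF_HDR_BYTES)
        return -1;
    h->magic     = get_be16(blk + 8);
    h->count     = get_be16(blk + 12);
    h->usedbytes = get_be16(blk + 14);
    h->firstused = get_be16(blk + 16);
    return h->magic == 0xfbee ? 0 : -1;
}

/* Decode hash/index entry number i; return 0 on success, -1 if out of range. */
int leaf_read_entry(const uint8_t *blk, size_t len, unsigned int i,
                    struct leaf_entry_view *e)
{
    size_t off = LEAF_HDR_BYTES + (size_t)i * LEAF_ENTRY_BYTES;

    if (off + LEAF_ENTRY_BYTES > len)
        return -1;
    e->hashval = get_be32(blk + off);
    e->nameidx = get_be16(blk + off + 4);
    e->flags   = blk[off + 6];
    return 0;
}
```
]]></programlisting>
<para>Applied to the dump above, the header decodes to a count of 1, 20 used bytes and a first used offset of 4076, and entry 0 decodes to hash 0xfcf89d4f with a name index of 4076 and no flags set.</para>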
<para>A 30KB attribute and a couple of small attributes are added to a file:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.nblocks = 10
core.extsize = 0
core.nextents = 1
core.naextents = 2
core.forkoff = 15
core.aformat = 2 (extents)
...
u.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,81857,1,0]
a.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
0:[0,81858,1,0]
1:[1,182398,8,0]
xfs_db&gt; ablock 0
xfs_db&gt; p
hdr.info.forw = 0
hdr.info.back = 0
hdr.info.magic = 0xfbee
hdr.count = 3
hdr.usedbytes = 52
hdr.firstused = 4044
hdr.holes = 0
hdr.freemap[0-2] = [base,size] 0:[56,3988] 1:[0,0] 2:[0,0]
entries[0-2] = [hashval,nameidx,incomplete,root,secure,local]
0:[0x1e9d3934,4044,0,0,0,1]
1:[0x1e9d3937,4060,0,0,0,1]
2:[0xfcf89d4f,4076,0,0,0,0]
nvlist[0].valuelen = 6
nvlist[0].namelen = 5
nvlist[0].name = "attr2"
nvlist[0].value = "value2"
nvlist[1].valuelen = 6
nvlist[1].namelen = 5
nvlist[1].name = "attr1"
nvlist[1].value = "value1"
nvlist[2].valueblk = 0x1
nvlist[2].valuelen = 30692
nvlist[2].namelen = 8
nvlist[2].name = "big_attr"
</programlisting>
<para>As can be seen in the entries array, the two small attributes have the local flag set and the values are printed.</para>
<para>A raw disk dump shows the attributes. The last attribute added is highlighted (offset 4044 or 0xfcc):</para>
<para>
<inlinemediaobject>
<imageobject><imagedata fileref="images/code/71.png" format="PNG" /></imageobject>
<textobject><phrase>c</phrase></textobject>
</inlinemediaobject>
</para>
</section>
<section id="Node_Attributes"><title>
Node Attributes</title>
<para>When the number of attributes exceeds the space that can fit in one filesystem block (ie. hash, flag, name and local values), the first attribute block becomes the root of a B+tree where the leaves contain the hash/name/value information that was stored in a single leaf block. The inode's attribute format itself remains extent based. The nodes use the <command>xfs_da_intnode_t</command> structure introduced in Node Directories.</para>
<para>The attribute leaf blocks can be located in any order; the only way to find the appropriate leaf is through the hash/before values in the node block. Given a hash to look up, read the node's btree array and find the first <command>hashval</command> in the array that is greater than or equal to the given hash. The attribute, if it exists, can then be found in the block pointed to by the corresponding <command>before</command> value.</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/72.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>72</phrase></textobject>
</mediaobject>
</para>
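<para>The node lookup can be sketched as a scan over the btree array. This is a simplified illustration with hypothetical names, not the kernel's implementation: it uses a linear scan for clarity (the hashes are sorted, so a binary search would also work) and ignores the case of identical hashes that span more than one leaf. The table is copied from the root node in the xfs_db example in this section.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

struct da_node_entry {
    uint32_t hashval;  /* highest hash reachable through this entry */
    uint32_t before;   /* attribute-fork block number of the child */
};

/*
 * Return the child block to search for `hash`, or -1 if the hash is
 * larger than every hashval in the node (the attribute cannot exist).
 */
int64_t da_node_lookup(const struct da_node_entry *btree, size_t count,
                       uint32_t hash)
{
    for (size_t i = 0; i < count; i++) {
        if (btree[i].hashval >= hash)
            return btree[i].before;
    }
    return -1;
}

/* Root node contents from the xfs_db example in this section. */
const struct da_node_entry example_root[14] = {
    { 0x3435122d, 1 },  { 0x343550a9, 14 }, { 0x343553a6, 13 },
    { 0x3436122d, 12 }, { 0x343650a9, 8 },  { 0x343653a6, 7 },
    { 0x343691af, 6 },  { 0x3436d0ab, 11 }, { 0x3436d3a7, 10 },
    { 0x3437122d, 9 },  { 0x3437922e, 3 },  { 0x3437d22a, 5 },
    { 0x3e686c25, 4 },  { 0x3e686fad, 2 },
};
```
]]></programlisting>
<para>Calling <command>da_node_lookup(example_root, 14, 0x3437d1a8)</command> selects the entry with hash <command>0x3437d22a</command> and returns attribute block 5.</para>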
<bridgehead>xfs_db Example:</bridgehead>
<para>An inode with 1000 small attributes with the naming "attribute_n" where 'n' is a number:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.nblocks = 15
core.nextents = 0
core.naextents = 1
core.forkoff = 15
core.aformat = 2 (extents)
...
a.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,525144,15,0]
xfs_db&gt; ablock 0
xfs_db&gt; p
hdr.info.forw = 0
hdr.info.back = 0
hdr.info.magic = 0xfebe
hdr.count = 14
hdr.level = 1
btree[0-13] = [hashval,before]
0:[0x3435122d,1]
1:[0x343550a9,14]
2:[0x343553a6,13]
3:[0x3436122d,12]
4:[0x343650a9,8]
5:[0x343653a6,7]
6:[0x343691af,6]
7:[0x3436d0ab,11]
8:[0x3436d3a7,10]
9:[0x3437122d,9]
10:[0x3437922e,3]
11:[0x3437d22a,5]
12:[0x3e686c25,4]
13:[0x3e686fad,2]
</programlisting>
<para>The hashes are in ascending order in the btree array, and if the hash for the attribute we are looking up is before the entry, we go to the addressed attribute block.</para>
<para>For example, to lookup attribute "attribute_267":</para>
<programlisting>
xfs_db&gt; hash attribute_267
0x3437d1a8
</programlisting>
<para>In the root btree node, this falls between <command>0x3437922e</command> and <command>0x3437d22a</command>, therefore btree entry 11, which points to attribute block 5, will contain the entry.</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/73-74.png" format="PNG" /></imageobject>
<textobject><phrase>code73-74</phrase></textobject>
</mediaobject>
<para>Each of the hash entries has the <command>XFS_ATTR_LOCAL</command> flag set (1), which means the attribute's value follows immediately after the name. The raw disk dump of the name/value pair at offset 2864 (0xb30) is highlighted, with "value_267\d" following immediately after the name:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/74.png" format="PNG" /></imageobject>
<textobject><phrase>code74</phrase></textobject>
</mediaobject>
<para>Each entry starts on a 32-bit (4 byte) boundary, therefore the highlighted entry has 2 unused bytes after it.</para>
</section>
<section id="Btree_Attributes"><title>B+tree Attributes</title>
<para>When the attribute's extent map in an inode grows beyond the available space, the inode's attribute format is changed to a "btree". The inode contains the root node of the extent B+tree, which then addresses the leaves that contain the extent arrays for the attribute data. The attribute data itself in the allocated filesystem blocks uses the same layout and structures as described in Node Attributes.</para>
<para>Refer to the previous section on B+tree Data Extents for more information on XFS B+tree extents.</para>
<bridgehead>xfs_db Example:</bridgehead>
<para>2000 attributes with 729 byte values are added to a file:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.nblocks = 640
core.extsize = 0
core.nextents = 1
core.naextents = 274
core.forkoff = 15
core.aformat = 3 (btree)
...
a.bmbt.level = 1
a.bmbt.numrecs = 2
a.bmbt.keys[1-2] = [startoff] 1:[0] 2:[219]
a.bmbt.ptrs[1-2] = 1:83162 2:109968
xfs_db&gt; fsblock 83162
xfs_db&gt; type bmapbtd
xfs_db&gt; p
magic = 0x424d4150
level = 0
numrecs = 127
leftsib = null
rightsib = 109968
recs[1-127] = [startoff,startblock,blockcount,extentflag]
1:[0,81870,1,0]
...
xfs_db&gt; fsblock 109968
xfs_db&gt; type bmapbtd
xfs_db&gt; p
magic = 0x424d4150
level = 0
numrecs = 147
leftsib = 83162
rightsib = null
recs[1-147] = [startoff,startblock,blockcount,extentflag]
...
(which is fsblock 81870)
xfs_db&gt; ablock 0
xfs_db&gt; p
hdr.info.forw = 0
hdr.info.back = 0
hdr.info.magic = 0xfebe
hdr.count = 2
hdr.level = 2
btree[0-1] = [hashval,before] 0:[0x343612a6,513] 1:[0x3e686fad,512]
</programlisting>
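<para>To find the extent record for a given attribute block, the root's keys select which bmap leaf to read. The sketch below, with hypothetical names, picks the child whose key is the largest <command>startoff</command> not exceeding the target offset. It is a simplified illustration of a single-level root, not the kernel's lookup code.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/*
 * keys[] holds the first startoff covered by each child, in ascending
 * order; ptrs[] holds the matching child filesystem block numbers.
 * Store the child covering `fileoff` in *child and return 0, or return
 * -1 if fileoff is below the first key.
 */
int bmbt_pick_child(const uint64_t *keys, const uint64_t *ptrs,
                    size_t numrecs, uint64_t fileoff, uint64_t *child)
{
    int found = -1;

    for (size_t i = 0; i < numrecs; i++) {
        if (keys[i] > fileoff)
            break;
        *child = ptrs[i];
        found = 0;
    }
    return found;
}
```
]]></programlisting>
<para>With the keys 0 and 219 and the pointers 83162 and 109968 from the dump above, attribute block 0 is described by the leaf at filesystem block 83162, while attribute blocks 512 and 513 are described by the leaf at 109968. The two leaves hold 127 and 147 records, which add up to the 274 extents reported in <command>core.naextents</command>.</para>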
<para>The extent B+tree has two leaves that specify the 274 extents used for the attributes. Looking at the first block, it can be seen that the attribute B+tree is two levels deep. The two blocks at offset 513 and 512 (i.e. accessed using the <command>ablock</command> command) are intermediate <command>xfs_da_intnode_t</command> nodes that index all the attribute leaves.</para></section></chapter>