|  | .. SPDX-License-Identifier: GPL-2.0 | 
|  |  | 
|  | ====================================== | 
|  | EROFS - Enhanced Read-Only File System | 
|  | ====================================== | 
|  |  | 
|  | Overview | 
|  | ======== | 
|  |  | 
|  | EROFS stands for Enhanced Read-Only File System.  It aims to be a generic | 
|  | read-only filesystem solution for various read-only use cases, rather than | 
|  | focusing only on saving storage space without regard for the side effects | 
|  | on runtime performance. | 
|  |  | 
|  | It is designed for flexibility, feature extensibility and user payload | 
|  | friendliness.  At the same time, it is kept as a simple, random-access | 
|  | friendly, high-performance filesystem that avoids unneeded I/O amplification | 
|  | and memory-resident overhead compared to similar approaches. | 
|  |  | 
|  | It is intended to be a better choice for the following scenarios: | 
|  |  | 
|  | - read-only storage media, or | 
|  |  | 
|  | - part of a fully trusted read-only solution, which needs to be immutable | 
|  |   and bit-for-bit identical to the official golden image of a release for | 
|  |   security or other reasons, and | 
|  |  | 
|  | - setups that aim to minimize extra storage space with guaranteed end-to-end | 
|  |   performance by using compact layout, transparent file compression and | 
|  |   direct access, especially embedded devices with limited memory and | 
|  |   high-density hosts with numerous containers. | 
|  |  | 
|  | Here are the main features of EROFS: | 
|  |  | 
|  | - Little endian on-disk design; | 
|  |  | 
|  | - Block-based distribution and file-based distribution over fscache are | 
|  | supported; | 
|  |  | 
|  | - Support multiple devices to refer to external blobs, which can be used | 
|  | for container images; | 
|  |  | 
|  | - 32-bit block addresses for each device, therefore 16TiB address space at | 
|  | most with 4KiB block size for now; | 
|  |  | 
|  | - Two inode layouts for different requirements: | 
|  |  | 
|  | =====================  ============  ====================================== | 
|  |                        compact (v1)  extended (v2) | 
|  | =====================  ============  ====================================== | 
|  | Inode metadata size    32 bytes      64 bytes | 
|  | Max file size          4 GiB         16 EiB (also limited by max. vol size) | 
|  | Max uids/gids          65536         4294967296 | 
|  | Per-inode timestamp    no            yes (64 + 32-bit timestamp) | 
|  | Max hardlinks          65536         4294967296 | 
|  | Metadata reserved      8 bytes       18 bytes | 
|  | =====================  ============  ====================================== | 
|  |  | 
|  | - Support extended attributes as an option; | 
|  |  | 
|  | - Support a bloom filter that speeds up negative extended attribute lookups; | 
|  |  | 
|  | - Support POSIX.1e ACLs by using extended attributes; | 
|  |  | 
|  | - Support transparent data compression as an option: | 
|  |   LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis.  In | 
|  |   addition, in-place decompression is supported to avoid bounce buffers for | 
|  |   compressed data and unnecessary page cache thrashing; | 
|  |  | 
|  | - Support chunk-based data deduplication and rolling-hash compressed data | 
|  | deduplication; | 
|  |  | 
|  | - Support tail-packing inline data, as opposed to byte-addressed unaligned | 
|  |   metadata or smaller block size alternatives; | 
|  |  | 
|  | - Support merging tail-end data into a special inode as fragments; | 
|  |  | 
|  | - Support large folios to make use of THPs (Transparent Hugepages); | 
|  |  | 
|  | - Support direct I/O on uncompressed files to avoid double caching for loop | 
|  | devices; | 
|  |  | 
|  | - Support FSDAX on uncompressed images for secure containers and ramdisks in | 
|  | order to get rid of unnecessary page cache. | 
|  |  | 
|  | - Support file-based on-demand loading with the Fscache infrastructure. | 
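
The 16TiB limit quoted in the block-address bullet above follows directly from the address width and the block size. A quick check in Python (the variable names are illustrative only and do not come from the EROFS sources):

```python
# 32-bit block addresses can name 2**32 distinct blocks per device.
BLOCK_ADDR_BITS = 32
BLOCK_SIZE = 4096  # 4KiB block size, as stated in the feature list

max_blocks = 2 ** BLOCK_ADDR_BITS
max_bytes = max_blocks * BLOCK_SIZE

TIB = 2 ** 40
print(max_bytes // TIB, "TiB")  # prints "16 TiB"
```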
|  |  | 
|  | The following git tree provides the file system user-space tools under | 
|  | development, such as a formatting tool (mkfs.erofs), an on-disk consistency & | 
|  | compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs): | 
|  |  | 
|  | - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git | 
|  |  | 
|  | For more information, please also refer to the documentation site: | 
|  |  | 
|  | - https://erofs.docs.kernel.org | 
|  |  | 
|  | Bug reports and patches are welcome.  Please send them to the linux-erofs | 
|  | mailing list: | 
|  |  | 
|  | - linux-erofs mailing list   <linux-erofs@lists.ozlabs.org> | 
|  |  | 
|  | Mount options | 
|  | ============= | 
|  |  | 
|  | ===================    ========================================================= | 
|  | (no)user_xattr         Set up Extended User Attributes. Note: xattr is enabled | 
|  |                        by default if CONFIG_EROFS_FS_XATTR is selected. | 
|  | (no)acl                Set up POSIX Access Control List. Note: acl is enabled | 
|  |                        by default if CONFIG_EROFS_FS_POSIX_ACL is selected. | 
|  | cache_strategy=%s      Select a strategy for cached decompression from now on: | 
|  |  | 
|  |                        ==========  ============================================= | 
|  |                          disabled  In-place I/O decompression only; | 
|  |                         readahead  Cache the last incomplete compressed physical | 
|  |                                    cluster for further reading. It still does | 
|  |                                    in-place I/O decompression for the remaining | 
|  |                                    compressed physical clusters; | 
|  |                        readaround  Cache both ends of incomplete compressed | 
|  |                                    physical clusters for further reading. | 
|  |                                    It still does in-place I/O decompression | 
|  |                                    for the remaining compressed physical | 
|  |                                    clusters. | 
|  |                        ==========  ============================================= | 
|  | dax={always,never}     Use direct access (no page cache).  See | 
|  |                        Documentation/filesystems/dax.rst. | 
|  | dax                    A legacy option which is an alias for ``dax=always``. | 
|  | device=%s              Specify a path to an extra device to be used together. | 
|  | fsid=%s                Specify a filesystem image ID for Fscache back-end. | 
|  | domain_id=%s           Specify a domain ID in fscache mode so that different images | 
|  |                        with the same blobs under a given domain ID can share storage. | 
|  | fsoffset=%llu          Specify block-aligned filesystem offset for the primary device. | 
|  | ===================    ========================================================= | 
|  |  | 
|  | Sysfs Entries | 
|  | ============= | 
|  |  | 
|  | Information about mounted EROFS filesystems can be found in /sys/fs/erofs. | 
|  | Each mounted filesystem has a directory in /sys/fs/erofs named after its | 
|  | device (e.g., /sys/fs/erofs/sda). | 
|  | (see also Documentation/ABI/testing/sysfs-fs-erofs) | 
|  |  | 
|  | On-disk details | 
|  | =============== | 
|  |  | 
|  | Summary | 
|  | ------- | 
|  | Different from other read-only file systems, an EROFS volume is designed | 
|  | to be as simple as possible:: | 
|  |  | 
|  |                                 |-> aligned with the block size | 
|  |    ____________________________________________________________ | 
|  |   | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | | 
|  |   |_|__|_|_____|__________|_____|______|__________|_____|______| | 
|  |   0 +1K | 
|  |  | 
|  | All data areas should be aligned with the block size, but metadata areas | 
|  | may not be. All metadata can now be observed in two different spaces (views): | 
|  |  | 
|  | 1. Inode metadata space | 
|  |  | 
|  | Each valid inode should be aligned with an inode slot, which is a fixed | 
|  | value (32 bytes) and designed to be kept in line with compact inode size. | 
|  |  | 
|  | Each inode can be directly found with the following formula: | 
|  |      inode offset = meta_blkaddr * block_size + 32 * nid | 
|  |  | 
|  | :: | 
|  |  | 
|  |                                  |-> aligned with 8B | 
|  |                                             |-> followed closely | 
|  |      + meta_blkaddr blocks                                      |-> another slot | 
|  |       _____________________________________________________________________ | 
|  |      |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ... | 
|  |      |________|_______|(optional)|(optional)|__(optional)_|_____|__________ | 
|  |               |-> aligned with the inode slot size | 
|  |                    .                   . | 
|  |                  .                         . | 
|  |                .                              . | 
|  |              .                                    . | 
|  |            .                                         . | 
|  |          .                                              . | 
|  |        .____________________________________________________|-> aligned with 4B | 
|  |        | xattr_ibody_header | shared xattrs | inline xattrs | | 
|  |        |____________________|_______________|_______________| | 
|  |        |->    12 bytes    <-|->x * 4 bytes<-|               . | 
|  |                            .                .                 . | 
|  |                      .                      .                   . | 
|  |                 .                           .                     . | 
|  |             ._______________________________.______________________. | 
|  |             | id | id | id | id |  ... | id | ent | ... | ent| ... | | 
|  |             |____|____|____|____|______|____|_____|_____|____|_____| | 
|  |                                             |-> aligned with 4B | 
|  |                                                         |-> aligned with 4B | 
|  |  | 
|  | An inode can be 32 or 64 bytes in size, which can be distinguished by a | 
|  | common field that all inode versions have -- i_format:: | 
|  |  | 
|  |         __________________               __________________ | 
|  |        |     i_format     |             |     i_format     | | 
|  |        |__________________|             |__________________| | 
|  |        |        ...       |             |        ...       | | 
|  |        |                  |             |                  | | 
|  |        |__________________| 32 bytes    |                  | | 
|  |                                         |                  | | 
|  |                                         |__________________| 64 bytes | 
|  |  | 
|  | Xattrs, extents and inline data are placed after the corresponding inode with | 
|  | proper alignment, and they may be optional depending on the data mapping. | 
|  | Currently, 5 data layouts are supported in total: | 
|  |  | 
|  | ==  ==================================================================== | 
|  |  0  flat file data without data inline (no extent); | 
|  |  1  fixed-sized output data compression (with non-compacted indexes); | 
|  |  2  flat file data with tail packing data inline (no extent); | 
|  |  3  fixed-sized output data compression (with compacted indexes, v5.3+); | 
|  |  4  chunk-based file (v5.15+). | 
|  | ==  ==================================================================== | 
|  |  | 
|  | The size of the optional xattrs is indicated by i_xattr_count in the inode | 
|  | header. Large xattrs or xattrs shared by many different files can be | 
|  | stored in shared xattrs metadata rather than inlined right after the inode. | 
|  |  | 
|  | 2. Shared xattrs metadata space | 
|  |  | 
|  | Shared xattrs space is similar to the above inode space: it starts at | 
|  | a specific block indicated by xattr_blkaddr, and entries are organized one | 
|  | by one with proper alignment. | 
|  |  | 
|  | Each shared xattr can also be directly found by the following formula: | 
|  |      xattr offset = xattr_blkaddr * block_size + 4 * xattr_id | 
|  |  | 
|  | :: | 
|  |  | 
|  |                            |-> aligned by  4 bytes | 
|  |     + xattr_blkaddr blocks                     |-> aligned with 4 bytes | 
|  |      _________________________________________________________________________ | 
|  |     |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ... | 
|  |     |________|_____________|_____________|_____|______________|_______________ | 
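
The two lookup formulas given for the inode and shared xattr metadata spaces can be written down directly. The following is a minimal sketch; the function names and the sample superblock values are hypothetical and only serve to illustrate the arithmetic:

```python
INODE_SLOT_SIZE = 32  # fixed inode slot size in bytes
XATTR_ID_UNIT = 4     # shared xattr IDs count in 4-byte units


def inode_offset(meta_blkaddr: int, block_size: int, nid: int) -> int:
    """Byte offset of the inode with the given nid."""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr: int, block_size: int, xattr_id: int) -> int:
    """Byte offset of the shared xattr entry with the given xattr_id."""
    return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id


# Hypothetical superblock values, for illustration only.
print(inode_offset(meta_blkaddr=2, block_size=4096, nid=36))
print(shared_xattr_offset(xattr_blkaddr=1, block_size=4096, xattr_id=5))
```

Since an inode together with its trailing xattrs, extents and inline data can span more than one 32-byte slot, valid nids are not necessarily consecutive integers.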
|  |  | 
|  | Directories | 
|  | ----------- | 
|  | All directories are now organized in a compact on-disk format. Note that | 
|  | each directory block is divided into index and name areas in order to support | 
|  | random file lookup, and all directory entries are _strictly_ recorded in | 
|  | alphabetical order in order to support an improved prefix binary search | 
|  | algorithm (refer to the related source code for details). | 
|  |  | 
|  | :: | 
|  |  | 
|  |                   ___________________________ | 
|  |                  /                           | | 
|  |                 /              ______________|________________ | 
|  |                /              /              | nameoff1       | nameoffN-1 | 
|  |   ____________.______________._______________v________________v__________ | 
|  |  | dirent | dirent | ... | dirent | filename | filename | ... | filename | | 
|  |  |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| | 
|  |       \                           ^ | 
|  |        \                          |                           * could have | 
|  |         \                         |                             trailing '\0' | 
|  |          \________________________| nameoff0 | 
|  |                              Directory block | 
|  |  | 
|  | Note that apart from the offset of the first filename, nameoff0 also indicates | 
|  | the total number of directory entries in this block, so there is no need to | 
|  | introduce another on-disk field for it. | 
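
The way nameoff0 doubles as an entry count can be sketched as follows. This is an illustrative parser, not the kernel implementation: the 12-byte dirent layout (64-bit nid, 16-bit nameoff, one file type byte, one reserved byte) is an assumption based on struct erofs_dirent in erofs_fs.h, and `build_dir_block` is a hypothetical helper that only exists to produce sample input.

```python
import struct

# Assumed on-disk dirent: nid, nameoff, file_type, reserved (12 bytes).
DIRENT = struct.Struct("<QHBB")


def parse_dir_block(block: bytes):
    """Return a list of (nid, file_type, name) from one directory block."""
    # nameoff of dirent 0 marks where the index area ends, so dividing it
    # by the dirent size gives the number of dirents in this block.
    _, nameoff0, _, _ = DIRENT.unpack_from(block, 0)
    count = nameoff0 // DIRENT.size

    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    entries = []
    for i, (nid, nameoff, file_type, _) in enumerate(dirents):
        # A name ends where the next one starts; the last one runs to the
        # end of the block and may be cut short by a trailing '\0'.
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]
        entries.append((nid, file_type, name))
    return entries


def build_dir_block(items, block_size=4096):
    """Hypothetical helper: pack (name, nid, file_type) items, sorted by name."""
    items = sorted(items)
    base = len(items) * DIRENT.size
    index, names = b"", b""
    for name, nid, file_type in items:
        index += DIRENT.pack(nid, base + len(names), file_type, 0)
        names += name
    return (index + names).ljust(block_size, b"\0")


# The nid and file type values below are made up for the example.
block = build_dir_block([(b"usr", 40, 2), (b"bin", 38, 2), (b"init", 42, 1)])
for entry in parse_dir_block(block):
    print(entry)
```

Because the names are stored in sorted order, a lookup can binary-search the index area instead of scanning every entry.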
|  |  | 
|  | Chunk-based files | 
|  | ----------------- | 
|  | In order to support chunk-based data deduplication, a new inode data layout has | 
|  | been supported since Linux v5.15: files are split into equal-sized data chunks, | 
|  | with the ``extents`` area of the inode metadata indicating how to get the chunk | 
|  | data. This area can be either a simple 4-byte block address array or an array | 
|  | in the 8-byte chunk index form (see struct erofs_inode_chunk_index in | 
|  | erofs_fs.h for more details). | 
|  |  | 
|  | Note that chunk-based files are all uncompressed for now. | 
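
For the simple 4-byte block address array form, mapping a file offset to a block address might look like the sketch below. The function name and the sample values are hypothetical; holes and the 8-byte chunk index form are deliberately not handled.

```python
import struct


def chunk_blkaddr(blockmap: bytes, offset: int, chunk_size: int, block_size: int) -> int:
    """Map a file byte offset to a block address via a 4-byte address array."""
    chunk_no = offset // chunk_size
    # Each array element is the little-endian start block of one chunk.
    (chunk_start,) = struct.unpack_from("<I", blockmap, 4 * chunk_no)
    return chunk_start + (offset % chunk_size) // block_size


# Three 16KiB chunks; chunks 0 and 2 share the same data (deduplicated).
blockmap = struct.pack("<3I", 100, 250, 100)
print(chunk_blkaddr(blockmap, 20000, chunk_size=16384, block_size=4096))
print(chunk_blkaddr(blockmap, 40000, chunk_size=16384, block_size=4096))
```

Deduplication shows up here simply as two array elements carrying the same start address.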
|  |  | 
|  | Long extended attribute name prefixes | 
|  | ------------------------------------- | 
|  | There are use cases where extended attributes with different values can have | 
|  | only a few common prefixes (such as overlayfs xattrs).  The predefined prefixes | 
|  | work inefficiently in both image size and runtime performance in such cases. | 
|  |  | 
|  | The long xattr name prefixes feature is introduced to address this issue.  The | 
|  | overall idea is that, apart from the existing predefined prefixes, the xattr | 
|  | entry could also refer to user-specified long xattr name prefixes, e.g. | 
|  | "trusted.overlay.". | 
|  |  | 
|  | When referring to a long xattr name prefix, the highest bit (bit 7) of | 
|  | erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole | 
|  | represent the index of the referred long name prefix among all long name | 
|  | prefixes.  Therefore, only the trailing part of the name after the long | 
|  | xattr name prefix is stored in erofs_xattr_entry.e_name, which can be empty if | 
|  | the full xattr name exactly matches its long xattr name prefix. | 
|  |  | 
|  | All long xattr prefixes are stored one by one in the packed inode as long as | 
|  | the packed inode is valid, or in the meta inode otherwise.  The | 
|  | xattr_prefix_count (of the on-disk superblock) indicates the total number of | 
|  | long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start | 
|  | offset of long name prefixes in the packed/meta inode.  Note that long extended | 
|  | attribute name prefixes are disabled if xattr_prefix_count is 0. | 
|  |  | 
|  | Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4), | 
|  | where len represents the total size of the data part.  The data part is actually | 
|  | represented by 'struct erofs_xattr_long_prefix', where base_index represents the | 
|  | index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for | 
|  | "trusted.overlay." long name prefix, while the infix string keeps the string | 
|  | after stripping the short prefix, e.g. "overlay." for the example above. | 
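
The record format and the e_name_index encoding described above can be illustrated with a small decoder. This is a sketch under stated assumptions: the numeric indexes 1 ("user.") and 4 ("trusted.") for the predefined prefixes are taken from erofs_fs.h rather than from this text, and `pack_prefix` is a hypothetical helper used only to build sample input.

```python
import struct

# Assumed predefined (short) prefix indexes; see the lead-in above.
SHORT_PREFIXES = {1: b"user.", 4: b"trusted."}
LONG_PREFIX_BIT = 0x80  # bit 7 of e_name_index


def parse_long_prefixes(buf: bytes, count: int):
    """Decode `count` records of the form ALIGN({__le16 len, data}, 4)."""
    prefixes, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from("<H", buf, pos)
        data = buf[pos + 2 : pos + 2 + length]
        # data is {base_index, infix}: one byte followed by the infix string.
        prefixes.append((data[0], data[1:]))
        pos = (pos + 2 + length + 3) & ~3  # round up to a 4-byte boundary
    return prefixes


def full_xattr_name(e_name_index: int, e_name: bytes, long_prefixes) -> bytes:
    """Rebuild the full xattr name from an entry's index and stored name."""
    if e_name_index & LONG_PREFIX_BIT:
        base_index, infix = long_prefixes[e_name_index & 0x7F]
        return SHORT_PREFIXES[base_index] + infix + e_name
    return SHORT_PREFIXES[e_name_index] + e_name


def pack_prefix(base_index: int, infix: bytes) -> bytes:
    """Hypothetical helper: encode one long prefix record with padding."""
    data = bytes([base_index]) + infix
    rec = struct.pack("<H", len(data)) + data
    return rec.ljust((len(rec) + 3) & ~3, b"\0")


buf = pack_prefix(4, b"overlay.") + pack_prefix(1, b"app.")
table = parse_long_prefixes(buf, 2)
print(full_xattr_name(0x80, b"opaque", table))
print(full_xattr_name(0x81, b"", table))
print(full_xattr_name(4, b"foo", table))
```

The second call shows the empty e_name case, where the full xattr name is exactly the long prefix.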
|  |  | 
|  | Data compression | 
|  | ---------------- | 
|  | EROFS implements fixed-sized output compression, which generates fixed-sized | 
|  | compressed data blocks from variable-sized input, in contrast to other existing | 
|  | fixed-sized input solutions. Relatively higher compression ratios can be | 
|  | achieved with fixed-sized output compression, since today's popular data | 
|  | compression algorithms are mostly LZ77-based and such a fixed-sized output | 
|  | approach can benefit from the historical dictionary (aka. sliding window). | 
|  |  | 
|  | In detail, original (uncompressed) data is turned into several variable-sized | 
|  | extents and, meanwhile, compressed into physical clusters (pclusters). | 
|  | In order to record each variable-sized extent, logical clusters (lclusters) are | 
|  | introduced as the basic unit of compress indexes to indicate whether a new | 
|  | extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now | 
|  | fixed in block size, as illustrated below:: | 
|  |  | 
|  |            |<-    variable-sized extent    ->|<-       VLE         ->| | 
|  |          clusterofs                        clusterofs              clusterofs | 
|  |            |                                 |                       | | 
|  |   _________v_________________________________v_______________________v________ | 
|  |   ... |    .         |              |        .     |              |  .   ... | 
|  |   ____|____._________|______________|________.___ _|______________|__.________ | 
|  |       |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| | 
|  |            (HEAD)        (NONHEAD)       (HEAD)        (NONHEAD)    . | 
|  |             .             CBLKCNT            .                    . | 
|  |              .                               .                  . | 
|  |               .                              .                . | 
|  |         _______._____________________________.______________._________________ | 
|  |            ... |              |              |              | ... | 
|  |         _______|______________|______________|______________|_________________ | 
|  |                |->      big pcluster       <-|-> pcluster <-| | 
|  |  | 
|  | A physical cluster can be seen as a container of physical compressed blocks | 
|  | which contain compressed data. Previously, only lcluster-sized (4KB) pclusters | 
|  | were supported. With the big pcluster feature (available since Linux v5.13), | 
|  | a pcluster can be a multiple of the lcluster size. | 
|  |  | 
|  | For each HEAD lcluster, clusterofs is recorded to indicate where a new extent | 
|  | starts and blkaddr is used to seek the compressed data. For each NONHEAD | 
|  | lcluster, delta0 and delta1 are available instead of blkaddr to indicate the | 
|  | distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is | 
|  | also a HEAD lcluster except that its data is uncompressed. See the comments | 
|  | around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. | 
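
The HEAD/NONHEAD scheme described above can be sketched with a simplified in-memory model. This is not the on-disk index format: the `Lcluster` tuple, the `find_extent` function and the sample index are hypothetical, and big pclusters are ignored, so delta0 is always a plain distance here.

```python
from typing import NamedTuple

LCLUSTER_SIZE = 4096


class Lcluster(NamedTuple):
    kind: str            # "HEAD", "PLAIN" or "NONHEAD"
    clusterofs: int = 0  # HEAD/PLAIN: where the new extent starts
    blkaddr: int = 0     # HEAD/PLAIN: where the stored data lives
    delta0: int = 0      # NONHEAD: distance back to its HEAD lcluster
    delta1: int = 0      # NONHEAD: distance to the next HEAD lcluster


def find_extent(index, offset):
    """Return (extent_start, blkaddr, compressed) for a logical offset."""
    lcn, ofs = divmod(offset, LCLUSTER_SIZE)
    entry = index[lcn]
    if entry.kind != "NONHEAD" and ofs < entry.clusterofs:
        # Bytes before clusterofs still belong to the previous extent.
        # Assumes lcluster 0 always has clusterofs 0, so lcn stays >= 0.
        lcn -= 1
        entry = index[lcn]
    if entry.kind == "NONHEAD":
        # Jump straight back to the HEAD lcluster of this extent.
        lcn -= entry.delta0
        entry = index[lcn]
    start = lcn * LCLUSTER_SIZE + entry.clusterofs
    return start, entry.blkaddr, entry.kind == "HEAD"


index = [
    Lcluster("HEAD", clusterofs=0, blkaddr=10),
    Lcluster("NONHEAD", delta0=1, delta1=1),
    Lcluster("HEAD", clusterofs=1000, blkaddr=11),
    Lcluster("PLAIN", clusterofs=500, blkaddr=12),
]
for off in (5000, 8392, 9192, 12388, 12888):
    print(off, find_extent(index, off))
```

In this model a PLAIN lcluster is handled exactly like HEAD, except that the returned flag reports its data as uncompressed.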
|  |  | 
|  | If big pcluster is enabled, the pcluster size in lclusters needs to be recorded | 
|  | as well. The delta0 field of the first NONHEAD lcluster stores the compressed | 
|  | block count together with a special flag, which makes it a so-called CBLKCNT | 
|  | NONHEAD lcluster. Its real delta0 does not need to be stored because it is | 
|  | always 1, as illustrated below:: | 
|  |  | 
|  |    __________________________________________________________ | 
|  |   | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD | | 
|  |   |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| | 
|  |      |<----- a big pcluster (with CBLKCNT) ------>|<--  -->| | 
|  |            a lcluster-sized pcluster (without CBLKCNT) ^ | 
|  |  | 
|  | If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, | 
|  | but the size of such a pcluster is then known to be 1 lcluster anyway. | 
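
The rule for deriving the pcluster size can be sketched as follows. Again this uses a simplified, hypothetical in-memory model (`Lcluster`, `pcluster_blocks`) rather than the real on-disk encoding, where the count and its flag are packed into delta0.

```python
from typing import NamedTuple, Optional


class Lcluster(NamedTuple):
    kind: str                      # "HEAD" or "NONHEAD"
    cblkcnt: Optional[int] = None  # set only on a CBLKCNT NONHEAD lcluster


def pcluster_blocks(index, head_lcn: int) -> int:
    """Compressed block count of the pcluster starting at a HEAD lcluster."""
    assert index[head_lcn].kind == "HEAD"
    nxt = head_lcn + 1
    if nxt < len(index) and index[nxt].cblkcnt is not None:
        # The first NONHEAD lcluster carries the compressed block count.
        return index[nxt].cblkcnt
    # Another HEAD follows (or nothing does): there is no room for
    # CBLKCNT, so the pcluster is a single lcluster-sized block.
    return 1


# Mirrors the figure: one big pcluster followed by two small ones.
index = [
    Lcluster("HEAD"),
    Lcluster("NONHEAD", cblkcnt=2),
    Lcluster("NONHEAD"),
    Lcluster("NONHEAD"),
    Lcluster("HEAD"),
    Lcluster("HEAD"),
]
print([pcluster_blocks(index, lcn) for lcn in (0, 4, 5)])
```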
|  |  | 
|  | Since Linux v6.1, each pcluster can be used for multiple variable-sized extents, | 
|  | therefore it can be used for compressed data deduplication. |