| % public-inbox developer manual |
| |
| =head1 NAME |
| |
| public-inbox-v2-format - structure of public inbox v2 archives |
| |
| =head1 DESCRIPTION |
| |
| The v2 format is designed primarily to address several |
| scalability problems of the original format described at |
| L<public-inbox-v1-format(5)>. It also handles messages with |
| conflicting Message-IDs. |
| |
| =head1 INBOX LAYOUT |
| |
| The key change in v2 is that the inbox is no longer a bare git |
| repository, but a directory containing two or more git |
| repositories. v2 divides git repositories by time into "epochs" |
| and Xapian databases into "shards" for parallelism. |
| |
| =head2 INBOX OVERVIEW AND DEFINITIONS |
| |
| $EPOCH - Integer starting with 0 based on time |
| $SCHEMA_VERSION - DB schema version (for Xapian) |
| $SHARD - Integer starting with 0 based on parallelism |
| |
| foo/ # "foo" is the name of the inbox |
| - inbox.lock # lock file to protect global state |
| - git/$EPOCH.git # normal git repositories |
| - all.git # empty, alternates to $EPOCH.git |
| - xap$SCHEMA_VERSION/$SHARD # per-shard Xapian DB |
| - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading |
| - msgmap.sqlite3 # same as the v1 msgmap |
| |
| For blob lookups, the reader only needs to open the "all.git" |
| repository with $GIT_DIR/objects/info/alternates which references |
| every $EPOCH.git repo. |
| |
| Individual $EPOCH.git repos DO NOT use alternates themselves, as |
| git currently limits the nesting depth of alternates to 5. |
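| |
| As a rough illustration of the layout above, the following Python |
| sketch builds the alternates entries "all.git" would need for a set |
| of epochs. The function names and the choice of relative paths are |
| assumptions made for this example, not a description of what |
| public-inbox itself writes. |

```python
import os


def alternates_entries(epochs):
    """Return one alternates line per epoch, relative to all.git/objects."""
    # git resolves relative alternates against the object directory that
    # holds info/alternates, hence the leading "../../".
    return [f"../../git/{epoch}.git/objects" for epoch in sorted(epochs)]


def write_alternates(inbox_dir, epochs):
    """Write all.git/objects/info/alternates for the given epochs."""
    info_dir = os.path.join(inbox_dir, "all.git", "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    path = os.path.join(info_dir, "alternates")
    with open(path, "w") as fh:
        for line in alternates_entries(epochs):
            fh.write(line + "\n")
    return path
```

| Because only "all.git" carries alternates, the nesting depth stays |
| at one regardless of how many epochs exist. |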
| |
| =head2 GIT EPOCHS |
| |
| One of the inherent scalability problems with git itself is the |
| full history of a project must be stored and carried around to |
| all clients. To address this problem, the v2 format uses |
| multiple git repositories, stored as time-based "epochs". |
| |
| We currently divide epochs into segments of roughly one gigabyte, |
| but this size may become configurable in the future if needed. |
| |
| A pleasant side-effect of this design is the git packs of older |
| epochs are stable, allowing them to be cloned without requiring |
| expensive pack generation. This also allows clients to clone |
| only the epochs they are interested in to save bandwidth and |
| storage. |
| |
| To minimize changes to existing v1-based code and simplify our |
| code, we use the "alternates" mechanism described in |
| L<gitrepository-layout(5)> to link all the epoch repositories |
| with a single read-only "all.git" endpoint. |
| |
| Processes retrieve blobs via the "all.git" repository, while |
| writers write blobs directly to epochs. |
| |
| =head2 GIT TREE LAYOUT |
| |
| One key problem specific to v1 was that large trees were |
| frequently a performance problem: name lookups are expensive and |
| unpredictable file names left few deltafication opportunities. |
| As a result, all Xapian-enabled installations retrieve |
| blob object_ids directly in v1, bypassing tree lookups. |
| |
| While dividing git repositories into epochs caps the growth of |
| trees, worst-case tree size was still unnecessary overhead and |
| worth eliminating. |
| |
| So in contrast to the big trees of v1, the v2 git tree contains |
| only a single file at the top-level of the tree, either 'm' (for |
| 'mail' or 'message') or 'd' (for deleted). A tree does not have |
| 'm' and 'd' at the same time. |
| |
| Mail is still stored in blobs (instead of inline with the commit |
| object) as we still need a stable reference in the indices in |
| case commit history is rewritten to comply with legal |
| requirements. |
| |
| After-the-fact invocations of L<public-inbox-index(1)> will ignore |
| messages written to 'd' after they are written to 'm'. |
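| |
| The effect of the 'm' and 'd' files on a later indexing run can be |
| sketched as a replay of the commit history. This is only a model |
| under stated assumptions: the function name is hypothetical, commits |
| are given oldest first, and each commit is reduced to the file name |
| it wrote and the blob it points to. |

```python
def live_blobs(commits):
    """Replay (filename, blob_id) pairs, oldest first, and return the
    blob ids that an after-the-fact index would still contain."""
    live = []
    for filename, blob_id in commits:
        if filename == "m":
            # A new message was written to 'm'.
            live.append(blob_id)
        elif filename == "d":
            # The same blob written to 'd' marks the message as deleted.
            live = [b for b in live if b != blob_id]
        else:
            raise ValueError(f"unexpected top-level file: {filename!r}")
    return live
```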
| |
| Deltafication is not significantly improved over v1, but overall |
| storage for trees is made as small as possible. Initial |
| statistics and benchmarks showing the benefits of this approach |
| are documented at: |
| |
| L<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/> |
| |
| =head2 XAPIAN SHARDS |
| |
| Another scalability problem in v1 was the inability to |
| utilize multiple CPU cores for Xapian indexing. This is |
| addressed by using shards in Xapian to perform import |
| indexing in parallel. |
| |
| As with git alternates, Xapian natively supports a read-only |
| interface which transparently abstracts away the knowledge of |
| multiple shards. This allows us to simplify our read-only |
| code paths. |
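| |
| To make the idea of parallel indexing concrete, here is a small |
| sketch that spreads message numbers across shards so each shard can |
| be indexed by its own worker. The modulo assignment and the function |
| names are assumptions for illustration; the text above does not |
| specify how documents are assigned to shards. |

```python
def shard_for(num, nshards):
    """Pick a shard for a message number (assumed: simple modulo)."""
    return num % nshards


def partition(nums, nshards):
    """Group message numbers by shard, preserving input order."""
    shards = [[] for _ in range(nshards)]
    for num in nums:
        shards[shard_for(num, nshards)].append(num)
    return shards
```

| Each inner list could then be handed to a separate indexing process, |
| while readers continue to see one combined database. |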
| |
| The performance of the storage device is now the bottleneck on |
| larger multi-core systems. In our experience, performance is |
| improved with high-quality and high-quantity solid-state storage. |
| Issuing TRIM commands with L<fstrim(8)> was necessary to maintain |
| consistent performance while developing this feature. |
| |
| Rotational storage devices perform significantly worse than |
| solid state storage for indexing of large mail archives; but are |
| fine for backup and usable for small instances. |
| |
| As of public-inbox 1.6.0, the C<publicInbox.indexSequentialShard> |
| option of L<public-inbox-index(1)> may be used with a high shard |
| count to ensure individual shards fit into page cache when the entire |
| Xapian DB cannot. |
| |
| Our use of the L</OVERVIEW DB> requires Xapian document IDs to |
| remain stable. The L<public-inbox-compact(1)> and |
| L<public-inbox-xcpdb(1)> wrappers are recommended over the tools |
| provided by Xapian. |
| |
| =head2 OVERVIEW DB |
| |
| Towards the end of v2 development, it became apparent Xapian did |
| not perform well for sorting large result sets used to generate |
| the landing page in the PSGI UI (/$INBOX/) or many queries used |
| by the NNTP server. Thus, SQLite was employed and the Xapian |
| "skeleton" DB was renamed to the "overview" DB (after the NNTP |
| OVER/XOVER commands). |
| |
| The overview DB maintains all the header information necessary |
| to implement the NNTP OVER/XOVER commands and non-search |
| endpoints of the PSGI UI. |
| |
| Xapian has become completely optional for v2 (as it is for v1), but |
| SQLite remains required for v2. SQLite turns out to be powerful |
| enough to maintain overview information. Most of the PSGI and all |
| of the NNTP functionality is possible with only SQLite in addition |
| to git. |
| |
| The overview DB was an instrumental piece in maintaining near |
| constant-time read performance on a dataset 2-3 times larger |
| than LKML history as of 2018. |
| |
| =head3 GHOST MESSAGES |
| |
| The overview DB also includes references to "ghost" messages, |
| or messages which have replies but have not been seen by us. |
| Thus it is expected to have more rows than the "msgmap" DB |
| described below. |
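| |
| A minimal sketch of how ghosts arise, assuming each known message is |
| reduced to its Message-ID and the Message-IDs it references. The |
| function name and the dict-based input are hypothetical and only |
| serve this example. |

```python
def find_ghosts(messages):
    """messages maps each seen Message-ID to the Message-IDs it
    references; return the referenced ids that were never seen."""
    seen = set(messages)
    ghosts = set()
    for refs in messages.values():
        ghosts.update(ref for ref in refs if ref not in seen)
    return ghosts
```

| With two seen replies that both reference an unseen parent, this |
| yields one ghost, so a store tracking ghosts holds three entries |
| while one tracking only seen messages holds two. |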
| |
| =head2 msgmap.sqlite3 |
| |
| The SQLite msgmap DB is unchanged from v1, but it is now at the |
| top-level of the directory. |
| |
| =head1 OBJECT IDENTIFIERS |
| |
| There are three distinct types of identifiers. content_hash is |
| new in v2 and should make message removal and deduplication |
| easier. object_id and Message-ID are already known from v1. |
| |
| =over |
| |
| =item object_id |
| |
| The blob identifier git uses (currently SHA-1). No need to |
| publicly expose this outside of normal git ops (cloning) and |
| there's no need to make this searchable. As with v1 of |
| public-inbox, this is stored as part of the Xapian document so |
| expensive name lookups can be avoided for document retrieval. |
| |
| =item Message-ID |
| |
| The email header; duplicates allowed for archival purposes. |
| This remains a searchable field in Xapian. Note: it's possible |
| for emails to have multiple Message-ID headers (and L<git-send-email(1)> |
| had that bug for a bit); so we take all of them into account. |
| In case of conflicts detected by content_hash below, we generate a new |
| Message-ID based on content_hash; if the generated Message-ID still |
| conflicts, a random one is generated. |
| |
| =item content_hash |
| |
| A hash of relevant headers and raw body content for |
| purging of unwanted content. This is not stored anywhere, |
| but always calculated on-the-fly. |
| |
| For now, the relevant headers are: |
| |
| Subject, From, Date, References, In-Reply-To, To, Cc |
| |
| Received, List-Id, and similar headers are NOT part of content_hash as |
| they differ across lists and we will want removal to be able to cross |
| lists. |
| |
| The textual parts of the body are decoded, CRLF normalized to |
| LF, and trailing whitespace stripped. Notably, hashing the |
| raw body risks being broken by list signatures; but we can use |
| filters (e.g. PublicInbox::Filter::Vger) to clean the body for |
| imports. |
| |
| content_hash is SHA-256 for now, but since it is never stored it |
| can be changed at any time without making DB changes. |
| |
| =back |
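| |
| The content_hash description above can be sketched with Python's |
| standard email and hashlib modules. The way headers and body are fed |
| to the hash, skipping non-text parts, and stripping whitespace only |
| at the end of each part are assumptions of this example, so its |
| digests should not be expected to match those of public-inbox. |

```python
import hashlib
from email import message_from_bytes

# Headers the text above lists as relevant to content_hash.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def content_hash(raw):
    """Return a hex SHA-256 over the relevant headers and text body."""
    msg = message_from_bytes(raw)
    digest = hashlib.sha256()
    for name in RELEVANT_HEADERS:
        for value in msg.get_all(name, []):
            line = f"{name}: {str(value).strip()}\n"
            digest.update(line.encode("utf-8", "replace"))
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # simplification: only textual parts are hashed
        # decode=True undoes base64 / quoted-printable transfer encoding.
        body = part.get_payload(decode=True) or b""
        # Normalize CRLF to LF and drop trailing whitespace.
        digest.update(body.replace(b"\r\n", b"\n").rstrip())
    return digest.hexdigest()
```

| Two copies of a message that differ only in Received or List-Id |
| headers and in line endings hash identically here, which is the |
| property that lets removal work across lists. |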
| |
| =head1 LOCKING |
| |
| An exclusive L<flock(2)> lock is taken on the empty inbox.lock |
| file for all non-atomic operations. |
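| |
| A minimal sketch of this locking pattern using Python's fcntl |
| module on a Unix-like system. The context manager name is |
| hypothetical, and the sketch only shows the general flock usage, |
| not the code public-inbox runs. |

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def inbox_lock(inbox_dir):
    """Hold an exclusive flock on inbox_dir/inbox.lock for the block."""
    path = os.path.join(inbox_dir, "inbox.lock")
    # The lock file stays empty; only the lock on it matters.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield path
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

| A writer would wrap each non-atomic update in "with inbox_lock(dir):" |
| so that concurrent writers are serialized. |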
| |
| =head1 HEADERS |
| |
| Same handling as with v1, except the Message-ID header will |
| be generated if not provided or conflicting. "Bytes", "Lines" |
| and "Content-Length" headers are stripped and not allowed, as |
| they can interfere with further processing. |
| |
| The "Status" mbox header is also stripped as that header makes |
| no sense in a public archive. |
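| |
| The header stripping described above could look roughly like the |
| following, assuming Python's standard email module. The function |
| name is hypothetical, and Message-ID generation is left out of this |
| sketch. |

```python
from email import message_from_string

# Headers the text above says are stripped on import.
STRIPPED_HEADERS = ("Bytes", "Lines", "Content-Length", "Status")


def scrub_headers(raw):
    """Return the message text with the unwanted headers removed."""
    msg = message_from_string(raw)
    for name in STRIPPED_HEADERS:
        # Deleting removes every occurrence and is a no-op if absent.
        del msg[name]
    return msg.as_string()
```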
| |
| =head1 THANKS |
| |
| Thanks to the Linux Foundation for sponsoring the development |
| and testing of the v2 format. |
| |
| =head1 COPYRIGHT |
| |
| Copyright 2018-2021 all contributors L<mailto:meta@public-inbox.org> |
| |
| License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt> |
| |
| =head1 SEE ALSO |
| |
| L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)> |