 % public-inbox developer manual

 =head1 NAME

 public-inbox-v2-format - structure of public inbox v2 archives

 =head1 DESCRIPTION

 The v2 format is designed primarily to address several
 scalability problems of the original format described at
 L<public-inbox-v1-format(5)>.  It also handles messages with
 duplicate, multiple, or missing Message-IDs.

 =head1 INBOX LAYOUT

 The key change in v2 is that the inbox is no longer a single bare
 git repository, but a directory with two or more git repositories.
 v2 divides git repositories into time-based "epochs" and Xapian
 databases into "shards" for parallelism.

 =head2 INBOX OVERVIEW AND DEFINITIONS

   $EPOCH - Integer starting with 0 based on time
   $SCHEMA_VERSION - DB schema version (for Xapian)
   $SHARD - Integer starting with 0 based on parallelism

   foo/                              # "foo" is the name of the inbox
   - inbox.lock                      # lock file to protect global state
   - git/$EPOCH.git                  # normal git repositories
   - all.git                         # empty, alternates to $EPOCH.git
   - xap$SCHEMA_VERSION/$SHARD       # per-shard Xapian DB
   - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading
   - msgmap.sqlite3                  # same as the v1 msgmap

 For blob lookups, the reader only needs to open the "all.git"
 repository with $GIT_DIR/objects/info/alternates which references
 every $EPOCH.git repo.

 Individual $EPOCH.git repos DO NOT use alternates themselves, as
 git currently limits the nesting depth of alternates to 5.
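
 As an illustration of the layout above, the following Python sketch
 builds the lines one would expect in all.git/objects/info/alternates.
 The function name is hypothetical, and the use of paths relative to
 all.git/objects is an assumption made for this example, not a
 statement about what the implementation writes.

```python
from pathlib import PurePosixPath


def alternates_lines(epochs):
    """Return one alternates entry per epoch, lowest epoch first.

    Assumption: entries are relative to all.git/objects, so
    "../../git/N.git/objects" climbs out of all.git into the inbox
    directory and back down into the epoch repository.
    """
    return [
        str(PurePosixPath("..", "..", "git", f"{n}.git", "objects"))
        for n in sorted(epochs)
    ]


if __name__ == "__main__":
    # One line per epoch; every epoch is referenced directly from
    # all.git, so there is only a single level of alternates.
    print("\n".join(alternates_lines([0, 1, 2])))
```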

 =head2 GIT EPOCHS

 One of the inherent scalability problems with git itself is the
 full history of a project must be stored and carried around to
 all clients.  To address this problem, the v2 format uses
 multiple git repositories, stored as time-based "epochs".

 We currently divide epochs into segments of roughly one gigabyte
 each; this size may be made configurable in the future if needed.
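
 The rollover decision can be pictured with a small Python sketch.
 The threshold constant, the function name, and the idea of comparing
 against a measured repository size are all assumptions for
 illustration; how the writer actually measures an epoch is not
 described here.

```python
ROTATE_BYTES = 1024 ** 3  # assumed threshold: roughly one gigabyte


def epoch_for_next_write(current_epoch, current_size, rotate_bytes=ROTATE_BYTES):
    """Return the epoch number that should receive the next message.

    current_size is the (hypothetical) measured size in bytes of the
    newest epoch.  Older epochs are never written to again, which is
    what keeps their packs stable.
    """
    if current_size >= rotate_bytes:
        return current_epoch + 1
    return current_epoch
```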

 A pleasant side-effect of this design is the git packs of older
 epochs are stable, allowing them to be cloned without requiring
 expensive pack generation.  This also allows clients to clone
 only the epochs they are interested in to save bandwidth and
 storage.

 To minimize changes to existing v1-based code and simplify our
 code, we use the "alternates" mechanism described in
 L<gitrepository-layout(5)> to link all the epoch repositories
 with a single read-only "all.git" endpoint.

 Processes retrieve blobs via the "all.git" repository, while
 writers write blobs directly to epochs.

 =head2 GIT TREE LAYOUT

 One key problem specific to v1 was that large trees frequently
 hurt performance, as name lookups are expensive and unpredictable
 file names left limited deltafication opportunities.  As a result,
 all Xapian-enabled installations retrieve blob object_ids directly
 in v1, bypassing tree lookups.

 While dividing git repositories into epochs caps the growth of
 trees, worst-case tree size was still unnecessary overhead and
 worth eliminating.

 So in contrast to the big trees of v1, the v2 git tree contains
 only a single file at the top-level of the tree, either 'm' (for
 'mail' or 'message') or 'd' (for deleted).  A tree does not have
 'm' and 'd' at the same time.

 Mail is still stored in blobs (instead of inline with the commit
 object) as we still need a stable reference in the indices in
 case commit history is rewritten to comply with legal
 requirements.

 After-the-fact invocations of L<public-inbox-index(1)> will ignore
 messages written to 'd' after they are written to 'm'.
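
 A minimal Python sketch of that replay follows.  It models history
 as a list of (file name, blob id) pairs in commit order and matches
 deletions to messages by blob id; both the data model and the
 matching rule are assumptions for illustration, and the real indexer
 may match deleted messages differently.

```python
def live_blobs(history):
    """Return blob ids still considered present after replaying history.

    history: iterable of (name, blob_id) in commit order, where name is
    'm' for a stored message or 'd' for a deletion.
    """
    live = {}  # dict keeps insertion order, so results stay in commit order
    for name, blob_id in history:
        if name == "m":
            live[blob_id] = True
        elif name == "d":
            # A message written to 'd' after 'm' is dropped from the index.
            live.pop(blob_id, None)
        else:
            raise ValueError(f"unexpected tree entry: {name!r}")
    return list(live)
```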

 Deltafication is not significantly improved over v1, but overall
 storage for trees is made as small as possible.  Initial
 statistics and benchmarks showing the benefits of this approach
 are documented at:

 L<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>

 =head2 XAPIAN SHARDS

 Another scalability problem in v1 was the inability to
 utilize multiple CPU cores for Xapian indexing.  This is
 addressed by using shards in Xapian to perform import
 indexing in parallel.

 As with git alternates, Xapian natively supports a read-only
 interface which transparently abstracts away the knowledge of
 multiple shards.  This allows us to simplify our read-only
 code paths.
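
 To make the read-only abstraction concrete, the Python sketch below
 shows how a document ID in the combined view can be mapped to a
 shard and a shard-local document ID.  It assumes the interleaving
 scheme Xapian documents for multi-database searches; the function
 names are hypothetical and the code does not call Xapian itself.

```python
def combined_docid(shard, local_docid, nshards):
    """Map (0-based shard, 1-based local docid) to the combined docid."""
    return (local_docid - 1) * nshards + shard + 1


def split_docid(docid, nshards):
    """Inverse mapping: combined docid -> (shard, local docid)."""
    return (docid - 1) % nshards, (docid - 1) // nshards + 1


if __name__ == "__main__":
    # With 4 shards, consecutive combined docids rotate through shards.
    for docid in range(1, 9):
        print(docid, split_docid(docid, 4))
```

 Under this scheme the combined document ID depends on the shard
 count, which is one reason changing the number of shards needs care
 when document IDs must remain stable.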

 The performance of the storage device is now the bottleneck on
 larger multi-core systems.  In our experience, performance is
 improved with high-quality and high-quantity solid-state storage.
 Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
 consistent performance while developing this feature.

 Rotational storage devices perform significantly worse than
 solid-state storage for indexing large mail archives, but are
 fine for backup and usable for small instances.

 As of public-inbox 1.6.0, the C<publicInbox.indexSequentialShard>
 option of L<public-inbox-index(1)> may be used with a high shard
 count to ensure individual shards fit into page cache when the entire
 Xapian DB cannot.

 Our use of the L</OVERVIEW DB> requires Xapian document IDs to
 remain stable.  Using the L<public-inbox-compact(1)> and
 L<public-inbox-xcpdb(1)> wrappers is recommended over the tools
 provided by Xapian.

 =head2 OVERVIEW DB

 Towards the end of v2 development, it became apparent Xapian did
 not perform well for sorting large result sets used to generate
 the landing page in the PSGI UI (/$INBOX/) or many queries used
 by the NNTP server.  Thus, SQLite was employed and the Xapian
 "skeleton" DB was renamed to the "overview" DB (after the NNTP
 OVER/XOVER commands).

 The overview DB maintains all the header information necessary
 to implement the NNTP OVER/XOVER commands and non-search
 endpoints of the PSGI UI.

 Xapian has become completely optional for v2 (as it is for v1), but
 SQLite remains required for v2.  SQLite turns out to be powerful
 enough to maintain overview information.  Most of the PSGI and all
 of the NNTP functionality is possible with only SQLite in addition
 to git.

 The overview DB was an instrumental piece in maintaining near
 constant-time read performance on a dataset 2-3 times larger
 than LKML history as of 2018.

 =head3 GHOST MESSAGES

 The overview DB also includes references to "ghost" messages,
 or messages which have replies but have not been seen by us.
 Thus it is expected to have more rows than the "msgmap" DB
 described below.
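
 The notion of a ghost can be sketched in a few lines of Python.
 The input here is a plain dictionary of Message-IDs to the
 Message-IDs they reference, which is a simplification for
 illustration and not the overview DB schema.

```python
def find_ghosts(messages):
    """Return Message-IDs that are referenced but were never seen.

    messages: dict mapping each seen Message-ID to the list of
    Message-IDs found in its References/In-Reply-To headers.
    """
    seen = set(messages)
    ghosts = set()
    for refs in messages.values():
        ghosts.update(ref for ref in refs if ref not in seen)
    return ghosts
```

 Each ghost would occupy a row alongside the seen messages, which is
 why the row count can exceed that of msgmap.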

 =head2 msgmap.sqlite3

 The SQLite msgmap DB is unchanged from v1, but it is now at the
 top-level of the directory.

 =head1 OBJECT IDENTIFIERS

 There are three distinct types of identifiers.  content_hash is
 new in v2 and should make message removal and deduplication
 easier.  object_id and Message-ID are already known from v1.

 =over

 =item object_id

 The blob identifier git uses (currently SHA-1).  No need to
 publicly expose this outside of normal git ops (cloning) and
 there's no need to make this searchable.  As with v1 of
 public-inbox, this is stored as part of the Xapian document so
 expensive name lookups can be avoided for document retrieval.

 =item Message-ID

 The email header; duplicates allowed for archival purposes.
 This remains a searchable field in Xapian.  Note: it's possible
 for emails to have multiple Message-ID headers (and L<git-send-email(1)>
 had that bug for a bit); so we take all of them into account.
 In case of conflicts detected by content_hash below, we generate a new
 Message-ID based on content_hash; if the generated Message-ID still
 conflicts, a random one is generated.

 =item content_hash

 A hash of relevant headers and raw body content for
 purging of unwanted content.  This is not stored anywhere,
 but always calculated on-the-fly.

 For now, the relevant headers are:

 	Subject, From, Date, References, In-Reply-To, To, Cc

 Received, List-Id, and similar headers are NOT part of content_hash as
 they differ across lists and we will want removal to be able to cross
 lists.

 The textual parts of the body are decoded, CRLF normalized to
 LF, and trailing whitespace stripped.  Notably, hashing the
 raw body risks being broken by list signatures; but we can use
 filters (e.g. PublicInbox::Filter::Vger) to clean the body for
 imports.

 content_hash is SHA-256 for now; but can be changed at any time
 without making DB changes.

 =back
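
 The following Python sketch illustrates the idea behind content_hash
 for a simple, non-multipart message.  It is not the implementation:
 the exact byte layout fed to the hash, the handling of MIME parts,
 and reading "trailing whitespace stripped" as stripping the end of
 the body are all assumptions made for this example.

```python
import hashlib
from email import message_from_string

# Headers listed above as relevant; Received, List-Id, etc. are ignored.
HEADERS = ("Subject", "From", "Date", "References", "In-Reply-To", "To", "Cc")


def content_hash(raw):
    """Hash the relevant headers and the normalized body with SHA-256."""
    msg = message_from_string(raw)
    digest = hashlib.sha256()
    for name in HEADERS:
        for value in msg.get_all(name, []):
            digest.update(f"{name}: {value}\n".encode("utf-8"))
    body = msg.get_payload(decode=True) or b""  # None for multipart
    text = body.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n").rstrip()  # CRLF -> LF, trim the end
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()
```

 With this sketch, two copies of a message that differ only in
 Received or List-Id headers, or in CRLF versus LF line endings,
 produce the same digest, which is the property that lets removal
 work across lists.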

 =head1 LOCKING

 L<flock(2)> locking exclusively locks the empty inbox.lock file
 for all non-atomic operations.
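
 A writer following this protocol might take the lock as in the
 Python sketch below, which uses the standard fcntl module.  The
 helper name is hypothetical; only the inbox.lock file name and the
 use of an exclusive flock come from the description above.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def inbox_lock(inbox_dir):
    """Hold an exclusive flock(2) on inbox.lock for the duration of a block."""
    path = os.path.join(inbox_dir, "inbox.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

 Usage would look like C<with inbox_lock("/path/to/foo"): ...>, with
 the non-atomic work done inside the block.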

 =head1 HEADERS

 Same handling as with v1, except the Message-ID header will
 be generated if not provided or conflicting.  "Bytes", "Lines"
 and "Content-Length" headers are stripped and not allowed, as
 they can interfere with further processing.

 The "Status" mbox header is also stripped as that header makes
 no sense in a public archive.
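
 The stripping step can be sketched in Python with the standard email
 package.  This is an illustration of the rule only; the function
 name is hypothetical and Message-ID generation is left out.

```python
from email import message_from_string

# Headers named above as stripped on import.
STRIPPED = ("Bytes", "Lines", "Content-Length", "Status")


def strip_headers(raw):
    """Return the message text with the disallowed headers removed."""
    msg = message_from_string(raw)
    for name in STRIPPED:
        del msg[name]  # removes every occurrence; no error if absent
    return msg.as_string()
```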

 =head1 THANKS

 Thanks to the Linux Foundation for sponsoring the development
 and testing of the v2 format.

 =head1 COPYRIGHT

 Copyright 2018-2021 all contributors L<mailto:meta@public-inbox.org>

 License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>

 =head1 SEE ALSO

 L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)>