blob: db223fd9388b090557c17d9db1d20cf2b7203673 [file] [log] [blame]
% public-inbox developer manual
=head1 NAME
public-inbox-v1-format - git repository and tree description (aka "ssoma")
=head1 DESCRIPTION
WARNING: this does NOT describe the scalable v2 format used
by public-inbox. Use of ssoma is not recommended for new
installations due to scalability problems.
ssoma uses a git repository to store each email as a git blob.
The tree filename of the blob is based on the SHA1 hexdigest of
the first Message-ID header. A commit is made for each message
delivered. The commit SHA-1 identifier is used by ssoma clients
to track synchronization state.
=head1 PATHNAMES IN TREES
A Message-ID may be extremely long and also contain slashes, so using
them as a path name is challenging. Instead we use the SHA-1 hexdigest
of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
to generate a path name. Leading and trailing white space in the
Message-ID header is ignored for hashing.
A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
Thus it is easy to look up the contents of a message matching a given
a Message-ID.
=head1 MESSAGE-ID CONFLICTS
public-inbox v1 repositories currently do not resolve conflicting
Message-IDs or messages with multiple Message-IDs.
=head1 HEADERS
The Message-ID header is required.
"Bytes", "Lines" and "Content-Length" headers are stripped and not
allowed, they can interfere with further processing.
When using ssoma with public-inbox-mda, the "Status" mbox header
is also stripped as that header makes no sense in a public archive.
=head1 LOCKING
L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
for all non-atomic operations.
=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
1. Message is delivered to a mail transport agent (MTA)
1a. (optional) reject/discard spam, this should run before ssoma-mda
1b. (optional) reject/strip unwanted attachments
ssoma-mda handles all steps once invoked.
2. Mail transport agent invokes ssoma-mda
3. reads message via stdin, extracting Message-ID
4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
5. creates or updates the blob of associated 2/38 SHA-1 path
6. updates the index and commits
7. releases $GIT_DIR/ssoma.lock
ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
and the ability to monitor IMAP mailboxes using IDLE will be available
in the future.
=head1 GIT REPOSITORIES (SERVERS)
ssoma uses bare git repositories on both servers and clients.
Using the L<git-init(1)> command with --bare is the recommend method
of creating a git repository on a server:
git init --bare /path/to/wherever/you/want.git
There are no standardized paths for servers, administrators make
all the choices regarding git repository locations.
Special files in $GIT_DIR on the server:
=over
=item $GIT_DIR/ssoma.lock
An empty file for L<flock(2)> locking.
This is necessary to ensure the index and commits are updated
consistently and multiple processes running MDA do not step on
each other.
=item $GIT_DIR/public-inbox/msgmap.sqlite3
SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
article numbers. Used by L<public-inbox-nntpd(1)> and created
and updated by L<public-inbox-index(1)>.
Users of the L<PublicInbox::WWW> interface will find it
useful for attempting recovery from copy-paste truncations of
URLs containing long Message-IDs.
Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
Losing or damaging this file will cause synchronization problems for
NNTP clients. This file is expected to be stable and require no
updates to its schema.
Requires L<DBD::SQLite>.
=item $GIT_DIR/public-inbox/xapian$N/
Xapian database for search indices in the PSGI web UI.
$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
installations may have parallel versions on disk during upgrades
or to roll-back upgrades.
This is created and updated by L<public-inbox-index(1)>.
Automatically updated by L<public-inbox-mda(1)>,
L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
This directory can always be regenerated with L<public-inbox-index(1)>.
If lost or damaged, there is no need to back it up unless the
CPU/memory cost of regenerating it outweighs the storage/transfer cost.
Since SCHEMA_VERSION 15 and the development of the v2 format,
the "overview" DB also exists in the xapian directory for v1
repositories. See L<public-inbox-v2-format(5)/OVERVIEW DB>
Our use of the L</OVERVIEW DB> requires Xapian document IDs to
remain stable. Using L<public-inbox-compact(1)> and
L<public-inbox-xcpdb(1)> wrappers are recommended over tools
provided by Xapian.
This directory is large, often two to three times the size of
the objects stored in a packed git repository.
=item $GIT_DIR/ssoma.index
This file is no longer used or created by public-inbox, but it is
updated if it exists to remain compatible with ssoma installations.
A git index file used for MDA updates. The normal git index (in
$GIT_DIR/index) is not used at all as there is typically no working
tree.
=back
Each client $GIT_DIR may have multiple mbox/maildir/command targets.
It is possible for a client to extract the mail stored in the git
repository to multiple mboxes for compatibility with a variety of
different tools.
=head1 CAVEATS
It is NOT recommended to check out the working directory of a git.
there may be many files.
It is impossible to completely expunge messages, even spam, as git
retains full history. Projects may (with adequate notice) cycle to new
repositories/branches with history cleaned up via L<git-filter-repo(1)>
or L<git-filter-branch(1)>.
This is up to the administrators.
=head1 COPYRIGHT
Copyright 2013-2021 all contributors L<mailto:meta@public-inbox.org>
License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
=head1 SEE ALSO
L<gitrepository-layout(5)>, L<ssoma(1)>