| % public-inbox developer manual |
| |
| =head1 NAME |
| |
| public-inbox-v1-format - git repository and tree description (aka "ssoma") |
| |
| =head1 DESCRIPTION |
| |
| WARNING: this does NOT describe the scalable v2 format used |
| by public-inbox. Use of ssoma is not recommended for new |
| installations due to scalability problems. |
| |
| ssoma uses a git repository to store each email as a git blob. |
| The tree filename of the blob is based on the SHA1 hexdigest of |
| the first Message-ID header. A commit is made for each message |
| delivered. The commit SHA-1 identifier is used by ssoma clients |
| to track synchronization state. |
| |
| =head1 PATHNAMES IN TREES |
| |
| A Message-ID may be extremely long and also contain slashes, so using |
| them as a path name is challenging. Instead we use the SHA-1 hexdigest |
| of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>") |
| to generate a path name. Leading and trailing white space in the |
| Message-ID header is ignored for hashing. |
| |
| A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt> |
| |
| Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63 |
| |
| Thus it is easy to look up the contents of a message matching a given |
| a Message-ID. |
| |
| =head1 MESSAGE-ID CONFLICTS |
| |
| public-inbox v1 repositories currently do not resolve conflicting |
| Message-IDs or messages with multiple Message-IDs. |
| |
| =head1 HEADERS |
| |
| The Message-ID header is required. |
| "Bytes", "Lines" and "Content-Length" headers are stripped and not |
| allowed, they can interfere with further processing. |
| When using ssoma with public-inbox-mda, the "Status" mbox header |
| is also stripped as that header makes no sense in a public archive. |
| |
| =head1 LOCKING |
| |
| L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file |
| for all non-atomic operations. |
| |
| =head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA) |
| |
| 1. Message is delivered to a mail transport agent (MTA) |
| |
| 1a. (optional) reject/discard spam, this should run before ssoma-mda |
| |
| 1b. (optional) reject/strip unwanted attachments |
| |
| ssoma-mda handles all steps once invoked. |
| |
| 2. Mail transport agent invokes ssoma-mda |
| |
| 3. reads message via stdin, extracting Message-ID |
| |
| 4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock |
| |
| 5. creates or updates the blob of associated 2/38 SHA-1 path |
| |
| 6. updates the index and commits |
| |
| 7. releases $GIT_DIR/ssoma.lock |
| |
| ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs, |
| and the ability to monitor IMAP mailboxes using IDLE will be available |
| in the future. |
| |
| =head1 GIT REPOSITORIES (SERVERS) |
| |
| ssoma uses bare git repositories on both servers and clients. |
| |
| Using the L<git-init(1)> command with --bare is the recommend method |
| of creating a git repository on a server: |
| |
| git init --bare /path/to/wherever/you/want.git |
| |
| There are no standardized paths for servers, administrators make |
| all the choices regarding git repository locations. |
| |
| Special files in $GIT_DIR on the server: |
| |
| =over |
| |
| =item $GIT_DIR/ssoma.lock |
| |
| An empty file for L<flock(2)> locking. |
| This is necessary to ensure the index and commits are updated |
| consistently and multiple processes running MDA do not step on |
| each other. |
| |
| =item $GIT_DIR/public-inbox/msgmap.sqlite3 |
| |
| SQLite3 database maintaining a stable mapping of Message-IDs to NNTP |
| article numbers. Used by L<public-inbox-nntpd(1)> and created |
| and updated by L<public-inbox-index(1)>. |
| |
| Users of the L<PublicInbox::WWW> interface will find it |
| useful for attempting recovery from copy-paste truncations of |
| URLs containing long Message-IDs. |
| |
| Automatically updated by L<public-inbox-mda(1)>, |
| L<public-inbox-learn(1)> and L<public-inbox-watch(1)>. |
| |
| Losing or damaging this file will cause synchronization problems for |
| NNTP clients. This file is expected to be stable and require no |
| updates to its schema. |
| |
| Requires L<DBD::SQLite>. |
| |
| =item $GIT_DIR/public-inbox/xapian$N/ |
| |
| Xapian database for search indices in the PSGI web UI. |
| |
| $N is the value of PublicInbox::Search::SCHEMA_VERSION, and |
| installations may have parallel versions on disk during upgrades |
| or to roll-back upgrades. |
| |
| This is created and updated by L<public-inbox-index(1)>. |
| |
| Automatically updated by L<public-inbox-mda(1)>, |
| L<public-inbox-learn(1)> and L<public-inbox-watch(1)>. |
| |
| This directory can always be regenerated with L<public-inbox-index(1)>. |
| If lost or damaged, there is no need to back it up unless the |
| CPU/memory cost of regenerating it outweighs the storage/transfer cost. |
| |
| Since SCHEMA_VERSION 15 and the development of the v2 format, |
| the "overview" DB also exists in the xapian directory for v1 |
| repositories. See L<public-inbox-v2-format(5)/OVERVIEW DB> |
| |
| Our use of the L</OVERVIEW DB> requires Xapian document IDs to |
| remain stable. Using L<public-inbox-compact(1)> and |
| L<public-inbox-xcpdb(1)> wrappers are recommended over tools |
| provided by Xapian. |
| |
| This directory is large, often two to three times the size of |
| the objects stored in a packed git repository. |
| |
| =item $GIT_DIR/ssoma.index |
| |
| This file is no longer used or created by public-inbox, but it is |
| updated if it exists to remain compatible with ssoma installations. |
| |
| A git index file used for MDA updates. The normal git index (in |
| $GIT_DIR/index) is not used at all as there is typically no working |
| tree. |
| |
| =back |
| |
| Each client $GIT_DIR may have multiple mbox/maildir/command targets. |
| It is possible for a client to extract the mail stored in the git |
| repository to multiple mboxes for compatibility with a variety of |
| different tools. |
| |
| =head1 CAVEATS |
| |
| It is NOT recommended to check out the working directory of a git. |
| there may be many files. |
| |
| It is impossible to completely expunge messages, even spam, as git |
| retains full history. Projects may (with adequate notice) cycle to new |
| repositories/branches with history cleaned up via L<git-filter-repo(1)> |
| or L<git-filter-branch(1)>. |
| This is up to the administrators. |
| |
| =head1 COPYRIGHT |
| |
| Copyright 2013-2021 all contributors L<mailto:meta@public-inbox.org> |
| |
| License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt> |
| |
| =head1 SEE ALSO |
| |
| L<gitrepository-layout(5)>, L<ssoma(1)> |