=head1 NAME
public-inbox-extindex - create and update external search indices
=head1 SYNOPSIS
public-inbox-extindex [OPTIONS] EXTINDEX_DIR INBOX_DIR...
public-inbox-extindex [OPTIONS] [EXTINDEX_DIR] --all
=head1 DESCRIPTION
public-inbox-extindex creates and updates an external search and
overview database used by the read-only public-inbox PSGI (HTTP),
NNTP, and IMAP interfaces. This requires either the
L<Xapian> SWIG bindings or the L<Search::Xapian> XS bindings,
along with the L<DBD::SQLite> and L<DBI> Perl modules.
=head1 OPTIONS
=over
=item -j JOBS
=item --jobs=JOBS
=item --no-fsync
=item --dangerous
=item --rethread
=item --max-size SIZE
=item --batch-size SIZE
These switches behave as they do for L<public-inbox-index(1)>.
=item --all
Index all C<publicinbox> entries in C<PI_CONFIG>.
C<publicinbox> entries indexed by C<public-inbox-extindex> can
have full Xapian searching abilities with the per-C<publicinbox>
C<indexlevel> set to C<basic> and their respective Xapian
(C<xap15> or C<xapian15>) directories removed. For multiple
public-inboxes where cross-posting is common, this allows
significant space savings on Xapian indices.
=item --dedupe=MSGID
=item --dedupe
Rerun deduplication on messages with the given Message-ID or
all messages if no Message-ID is specified. Deduplication rules may
change and evolve over time, especially if filters are involved.
C<--dedupe=MSGID> may be specified multiple times to deduplicate
multiple Message-IDs.
Use this if you see C<W: BUG? $MSGID not deduplicated properly>
warnings from WWW logs.
=item --gc
Perform garbage collection instead of indexing. Use this if
inboxes are removed from the extindex, a newsgroup name is
set or changed, or if messages are purged or removed from
some inboxes.
=item --reindex
Forces a re-index of all messages in the extindex. This can be
used for in-place upgrades and bugfixes while read-only server
processes are utilizing the index. Keep in mind this roughly
doubles the size of the already-large Xapian database.
=item --fast
Used with C<--reindex>, it will only look for new and stale
entries and not touch already-indexed messages.
=item --no-multi-pack-index
Disable writing a L<git-multi-pack-index(1)> file to save memory.
Normally, enabling multi-pack-index speeds up startup time of
subsequent L<git-cat-file(1)> processes by 3-4%, but generating
this file requires several GB of memory with large repos.
Unlike the C<core.multiPackIndex> directive in git, it's still
possible to read existing multi-pack-index files if they are
created elsewhere.
Available in public-inbox 2.0.0+.
=back
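As a rough sketch of how the switches above might be combined for
routine maintenance, the following script prints (or, with C<RUN=1>,
executes) a few invocations in the second SYNOPSIS form. The
directory, the Message-ID, and the C<run> helper are hypothetical
placeholders, and passing the Message-ID without angle brackets is
an assumption.

```shell
#!/bin/sh
# Hypothetical path and Message-ID; substitute real values.
EXTINDEX_DIR=/path/to/extindex_dir
MSGID='20200101000000.GA1234@example.com'

# Illustrative helper: print each command by default,
# execute it only when RUN=1 is set in the environment.
run() {
	if [ "${RUN:-0}" = 1 ]; then
		"$@"
	else
		printf '%s\n' "$*"
	fi
}

# Routine update covering every configured inbox.
run public-inbox-extindex "$EXTINDEX_DIR" --all

# After removing an inbox or purging messages: garbage-collect.
run public-inbox-extindex --gc "$EXTINDEX_DIR" --all

# Rerun deduplication for one Message-ID named in a warning.
run public-inbox-extindex --dedupe="$MSGID" "$EXTINDEX_DIR" --all

# Look only for new and stale entries, leaving indexed messages alone.
run public-inbox-extindex --reindex --fast "$EXTINDEX_DIR" --all
```

Printing by default makes it possible to review the commands before
letting them touch a large index.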
=head1 FILES
L<public-inbox-extindex-format(5)>
=head1 CONFIGURATION
public-inbox-extindex does not write to the L<public-inbox-config(5)>
file; C<extindex> entries must be added manually.
The extindex name of C<all> is a special case which
corresponds to indexing C<--all> inboxes. An example for
C<--all> is as follows:
	[extindex "all"]
		topdir = /path/to/extindex_dir
		url = all
		coderepo = foo
		coderepo = bar
Putting an C<extindex> entry in the config allows L<PublicInbox::WWW> to use it.
You can have any number of C<extindex.$NAME> sections where C<$NAME>
is something other than C<all> to display a union of several inboxes.
It is strongly recommended any public inboxes indexed by this command
have a stable C<publicinbox.$NAME.newsgroup> entry (regardless of
the presence of an NNTP or IMAP server). Otherwise, public-inbox-extindex
will use C<publicinbox.$NAME.inboxdir> as an internal key which can
cause needless reindexing and require C<--gc> if inboxes are relocated.
See L<public-inbox-config(5)> for more details.
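Since the config file uses the L<git-config(1)> format, one way to
add these entries by hand is with C<git config -f>. The sketch below
writes to a scratch file instead of the real config; the inbox name
C<example>, its paths, address, and newsgroup are hypothetical, and
C<indexlevel = basic> reflects the space-saving setup described
under C<--all>.

```shell
#!/bin/sh
# Write to a scratch file so the real config is left untouched.
cfg=$(mktemp)

# The special "all" extindex from the example in this section.
git config -f "$cfg" extindex.all.topdir /path/to/extindex_dir
git config -f "$cfg" extindex.all.url all

# A hypothetical inbox with a stable newsgroup name and
# per-inbox Xapian indexing reduced to "basic".
git config -f "$cfg" publicinbox.example.inboxdir /path/to/example.git
git config -f "$cfg" publicinbox.example.address example@example.com
git config -f "$cfg" publicinbox.example.newsgroup inbox.example
git config -f "$cfg" publicinbox.example.indexlevel basic

# Show the resulting file for review.
cat "$cfg"
```

Once the output looks right, the same commands can be pointed at the
file named by C<PI_CONFIG>.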
=head1 ENVIRONMENT
=over 8
=item PI_CONFIG
Used to override the default "~/.public-inbox/config" value.
=item XAPIAN_FLUSH_THRESHOLD
The number of documents to update before committing changes to
disk. This environment variable is handled directly by Xapian; refer
to the Xapian API documentation for more details.
Setting C<XAPIAN_FLUSH_THRESHOLD> or
C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
L<public-inbox-watch(1)> tasks to wait for long and unpredictable
periods of time during C<--reindex>.
Default: none, uses C<publicinbox.indexBatchSize>
=back
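A minimal sketch of setting both variables for a single run follows.
The config path and the threshold value are arbitrary placeholders,
not recommendations, and the guard around the command is only there
so the script does nothing on a machine without public-inbox or
without that config file.

```shell
#!/bin/sh
# Hypothetical values; substitute your own.
PI_CONFIG=/path/to/alternate/config
XAPIAN_FLUSH_THRESHOLD=10000
export PI_CONFIG XAPIAN_FLUSH_THRESHOLD

# Run only when the command and the config file are both present.
if command -v public-inbox-extindex >/dev/null 2>&1 && [ -f "$PI_CONFIG" ]; then
	public-inbox-extindex --all
else
	echo "skipping: public-inbox-extindex or $PI_CONFIG not found" >&2
fi
```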
=head1 UPGRADING
Occasionally, public-inbox will update its schema version and
require a full index by running this command.
=head1 LOCKING
It is safe to use C<--dedupe>, C<--gc> and C<--reindex> while
other processes are writing to covered inboxes or extindex.
The extindex locks will be released roughly every 10s to
allow L<public-inbox-mda(1)> and L<public-inbox-watch(1)>
processes to write to the extindex.
=head1 CONTACT
Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>.
The mail archives are hosted at L<https://public-inbox.org/meta/> and
L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>.
=head1 COPYRIGHT
Copyright all contributors L<mailto:meta@public-inbox.org>
License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
=head1 SEE ALSO
L<Search::Xapian>, L<DBD::SQLite>