| =head1 NAME |
| |
| public-inbox-extindex - create and update external search indices |
| |
| =head1 SYNOPSIS |
| |
| public-inbox-extindex [OPTIONS] EXTINDEX_DIR INBOX_DIR... |
| |
| public-inbox-extindex [OPTIONS] [EXTINDEX_DIR] --all |
| |
| =head1 DESCRIPTION |
| |
| public-inbox-extindex creates and updates an external search and |
| overview database used by the read-only public-inbox PSGI (HTTP), |
| NNTP, and IMAP interfaces. This requires either the |
| L<Xapian> SWIG bindings or L<Search::Xapian> XS bindings |
| along with L<DBD::SQLite> and L<DBI> Perl modules. |
| |
| =head1 OPTIONS |
| |
| =over |
| |
| =item -j JOBS |
| |
| =item --jobs=JOBS |
| |
| =item --no-fsync |
| |
| =item --dangerous |
| |
| =item --rethread |
| |
| =item --max-size SIZE |
| |
| =item --batch-size SIZE |
| |
| These switches behave as they do for L<public-inbox-index(1)>. |
| |
| =item --all |
| |
| Index all C<publicinbox> entries in C<PI_CONFIG>. |
| |
| C<publicinbox> entries indexed by C<public-inbox-extindex> can |
| have full Xapian searching abilities with the per-C<publicinbox> |
| C<indexlevel> set to C<basic> and their respective Xapian |
| (C<xap15> or C<xapian15>) directories removed. For multiple |
| public-inboxes where cross-posting is common, this allows |
| significant space savings on Xapian indices. |
| |
| =item --dedupe=MSGID |
| |
| =item --dedupe |
| |
| Rerun deduplication on messages with the given Message-ID, or on |
| all messages if no Message-ID is specified. Deduplication rules may |
| change and evolve over time, especially if filters are involved. |
| |
| C<--dedupe=MSGID> may be specified multiple times to deduplicate |
| multiple Message-IDs. |
| |
| Use this if you see C<W: BUG? $MSGID not deduplicated properly> |
| warnings from WWW logs. |
| |
| =item --gc |
| |
| Perform garbage collection instead of indexing. Use this if |
| inboxes are removed from the extindex, a newsgroup name is |
| set or changed, or if messages are purged or removed from |
| some inboxes. |
| |
| =item --reindex |
| |
| Forces a re-index of all messages in the extindex. This can be |
| used for in-place upgrades and bugfixes while read-only server |
| processes are utilizing the index. Keep in mind this roughly |
| doubles the size of the already-large Xapian database. |
| |
| =item --fast |
| |
| Used with C<--reindex>, it will only look for new and stale |
| entries and not touch already-indexed messages. |
| |
| =item --no-multi-pack-index |
| |
| Disable writing a L<git-multi-pack-index(1)> file to save memory. |
| Normally, enabling multi-pack-index speeds up startup time of |
| subsequent L<git-cat-file(1)> processes by 3-4%, but generating |
| this file requires several GB of memory with large repos. |
| |
| Unlike the C<core.multiPackIndex> directive in git, it's still |
| possible to read existing multi-pack-index files if they are |
| created elsewhere. |
| |
| Available in public-inbox 2.0.0+ |
| |
| =back |
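| |
| The switches above can be combined as in the following shell sketch. |
| It is illustrative only: the C<run> helper, the C<DRY_RUN> variable, |
| the paths and the Message-ID are hypothetical placeholders, and only |
| switches documented in this manual are used. With C<DRY_RUN=1> (the |
| default in this sketch), each command is printed rather than executed. |
| |

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Placeholder values, not taken from any real setup.
EXTINDEX_DIR=/path/to/extindex_dir
MSGID='20200101000000.1234@example.com'

# Index two specific inboxes into one extindex (first SYNOPSIS form).
run public-inbox-extindex "$EXTINDEX_DIR" /path/to/inbox1 /path/to/inbox2

# Index every publicinbox entry in PI_CONFIG (second SYNOPSIS form).
run public-inbox-extindex --all

# Rerun deduplication for a single Message-ID.
run public-inbox-extindex --dedupe="$MSGID" --all

# Garbage-collect after inboxes or messages were removed.
run public-inbox-extindex --gc --all

# Reindex, looking only for new and stale entries.
run public-inbox-extindex --reindex --fast --all
```

| |
| Running the sketch with C<DRY_RUN=0> executes the commands instead of |
| printing them; review the printed commands against your own paths first. |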
| |
| =head1 FILES |
| |
| L<public-inbox-extindex-format(5)> |
| |
| =head1 CONFIGURATION |
| |
| public-inbox-extindex does not write to the L<public-inbox-config(5)> |
| file; configuration must be entered manually. |
| The extindex name of C<all> is a special case which |
| corresponds to indexing C<--all> inboxes. An example for |
| C<--all> is as follows: |
| |
|     [extindex "all"] |
|         topdir = /path/to/extindex_dir |
|         url = all |
|         coderepo = foo |
|         coderepo = bar |
| |
| Putting an C<extindex> entry in the config allows L<PublicInbox::WWW> |
| to use the extindex. You can have any number of C<extindex.$NAME> |
| sections where C<$NAME> is something other than C<all> to display a |
| union of several inboxes. |
| |
| It is strongly recommended that any public inboxes indexed by this |
| command have a stable C<publicinbox.$NAME.newsgroup> entry (regardless of |
| the presence of an NNTP or IMAP server). Otherwise, public-inbox-extindex |
| will use C<publicinbox.$NAME.inboxdir> as an internal key which can |
| cause needless reindexing and require C<--gc> if inboxes are relocated. |
| |
| See L<public-inbox-config(5)> for more details. |
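| |
| Since the config file uses L<git-config(1)> syntax, the entries |
| described in this section can also be written with C<git config> |
| rather than a text editor. The following is a minimal sketch under |
| that assumption: it writes to a scratch file instead of the real |
| config, and the inbox name C<example> and newsgroup |
| C<inbox.comp.example> are hypothetical. |
| |

```shell
#!/bin/sh
# Write to a scratch file so the real config stays untouched; point
# cfg at your actual PI_CONFIG file once the result looks right.
cfg=$(mktemp)

# The [extindex "all"] section from the example in this section.
git config -f "$cfg" extindex.all.topdir /path/to/extindex_dir
git config -f "$cfg" extindex.all.url all
# --add appends a value instead of replacing the existing one,
# which is needed for multi-valued keys such as coderepo.
git config -f "$cfg" --add extindex.all.coderepo foo
git config -f "$cfg" --add extindex.all.coderepo bar

# A stable newsgroup name for a hypothetical inbox named "example".
git config -f "$cfg" publicinbox.example.newsgroup inbox.comp.example

# Show the resulting file for review.
cat "$cfg"
```

| |
| Reviewing the printed file before copying it into place keeps a |
| mistyped key from silently creating an unused section. |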
| |
| =head1 ENVIRONMENT |
| |
| =over 8 |
| |
| =item PI_CONFIG |
| |
| Used to override the default "~/.public-inbox/config" value. |
| |
| =item XAPIAN_FLUSH_THRESHOLD |
| |
| The number of documents to update before committing changes to |
| disk. This environment variable is handled directly by Xapian; refer |
| to the Xapian API documentation for more details. |
| |
| Setting C<XAPIAN_FLUSH_THRESHOLD> or |
| C<publicinbox.indexBatchSize> for a large C<--reindex> may cause |
| L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and |
| L<public-inbox-watch(1)> tasks to wait for long and unpredictable |
| periods of time during C<--reindex>. |
| |
| Default: none, uses C<publicinbox.indexBatchSize> |
| |
| =back |
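| |
| Both variables can be set for a single invocation without changing |
| the surrounding shell environment. The sketch below is illustrative: |
| the C<run> helper and C<DRY_RUN> variable are hypothetical, and the |
| config path and threshold value are placeholders, not recommendations. |
| As noted above, a larger threshold may make other writers wait longer. |
| |

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Placeholder values; adjust to your setup.
ALT_CONFIG=/path/to/alternate/config
FLUSH_THRESHOLD=10000

# env(1) applies the variables to this one command only.
run env PI_CONFIG="$ALT_CONFIG" XAPIAN_FLUSH_THRESHOLD="$FLUSH_THRESHOLD" \
    public-inbox-extindex --all
```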
| |
| =head1 UPGRADING |
| |
| Occasionally, public-inbox will update its schema version and |
| require a full index by running this command. |
| |
| =head1 LOCKING |
| |
| It is safe to use C<--dedupe>, C<--gc> and C<--reindex> while |
| other processes are writing to covered inboxes or the extindex. |
| The extindex locks will be released roughly every 10s to |
| allow L<public-inbox-mda(1)> and L<public-inbox-watch(1)> |
| processes to write to the extindex. |
| |
| =head1 CONTACT |
| |
| Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org> |
| |
| The mail archives are hosted at L<https://public-inbox.org/meta/> and |
| L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/> |
| |
| =head1 COPYRIGHT |
| |
| Copyright all contributors L<mailto:meta@public-inbox.org> |
| |
| License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt> |
| |
| =head1 SEE ALSO |
| |
| L<Search::Xapian>, L<DBD::SQLite> |