| =head1 NAME |
| |
| public-inbox-index - create and update search indices |
| |
| =head1 SYNOPSIS |
| |
| public-inbox-index [OPTIONS] INBOX_DIR... |
| |
| public-inbox-index [OPTIONS] --all |
| |
| =head1 DESCRIPTION |
| |
| public-inbox-index creates and updates the search, overview and |
| NNTP article number database used by the read-only public-inbox |
| HTTP and NNTP interfaces. Currently, this requires |
| L<DBD::SQLite> and L<DBI> Perl modules. L<Xapian> (or L<Search::Xapian>) |
| are optional, only to support the PSGI search interface. |
| |
| Once the initial indices are created by public-inbox-index, |
| L<public-inbox-mda(1)> and L<public-inbox-watch(1)> will |
| automatically maintain them. |
| |
| Running this manually to update indices is only required if |
| relying on L<git-fetch(1)> to mirror an existing public-inbox; |
| or if upgrading to a new version of public-inbox using |
| the C<--reindex> option. |
| |
| Having the overview and article number database is essential to |
| running the NNTP interface, and strongly recommended for the |
| HTTP interface as it provides thread grouping in addition to |
| normal search functionality. |
| |
| =head1 OPTIONS |
| |
| =over |
| |
| =item -j JOBS |
| |
| =item --jobs=JOBS |
| |
| Influences the number of Xapian indexing shards in a |
| (L<public-inbox-v2-format(5)>) inbox. |
| |
| See L<public-inbox-init(1)/--jobs> for a full description |
| of sharding. |
| |
| C<--jobs=0> is accepted as of public-inbox 1.6.0 |
| to disable parallel indexing regardless of the number of |
| pre-existing shards. |
| |
| If the inbox has not been indexed or initialized, C<JOBS - 1> |
| shards will be created (one job is always needed for indexing |
| the overview and article number mapping). |
| |
| Default: the number of existing Xapian shards |
| |
| =item -c |
| |
| =item --compact |
| |
| Compacts the Xapian DBs after indexing. This is recommended |
| when using C<--reindex> to avoid running out of disk space |
| while indexing multiple inboxes. |
| |
| While option takes a negligible amount of time compared to |
| C<--reindex>, it requires temporarily duplicating the entire |
| contents of the Xapian DB. |
| |
| This switch may be specified twice, in which case compaction |
| happens both before and after indexing to minimize the temporal |
| footprint of the (re)indexing operation. |
| |
| Available since public-inbox 1.4.0. |
| |
| =item --reindex |
| |
| Forces a re-index of all messages in the inbox. |
| This can be used for in-place upgrades and bugfixes while |
| NNTP/HTTP server processes are utilizing the index. Keep in |
| mind this roughly doubles the size of the already-large |
| Xapian database. Using this with C<--compact> or running |
| L<public-inbox-compact(1)> afterwards is recommended to |
| release free space. |
| |
| public-inbox protects writes to various indices with |
| L<flock(2)>, so it is safe to reindex (and rethread) while |
| L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or |
| L<public-inbox-learn(1)> run. |
| |
| This does not touch the NNTP article number database. |
| It does not affect threading unless C<--rethread> is |
| used. |
| |
| =item --all |
| |
| Index all inboxes configured in ~/.public-inbox/config. |
| This is an alternative to specifying individual inboxes directories |
| on the command-line. |
| |
| =item --rethread |
| |
| Regenerate internal THREADID and message thread associations |
| when reindexing. |
| |
| This fixes some bugs in older versions of public-inbox. While |
| it is possible to use this without C<--reindex>, it makes little |
| sense to do so. |
| |
| Available in public-inbox 1.6.0+. |
| |
| =item --prune |
| |
| Run L<git-gc(1)> to prune and expire reflogs if discontiguous history |
| is detected. This is intended to be used in mirrors after running |
| L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data |
| is expunged from mirrors. |
| |
| Available since public-inbox 1.2.0. |
| |
| =item --max-size SIZE |
| |
| Sets or overrides L</publicinbox.indexMaxSize> on a |
| per-invocation basis. See L</publicinbox.indexMaxSize> |
| below. |
| |
| Available since public-inbox 1.5.0. |
| |
| =item --batch-size SIZE |
| |
| Sets or overrides L</publicinbox.indexBatchSize> on a |
| per-invocation basis. See L</publicinbox.indexBatchSize> |
| below. |
| |
| When using rotational storage but abundant RAM, using a large |
| value (e.g. C<500m>) with C<--sequential-shard> can |
| significantly speed up and reduce fragmentation during the |
| initial index and full C<--reindex> invocations (but not |
| incremental updates). |
| |
| Available in public-inbox 1.6.0+. |
| |
| =item --no-fsync |
| |
| Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite |
| and Xapian. This is only effective with Xapian 1.4+. This is |
| primarily intended for systems with low RAM and the small |
| (default) C<--batch-size=1m>. Users of large C<--batch-size> |
| may even find disabling L<fdatasync(2)> causes too much dirty |
| data to accumulate, resulting on latency spikes from writeback. |
| |
| Available in public-inbox 1.6.0+. |
| |
| =item --dangerous |
| |
| Speed up initial index by using in-place updates and denying support for |
| concurrent readers. This is only effective with Xapian 1.4+. |
| |
| Available in public-inbox 1.8.0+ |
| |
| =item --sequential-shard |
| |
| Sets or overrides L</publicinbox.indexSequentialShard> on a |
| per-invocation basis. See L</publicinbox.indexSequentialShard> |
| below. |
| |
| Available in public-inbox 1.6.0+. |
| |
| =item --skip-docdata |
| |
| Stop storing document data in Xapian on an existing inbox. |
| |
| See L<public-inbox-init(1)/--skip-docdata> for description and caveats. |
| |
| Available in public-inbox 1.6.0+. |
| |
| =item -E EXTINDEX |
| |
| =item --update-extindex=EXTINDEX |
| |
| Update the given external index (L<public-inbox-extindex-format(5)>. |
| Either the configured section name (e.g. C<all>) or a directory name |
| may be specified. |
| |
| Defaults to C<all> if C<[extindex "all"]> is configured, |
| otherwise no external indices are updated. |
| |
| May be specified multiple times in rare cases where multiple |
| external indices are configured. |
| |
| =item --no-update-extindex |
| |
| Do not update the C<all> external index by default. This negates |
| all uses of C<-E> / C<--update-extindex=> on the command-line. |
| |
| =item --no-multi-pack-index |
| |
| Disables writing the multi-pack-index when using L</--update-extindex>. |
| See L<public-inbox-extindex(1)/--no-multi-pack-index> for details. |
| |
| Available in public-inbox 2.0.0+ |
| |
| =item --since=DATESTRING |
| |
| =item --after=DATESTRING |
| |
| =item --until=DATESTRING |
| |
| =item --before=DATESTRING |
| |
| Passed directly to L<git-log(1)> to limit changes for C<--reindex> |
| |
| =back |
| |
| =head1 FILES |
| |
| For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>. |
| All public-inbox-specific files are contained within the |
| C<$GIT_DIR/public-inbox/> directory. |
| |
| v2 inboxes are described in L<public-inbox-v2-format(5)>. |
| |
| =head1 CONFIGURATION |
| |
| =over 8 |
| |
| =item publicinbox.indexMaxSize |
| |
| Prevents indexing of messages larger than the specified size |
| value. A single suffix modifier of C<k>, C<m> or C<g> is |
| supported, thus the value of C<1m> to prevents indexing of |
| messages larger than one megabyte. |
| |
| This is useful for avoiding memory exhaustion in mirrors |
| via git. It does not prevent L<public-inbox-mda(1)> or |
| L<public-inbox-watch(1)> from importing (and indexing) |
| a message. |
| |
| This option is only available in public-inbox 1.5 or later. |
| |
| Default: none |
| |
| =item publicinbox.indexBatchSize |
| |
| Flushes changes to the filesystem and releases locks after |
| indexing the given number of bytes. The default value of C<1m> |
| (one megabyte) is low to minimize memory use and reduce |
| contention with parallel invocations of L<public-inbox-mda(1)>, |
| L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>. |
| |
| Increase this value on powerful systems to improve throughput at |
| the expense of memory use. The reduction of lock granularity |
| may not be noticeable on fast systems. With SSDs, values above |
| C<4m> have little benefit. |
| |
| For L<public-inbox-v2-format(5)> inboxes, this value is |
| multiplied by the number of Xapian shards. Thus a typical v2 |
| inbox with 3 shards will flush every 3 megabytes by default |
| unless parallelism is disabled via C<--sequential-shard> |
| or C<--jobs=0>. |
| |
| This influences memory usage of Xapian, but it is not exact. |
| The actual memory used by Xapian and Perl has been observed |
| in excess of 10x this value. |
| |
| This option is available in public-inbox 1.6 or later. |
| public-inbox 1.5 and earlier used the current default, C<1m>. |
| |
| Default: 1m (one megabyte) |
| |
| =item publicinbox.indexSequentialShard |
| |
| For L<public-inbox-v2-format(5)> inboxes, setting this to C<true> |
| allows indexing Xapian shards in multiple passes. This speeds up |
| indexing on rotational storage with high seek latency by allowing |
| individual shards to fit into the kernel page cache. |
| |
| Using a higher-than-normal number of C<--jobs> with |
| L<public-inbox-init(1)> may be required to ensure individual |
| shards are small enough to fit into cache. |
| |
| Warning: interrupting C<public-inbox-index(1)> while this option |
| is in use may leave the search indices out-of-date with respect |
| to SQLite databases. WWW and IMAP users may notice incomplete |
| search results, but it is otherwise non-fatal. Using C<--reindex> |
| will bring everything back up-to-date. |
| |
| Available in public-inbox 1.6.0+. |
| |
| This is ignored on L<public-inbox-v1-format(5)> inboxes. |
| |
| Default: false, shards are indexed in parallel |
| |
| =item publicinbox.<name>.indexSequentialShard |
| |
| Identical to L</publicinbox.indexSequentialShard>, |
| but only affect the inbox matching E<lt>nameE<gt>. |
| |
| =back |
| |
| =head1 ENVIRONMENT |
| |
| =over 8 |
| |
| =item PI_CONFIG |
| |
| Used to override the default "~/.public-inbox/config" value. |
| |
| =item XAPIAN_FLUSH_THRESHOLD |
| |
| The number of documents to update before committing changes to |
| disk. This environment is handled directly by Xapian, refer to |
| Xapian API documentation for more details. |
| |
| For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize> |
| instead. |
| |
| Setting C<XAPIAN_FLUSH_THRESHOLD> or |
| C<publicinbox.indexBatchSize> for a large C<--reindex> may cause |
| L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and |
| L<public-inbox-watch(1)> tasks to wait long and unpredictable |
| periods of time during C<--reindex>. |
| |
| Default: none, uses C<publicinbox.indexBatchSize> |
| |
| =back |
| |
| =head1 UPGRADING |
| |
| Occasionally, public-inbox will update its schema version and |
| require a full index by running this command. |
| |
| =head1 CONTACT |
| |
| Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org> |
| |
| The mail archives are hosted at L<https://public-inbox.org/meta/> and |
| L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/> |
| |
| =head1 COPYRIGHT |
| |
| Copyright all contributors L<mailto:meta@public-inbox.org> |
| |
| License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt> |
| |
| =head1 SEE ALSO |
| |
| L<Search::Xapian>, L<DBD::SQLite>, L<public-inbox-extindex-format(5)> |