#+TITLE: MU INDEX #+MAN_CLASS_OPTIONS: :section-id "@SECTION_ID@" :date "@MAN_DATE@" * NAME mu-index - index e-mail messages stored in Maildirs * SYNOPSIS *mu [common-options] index* * DESCRIPTION *mu index* is the *mu* command for scanning the contents of Maildir directories and storing the results in a Xapian database. The data can then be queried using *mu-find(1)*. Before the first time you run *mu index*, you must run *mu init* to initialize the database. *index* understands Maildirs as defined by Daniel Bernstein for *qmail(7)*. In addition, it understands recursive Maildirs (Maildirs within Maildirs), Maildir++. It also supports VFAT-based Maildirs which use =!= or =;= as the separators instead of =:=. E-mail messages which are not stored in something resembling a maildir leaf-directory (=cur= and =new=) are ignored, as are the cache directories for =notmuch= and =gnus=, and any dot-directory. Symlinks are followed, and the directories can be spread over multiple filesystems; however note that moving files around is much faster when multiple filesystems are not involved. Be careful to avoid self-referential symlinks! If there is a file called =.noindex= in a directory, the contents of that directory and all of its subdirectories will be ignored. This can be useful to exclude certain directories from the indexing process, for example directories with spam-messages. If there is a file called =.noupdate= in a directory, the contents of that directory and all of its subdirectories will be ignored. This can be useful to speed up things you have some maildirs that never change. =.noupdate= does not affect already-indexed message: you can still search for them. =.noupdate= is ignored when you start indexing with an empty database (such as directly after =mu init=). There also the option *--lazy-check* which can greatly speed up indexing; see below for details. The first run of *mu index* may take a few minutes if you have a lot of mail (tens of thousands of messages). Fortunately, such a full scan needs to be done only once; after that it suffices to index the changes, which goes much faster. See the `PERFORMANCE (i,ii,iii)' below for more information. The optional `phase two' of the indexing-process is the removal of messages from the database for which there is no longer a corresponding file in the Maildir. If you do not want this, you can use ~-n~, ~--nocleanup~. When *mu index* catches one of the signals *SIGINT*, *SIGHUP* or *SIGTERM* (e.g., when you press Ctrl-C during the indexing process), it attempts to shutdown gracefully; it tries to save and commit data, and close the database etc. If it receives another signal (e.g., when pressing Ctrl-C once more), *mu index* will terminate immediately. * INDEX OPTIONS ** --lazy-check in lazy-check mode, *mu* does not consider messages for which the time-stamp (ctime) of the directory they reside in has not changed since the previous indexing run. This is much faster than the non-lazy check, but won't update messages that have change (rather than having been added or removed), since merely editing a message does not update the directory time-stamp. Of course, you can run *mu-index* occasionally without ~--lazy-check~, to pick up such messages. ** --nocleanup disable the database cleanup that *mu* does by default after indexing. ** --reindex perform a complete reindexing of all the messages in the maildir. #+include: "muhome.inc" :minlevel 2 #+include: "common-options.inc" :minlevel 1 * ENCRYPTION *mu index* does _not_ decrypt messages, and only the metadata (such as headers) of encrypted messages makes it to the database. *mu view* and *mu4e* can decrypt messages, but those work with the message directly and the information is not added to the database. * PERFORMANCE ** indexing in ancient times (2009?) As a non-scientific benchmark, a simple test on the author's machine (a Thinkpad X61s laptop using Linux 2.6.35 and an ext3 file system) with no existing database, and a maildir with 27273 messages: #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' $ time mu index --quiet 66,65s user 6,05s system 27% cpu 4:24,20 total #+end_example (about 103 messages per second) A second run, which is the more typical use case when there is a database already, goes much faster: #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' $ time mu index --quiet 0,48s user 0,76s system 10% cpu 11,796 total #+end_example (more than 56818 messages per second) Note that each test flushes the caches first; a more common use case might be to run *mu index* when new mail has arrived; the cache may stay quite `warm' in that case: #+begin_example $ time mu index --quiet 0,33s user 0,40s system 80% cpu 0,905 total #+end_example which is more than 30000 messages per second. ** indexing in 2012 As per June 2012, we did the same non-scientific benchmark, this time with an Intel i5-2500 CPU @ 3.30GHz, an ext4 file system and a maildir with 22589 messages. We start without an existing database. #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' $ time mu index --quiet 27,79s user 2,17s system 48% cpu 1:01,47 total #+end_example (about 813 messages per second) A second run, which is the more typical use case when there is a database already, goes much faster: #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' $ time mu index --quiet 0,13s user 0,30s system 19% cpu 2,162 total #+end_example (more than 173000 messages per second) ** indexing in 2016 As per July 2016, we did the same non-scientific benchmark, again with the Intel i5-2500 CPU @ 3.30GHz, an ext4 file system. This time, the maildir contains 72525 messages. #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' $ time mu index --quiet 40,34s user 2,56s system 64% cpu 1:06,17 total #+end_example (about 1099 messages per second). ** indexing in 2022 A few years later and it is June 2022. There's a lot more happening during indexing, but indexing became multi-threaded and machines are faster; e.g. this is with an AMD Ryzen Threadripper 1950X (16 cores) @ 3.399GHz. The instructions are a little different since we have a proper repeatable benchmark now. After building, #+begin_example $ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches' % THREAD_NUM=4 build/lib/tests/bench-indexer -m perf # random seed: R02Sf5c50e4851ec51adaf301e0e054bd52b 1..1 # Start of bench tests # Start of indexer tests indexed 5000 messages in 20 maildirs in 3763ms; 752 μs/message; 1328 messages/s (4 thread(s)) ok 1 /bench/indexer/4-cores # End of indexer tests # End of bench tests #+end_example Things are again a little faster, even though the index does a lot more now (text-normalizatian, and pre-generating message-sexps). A faster machine helps, too! ** recent releases Indexing the the same 93000-message mail corpus with the last few releases: #+ATTR_MAN: :disable-caption t | release | time (sec) | notes | |---------------+------------+------------------------------------------| | 1.4 | 160s | | | 1.6 | 178s | | | 1.8 | 97s | | | 1.10 | 120s | adds html indexing, sexp-caching | | 1.11 (master) | 96s | adds language-guessing, batch-size=50000 | | | | | Quite some variation! Over time new features / refactoring can change the timings quite a bit. At least for now, the latest code is both the fastest and the most featureful! #+include: "exit-code.inc" :minlevel 1 #+include: "prefooter.inc" * SEE ALSO *maildir(5)*, *mu(1)*, *mu-init(1)*, *mu-find(1)*, *mu-cfind(1)*