lib: implement new query parser
mu's query parser is the piece of software that turns your queries
into something the Xapian database can understand. So, if you query
"maildir:/inbox and subject:bla" this must be translated into a
Xapian::Query object which will retrieve the sought after messages.
Since mu's beginning, almost a decade ago, this parser was based on
Xapian's default Xapian::QueryParser. It works okay, but wasn't really
designed for the mu use-case, and had a bit of trouble with anything
that's not A..Z (think: spaces, special characters, unicode etc.).
Over the years, mu added quite a bit of pre-processing trickery to
deal with that. Still, there were corner cases and bugs that were
practically unfixable.
The solution to all of this is to have a custom query processor that
replaces Xapian's, and write it from the ground up to deal with the
special characters etc. I wrote one, as part of my "future, post-1.0
mu" research project, and I have now backported it to mu 0.9.19.
From a technical perspective, this is a major cleanup, and allows us
to get rid of much of the fragile preprocessing both for indexing and
querying. From an end-user perspective this (hopefully) means that
many of the little parsing issues are gone, and it opens the way for
some new features.
From an end-user perspective:
- better support for special characters.
- regexp search! yes, you can now search for regular expressions, e.g.
subject:/h.ll?o/
will find subjects with hallo, hello, halo, philosophy, ...
As you can imagine, this can be a _heavy_ operation on the database,
and might take quite a bit longer than a normal query; but it can be
quite useful.
2017-10-24 21:55:35 +02:00
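To make the regexp example in the commit message above concrete, here is a minimal sketch of why subject:/h.ll?o/ also finds "philosophy": the match is unanchored, so the pattern only has to occur somewhere inside the subject. The helper name subject_matches is hypothetical, and using std::regex with its default ECMAScript grammar is an assumption; mu's actual matching code is not shown in this file.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, for illustration only (not part of mu).
// Returns true if `pattern` occurs anywhere inside `subject`; the search is
// unanchored, so a match in the middle of a word counts too.
bool subject_matches(const std::string& pattern, const std::string& subject)
{
	const std::regex rx{pattern}; // default ECMAScript grammar (assumption)
	return std::regex_search(subject, rx);
}
```

With the pattern "h.ll?o", this accepts "hallo", "hello", "halo" and "philosophy" (via the substring "hilo"), and rejects a word such as "help".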
|
|
|
/*
|
2022-06-12 18:44:00 +02:00
|
|
|
** Copyright (C) 2017-2022 Dirk-Jan C. Binnema <djcb@djcbsoftware.nl>
|
2017-10-24 21:55:35 +02:00
|
|
|
**
|
|
|
|
** This library is free software; you can redistribute it and/or
|
|
|
|
** modify it under the terms of the GNU Lesser General Public License
|
|
|
|
** as published by the Free Software Foundation; either version 2.1
|
|
|
|
** of the License, or (at your option) any later version.
|
|
|
|
**
|
|
|
|
** This library is distributed in the hope that it will be useful,
|
|
|
|
** but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
** Lesser General Public License for more details.
|
|
|
|
**
|
|
|
|
** You should have received a copy of the GNU Lesser General Public
|
|
|
|
** License along with this library; if not, write to the Free
|
|
|
|
** Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
|
|
|
|
** 02110-1301, USA.
|
|
|
|
*/
|
|
|
|
|
2018-05-19 21:22:41 +02:00
|
|
|
#include <config.h>
|
|
|
|
|
2017-10-24 21:55:35 +02:00
|
|
|
#include <xapian.h>
|
2020-02-20 20:53:24 +01:00
|
|
|
#include "mu-xapian.hh"
|
2019-12-30 21:28:53 +01:00
|
|
|
#include <utils/mu-error.hh>
|
2017-10-24 21:55:35 +02:00
|
|
|
|
2019-12-16 21:41:17 +01:00
|
|
|
using namespace Mu;
|
2017-10-24 21:55:35 +02:00
|
|
|
|
|
|
|
static Xapian::Query
|
2021-10-20 11:18:15 +02:00
|
|
|
xapian_query_op(const Mu::Tree& tree)
|
2017-10-24 21:55:35 +02:00
|
|
|
{
|
2022-06-12 18:44:00 +02:00
|
|
|
if (tree.node.type == Node::Type::OpNot) { // OpNot x ::= <all> AND NOT x
|
2021-10-20 11:18:15 +02:00
|
|
|
if (tree.children.size() != 1)
|
|
|
|
throw std::runtime_error("invalid # of children");
|
|
|
|
return Xapian::Query(Xapian::Query::OP_AND_NOT,
|
2022-06-12 18:44:00 +02:00
|
|
|
Xapian::Query::MatchAll,
|
|
|
|
xapian_query(tree.children.front()));
|
2017-10-24 21:55:35 +02:00
|
|
|
}
|
2022-06-12 18:44:00 +02:00
|
|
|
|
|
|
|
const auto op = std::invoke([](Node::Type ntype) {
|
|
|
|
#pragma GCC diagnostic push
|
|
|
|
#pragma GCC diagnostic ignored "-Wswitch-enum"
|
|
|
|
switch (ntype) {
|
|
|
|
case Node::Type::OpAnd:
|
|
|
|
return Xapian::Query::OP_AND;
|
|
|
|
case Node::Type::OpOr:
|
|
|
|
return Xapian::Query::OP_OR;
|
|
|
|
case Node::Type::OpXor:
|
|
|
|
return Xapian::Query::OP_XOR;
|
|
|
|
case Node::Type::OpAndNot:
|
|
|
|
return Xapian::Query::OP_AND_NOT;
|
|
|
|
case Node::Type::OpNot:
|
|
|
|
default:
|
|
|
|
throw Mu::Error(Error::Code::Internal, "invalid op"); // bug
|
|
|
|
}
|
2020-11-03 08:58:59 +01:00
|
|
|
#pragma GCC diagnostic pop
|
2022-06-12 18:44:00 +02:00
|
|
|
}, tree.node.type);
|
|
|
|
|
2017-10-24 21:55:35 +02:00
|
|
|
std::vector<Xapian::Query> childvec;
|
2021-10-20 11:18:15 +02:00
|
|
|
for (const auto& subtree : tree.children)
|
2017-10-24 21:55:35 +02:00
|
|
|
childvec.emplace_back(xapian_query(subtree));
|
|
|
|
|
|
|
|
return Xapian::Query(op, childvec.begin(), childvec.end());
|
|
|
|
}
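The OpNot branch in xapian_query_op above rewrites a unary "NOT x" as "<all> AND NOT x". A minimal sketch of that identity on plain sets of document ids, with no Xapian involved: DocSet, and_not and op_not are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Hypothetical stand-in for "the set of documents matching a query".
using DocSet = std::set<int>;

// lhs AND NOT rhs: every document in lhs that is not in rhs.
DocSet and_not(const DocSet& lhs, const DocSet& rhs)
{
	DocSet out;
	std::set_difference(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
			    std::inserter(out, out.end()));
	return out;
}

// Unary NOT x, expressed as <all> AND NOT x.
DocSet op_not(const DocSet& all, const DocSet& x)
{
	return and_not(all, x);
}
```

The point of the rewrite is that a binary AND NOT needs a left-hand side, so the unary form supplies "everything" as that operand.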
|
|
|
|
|
2018-05-18 14:55:40 +02:00
|
|
|
static Xapian::Query
|
2022-06-12 18:44:00 +02:00
|
|
|
make_query(const FieldValue& fval, bool maybe_wildcard)
|
2018-05-18 14:55:40 +02:00
|
|
|
{
|
2022-06-12 18:44:00 +02:00
|
|
|
const auto vlen{fval.value().length()};
|
|
|
|
if (!maybe_wildcard || vlen <= 1 || fval.value()[vlen - 1] != '*')
|
|
|
|
return Xapian::Query(fval.field().xapian_term(fval.value()));
|
2018-05-18 14:55:40 +02:00
|
|
|
else
|
|
|
|
return Xapian::Query(Xapian::Query::OP_WILDCARD,
|
2022-06-12 18:44:00 +02:00
|
|
|
fval.field().xapian_term(fval.value().substr(0, vlen - 1)));
|
2018-05-18 14:55:40 +02:00
|
|
|
}
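The wildcard test in make_query above can be isolated from Xapian. The sketch below reproduces only the condition: a value is treated as a wildcard when wildcards are allowed, the value is longer than one character, and it ends in '*'. The function name wildcard_prefix and the std::optional return type are assumptions made for this illustration.

```cpp
#include <optional>
#include <string>

// Hypothetical helper mirroring the condition in make_query().
// Returns the prefix to expand when `value` is a wildcard expression,
// or std::nullopt when it should be treated as a plain term.
std::optional<std::string> wildcard_prefix(const std::string& value, bool maybe_wildcard)
{
	const auto vlen = value.length();
	if (!maybe_wildcard || vlen <= 1 || value[vlen - 1] != '*')
		return std::nullopt;

	return value.substr(0, vlen - 1); // drop the trailing '*'
}
```

Note that a lone "*" is not a wildcard under this rule, because the length check requires at least one character of prefix.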
|
|
|
|
|
2017-10-26 20:31:22 +02:00
|
|
|
static Xapian::Query
|
2021-10-20 11:18:15 +02:00
|
|
|
xapian_query_value(const Mu::Tree& tree)
|
2017-10-26 20:31:22 +02:00
|
|
|
{
|
2022-06-12 18:44:00 +02:00
|
|
|
// an indexable field implies it can be used with a phrase search.
|
|
|
|
const auto& field_val{tree.node.field_val.value()};
|
|
|
|
if (!field_val.field().is_indexable_term()) {
|
|
|
|
/* not an indexable field; no extra magic needed*/
|
|
|
|
return make_query(field_val, true /*maybe-wildcard*/);
|
|
|
|
}
|
2017-10-27 17:42:58 +02:00
|
|
|
|
Avoid word-splitting regular expression matches
Previously, we would conduct regular expression searches by
enumerating all values of a given term, manually regex-matching each
one against our search regular expression, remembering all the term
values that matched, and then doing a big Xapian
OR-query that matched any of those term values. In constructing this
OR-query, however, we would split each term value on spaces and add a
separate Xapian phrase search term for each resulting word. This
approach worked fine most of the time, because when we index a term,
we index both each word in the term and the whole term by itself.
This word splitting produced false negatives in some matches, however,
because Xapian and the Mu-level word splitting code do word splitting
slightly differently and apply different transformations to the text
while splitting. (For example, Xapian transforms fancy Unicode
apostrophes to ASCII apostrophes.)
This patch avoids the problem by not word splitting when constructing
the big Xapian OR-query for finding the results of regular
expression matching.
2022-11-14 17:35:10 +01:00
|
|
|
const bool is_atomic = tree.node.type == Node::Type::ValueAtomic;
|
|
|
|
|
2022-06-12 18:44:00 +02:00
|
|
|
const auto parts{split(field_val.value(), " ")};
|
2017-10-26 20:31:22 +02:00
|
|
|
if (parts.empty())
|
|
|
|
return Xapian::Query::MatchNothing; // shouldn't happen
|
2022-11-14 17:35:10 +01:00
|
|
|
else if (parts.size() == 1 && !is_atomic)
|
2022-06-12 18:44:00 +02:00
|
|
|
return make_query(field_val, true /*maybe-wildcard*/);
|
2022-11-14 17:35:10 +01:00
|
|
|
else if (is_atomic)
|
|
|
|
return make_query(field_val, false /*maybe-wildcard*/);
|
2017-10-26 20:31:22 +02:00
|
|
|
|
2018-11-11 12:15:08 +01:00
|
|
|
std::vector<Xapian::Query> phvec;
|
2022-06-12 18:44:00 +02:00
|
|
|
for (const auto& p : parts) {
|
|
|
|
FieldValue fv{field_val.field_id, p};
|
|
|
|
phvec.emplace_back(make_query(fv, false /*no wildcards*/));
|
|
|
|
}
|
2018-11-11 12:15:08 +01:00
|
|
|
|
2021-10-20 11:18:15 +02:00
|
|
|
return Xapian::Query(Xapian::Query::OP_PHRASE, phvec.begin(), phvec.end());
|
2017-10-26 20:31:22 +02:00
|
|
|
}
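The branching in xapian_query_value above decides between a single term, a phrase, or nothing. A minimal sketch of that decision with the Xapian and field types stripped away: Kind, ValuePlan, split_words and plan_value are hypothetical names, and splitting on whitespace with a string stream is an assumption (mu's own split() may treat repeated spaces differently). The early return for non-indexable fields is left out.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Nothing, Term, Phrase };

// Hypothetical result type: what kind of query to build and from which terms.
struct ValuePlan {
	Kind                     kind;
	std::vector<std::string> terms;
	bool                     maybe_wildcard;
};

// Assumption: words are separated by whitespace; empty parts are dropped.
static std::vector<std::string> split_words(const std::string& value)
{
	std::istringstream       in{value};
	std::vector<std::string> parts;
	for (std::string word; in >> word;)
		parts.emplace_back(word);
	return parts;
}

ValuePlan plan_value(const std::string& value, bool is_atomic)
{
	auto parts = split_words(value);
	if (parts.empty())
		return {Kind::Nothing, {}, false}; // nothing to search for
	if (is_atomic)
		return {Kind::Term, {value}, false}; // keep whole value, no wildcard
	if (parts.size() == 1)
		return {Kind::Term, {value}, true}; // single word, wildcard allowed
	return {Kind::Phrase, std::move(parts), false}; // one term per word
}
```

The atomic case is what keeps a regexp-matched term value from being split into words again, which is the false-negative source the "Avoid word-splitting" commit message describes.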
|
|
|
|
|
|
|
|
static Xapian::Query
|
2021-10-20 11:18:15 +02:00
|
|
|
xapian_query_range(const Mu::Tree& tree)
|
2017-10-26 20:31:22 +02:00
|
|
|
{
|
2022-06-12 18:44:00 +02:00
|
|
|
const auto& field_val{tree.node.field_val.value()};
|
2017-10-26 20:31:22 +02:00
|
|
|
|
2021-10-20 11:18:15 +02:00
|
|
|
return Xapian::Query(Xapian::Query::OP_VALUE_RANGE,
|
2022-06-12 18:44:00 +02:00
|
|
|
field_val.field().value_no(),
|
|
|
|
field_val.range().first,
|
|
|
|
field_val.range().second);
|
2018-11-11 12:15:08 +01:00
|
|
|
}
|
2017-10-26 20:31:22 +02:00
|
|
|
|
2017-10-24 21:55:35 +02:00
|
|
|
Xapian::Query
|
2021-10-20 11:18:15 +02:00
|
|
|
Mu::xapian_query(const Mu::Tree& tree)
|
2017-10-24 21:55:35 +02:00
|
|
|
{
|
2020-11-03 08:58:59 +01:00
|
|
|
#pragma GCC diagnostic push
|
2021-10-20 11:18:15 +02:00
|
|
|
#pragma GCC diagnostic ignored "-Wswitch-enum"
|
2017-10-24 21:55:35 +02:00
|
|
|
switch (tree.node.type) {
|
2022-06-12 18:44:00 +02:00
|
|
|
case Node::Type::Empty:
|
|
|
|
return Xapian::Query();
|
2017-10-24 21:55:35 +02:00
|
|
|
case Node::Type::OpNot:
|
|
|
|
case Node::Type::OpAnd:
|
|
|
|
case Node::Type::OpOr:
|
|
|
|
case Node::Type::OpXor:
|
2022-06-12 18:44:00 +02:00
|
|
|
case Node::Type::OpAndNot:
|
|
|
|
return xapian_query_op(tree);
|
|
|
|
case Node::Type::Value:
|
2022-11-14 17:35:10 +01:00
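The approach described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not mu's actual implementation: `matching_term_values` is a hypothetical helper, the stored term values are assumed to be available as a plain vector of strings, and `std::regex` stands in for whatever regex engine mu uses. The point it shows is that each matching value is kept whole rather than being split into words.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of mu): collect every stored term value in
// which the regular expression matches somewhere. Each value is kept whole;
// it is deliberately NOT split on spaces into separate words.
std::vector<std::string>
matching_term_values(const std::vector<std::string>& values,
		     const std::string& pattern)
{
	// Assumption: case-insensitive ECMAScript syntax, chosen for the sketch.
	const std::regex rx{pattern, std::regex::ECMAScript | std::regex::icase};

	std::vector<std::string> matches;
	for (const auto& value : values)
		if (std::regex_search(value, rx))
			matches.push_back(value); // whole value, no word splitting
	return matches;
}
```

The returned values would then each become one operand of a single OR-query, for instance through Xapian's iterator-based `Xapian::Query(Xapian::Query::OP_OR, begin, end)` constructor. Any field prefix that the terms carry is left out of this sketch.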
	case Node::Type::ValueAtomic:
2022-06-12 18:44:00 +02:00
		return xapian_query_value(tree);
	case Node::Type::Range:
		return xapian_query_range(tree);
	default:
		throw Mu::Error(Error::Code::Internal, "invalid query"); // bug
2017-10-24 21:55:35 +02:00
	}
2020-11-03 08:58:59 +01:00
#pragma GCC diagnostic pop
2017-10-24 21:55:35 +02:00
}