mu/lib/parser/xapian.cc


lib: implement new query parser

mu's query parser is the piece of software that turns your queries into something the Xapian database can understand. So, if you query "maildir:/inbox and subject:bla", this must be translated into a Xapian::Query object which will retrieve the sought-after messages.

Since mu's beginning, almost a decade ago, this parser was based on Xapian's default Xapian::QueryParser. It works okay, but wasn't really designed for the mu use-case, and had a bit of trouble with anything that's not A..Z (think: spaces, special characters, unicode etc.). Over the years, mu added quite a bit of pre-processing trickery to deal with that. Still, there were corner cases and bugs that were practically unfixable.

The solution to all of this is to have a custom query processor that replaces Xapian's, and write it from the ground up to deal with the special characters etc. I wrote one, as part of my "future, post-1.0 mu" research project, and I have now backported it to mu 0.9.19.

From a technical perspective, this is a major cleanup, and allows us to get rid of much of the fragile preprocessing both for indexing and querying. From an end-user perspective this (hopefully) means that many of the little parsing issues are gone, and it opens the way for some new features.

From an end-user perspective:
- better support for special characters.
- regexp search! yes, you can now search for regular expressions, e.g. subject:/h.ll?o/ will find subjects with hallo, hello, halo, philosophy, ... As you can imagine, this can be a _heavy_ operation on the database, and might take quite a bit longer than a normal query; but it can be quite useful.
2017-10-24 21:55:35 +02:00
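The regexp example in the commit message may look surprising, since "philosophy" is listed among the matches. The minimal sketch below shows why: the pattern is matched unanchored, so any substring that fits (here "hilo") counts. It only illustrates the regex semantics, using std::regex with its default ECMAScript grammar as an assumption; it does not show how mu scans the Xapian database, and matching_subjects is a hypothetical helper that is not part of mu.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of mu): return every subject that
// contains a match for the given pattern anywhere in the string.
std::vector<std::string>
matching_subjects (const std::string& pattern,
		   const std::vector<std::string>& subjects)
{
	const std::regex rx (pattern); // ECMAScript grammar by default

	std::vector<std::string> hits;
	for (const auto& subject: subjects)
		if (std::regex_search (subject, rx)) // unanchored search
			hits.push_back (subject);

	return hits;
}
```

Called with the pattern "h.ll?o" and the candidates hallo, hello, halo, philosophy and help, this keeps the first four and drops "help", which has no "o" after its "l".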
/*
** Copyright (C) 2017 Dirk-Jan C. Binnema <djcb@djcbsoftware.nl>
**
** This library is free software; you can redistribute it and/or
** modify it under the terms of the GNU Lesser General Public License
** as published by the Free Software Foundation; either version 2.1
** of the License, or (at your option) any later version.
**
** This library is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
** Lesser General Public License for more details.
**
** You should have received a copy of the GNU Lesser General Public
** License along with this library; if not, write to the Free
** Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
** 02110-1301, USA.
*/
#include <stdexcept>
#include <vector>

#include <xapian.h>
#include "parser/xapian.hh"
using namespace Mux;

// Translate an operator node (and its subtrees) into a Xapian query.
static Xapian::Query
xapian_query_op (const Mux::Tree& tree)
{
	Xapian::Query::op op;

	switch (tree.node.type) {
	case Node::Type::OpNot: // OpNot x ::= <all> AND NOT x
		if (tree.children.size() != 1)
			throw std::runtime_error ("invalid # of children");
		return Xapian::Query (Xapian::Query::OP_AND_NOT,
				      Xapian::Query::MatchAll,
				      xapian_query(tree.children.front()));
	case Node::Type::OpAnd:    op = Xapian::Query::OP_AND; break;
	case Node::Type::OpOr:     op = Xapian::Query::OP_OR; break;
	case Node::Type::OpXor:    op = Xapian::Query::OP_XOR; break;
	case Node::Type::OpAndNot: op = Xapian::Query::OP_AND_NOT; break;
	default: throw std::runtime_error ("invalid op"); // bug
	}

	// combine the translated subtrees with the chosen operator
	std::vector<Xapian::Query> childvec;
	for (const auto& subtree: tree.children)
		childvec.emplace_back(xapian_query(subtree));

	return Xapian::Query(op, childvec.begin(), childvec.end());
}
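
// Translate a value node into a term query or, for phrases, a phrase query.
static Xapian::Query
xapian_query_value (const Mux::Tree& tree)
{
	const auto v = dynamic_cast<Value*> (tree.node.data.get());
	if (!v) // dynamic_cast yields nullptr when the data is not a Value
		throw std::runtime_error ("expected value data"); // bug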
	if (!v->phrase)
		return Xapian::Query(v->prefix + v->value);
	// a phrase: one prefixed term per space-separated part
	const auto parts = split (v->value, " ");

	std::vector<Xapian::Query> phvec;
	for (const auto& p: parts) // by reference, to avoid copying each part
		phvec.push_back(Xapian::Query(v->prefix + p));

	if (parts.empty())
		return Xapian::Query::MatchNothing; // shouldn't happen

	if (parts.size() == 1)
		return phvec.front();

	return Xapian::Query (Xapian::Query::OP_PHRASE,
			      phvec.begin(), phvec.end());
}
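
// Translate a range node into a value-range query.
static Xapian::Query
xapian_query_range (const Mux::Tree& tree)
{
	const auto r = dynamic_cast<Range*> (tree.node.data.get());
	if (!r) // dynamic_cast yields nullptr when the data is not a Range
		throw std::runtime_error ("expected range data"); // bug

	return Xapian::Query(Xapian::Query::OP_VALUE_RANGE,
			     (Xapian::valueno)r->id, r->lower, r->upper);
}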

Xapian::Query
Mux::xapian_query (const Mux::Tree& tree)
{
	switch (tree.node.type) {
	case Node::Type::Empty:
		return Xapian::Query();
	case Node::Type::OpNot:
	case Node::Type::OpAnd:
	case Node::Type::OpOr:
	case Node::Type::OpXor:
	case Node::Type::OpAndNot:
		return xapian_query_op (tree);
	case Node::Type::Value:
		return xapian_query_value (tree);
	case Node::Type::Range:
		return xapian_query_range (tree);
	default:
		throw std::runtime_error ("invalid query"); // bug
	}
}