charset-specific searches, and continuation lines

Eric Abrahamsen

2014-09-04 09:21:50 UTC

Hi there,

I'm looking into improving IMAP search support for the Gnus Emacs mail
client, and trying to add the ability to search for non-ASCII characters.
As far as I know, I start this invocation with something like:

. UID SEARCH CHARSET UTF-8 TEXT {NNN}

Where NNN is the number of bytes in my search string. Dovecot then
responds with:

+ OK

So... what do I do then? I don't actually know what the next statement
is, to provide the actual search string itself. Googling has proved
unhelpful, as most of the examples online don't actually show this "+
OK" response. Can someone just briefly outline what's meant to happen
next? I've tried including the search string immediately after the
byte-size, separated by various combinations of \n\r, but that always
gives me a "Missing LF after literal size" error.

I'm using the Arch Linux dovecot package, which reports version 2.2.13-1.

Thanks!
Eric

Michael M Slusarz

2014-09-04 18:08:46 UTC

Eric Abrahamsen

2014-09-05 02:13:17 UTC

Post by Michael M Slusarz

Post by Eric Abrahamsen
Hi there,
I'm looking into improving IMAP search support for the Gnus Emacs mail
client, and trying to add the ability to search for non-ASCII characters.
As far as I know, I start this invocation with something like:
. UID SEARCH CHARSET UTF-8 TEXT {NNN}
Where NNN is the number of bytes in my search string. Dovecot then
responds with:
+ OK
So... what do I do then? I don't actually know what the next statement
is, to provide the actual search string itself. Googling has proved
unhelpful, as most of the examples online don't actually show this "+
OK" response. Can someone just briefly outline what's meant to happen
next? I've tried including the search string immediately after the
byte-size, separated by various combinations of \n\r, but that always
gives me a "Missing LF after literal size" error.

. UID SEARCH CHARSET UTF-8 TEXT {4}
+ OK
aéb[CRLF]
* SEARCH XXX
. OK
Literal length is the number of octets in the string - not the number
of characters - so not sure if that was tripping you up.
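
To make the octet count concrete, here is a minimal Python sketch of the two pieces a client sends for one synchronizing literal. The helper name `literal_search` and the `.` tag are illustrative assumptions, not anything from Gnus or Dovecot. It only builds the bytes; a real client would send the first piece, wait for the server's `+` continuation, and then send the second.

```python
def literal_search(term, key="TEXT", tag="."):
    """Return (command_line, literal_line) as bytes for a UTF-8 search.

    Hypothetical helper, for illustration only.
    """
    payload = term.encode("utf-8")  # the literal size counts octets
    command = f"{tag} UID SEARCH CHARSET UTF-8 {key} {{{len(payload)}}}\r\n"
    # After sending `command`, wait for the "+ ..." continuation, then send
    # the raw octets followed by CRLF to terminate the command.
    return command.encode("ascii"), payload + b"\r\n"


term = "a\u00e9b"                    # "aéb": 3 characters
command, literal = literal_search(term)
print(len(term), len(term.encode("utf-8")))  # 3 4
print(command)   # b'. UID SEARCH CHARSET UTF-8 TEXT {4}\r\n'
print(literal)   # b'a\xc3\xa9b\r\n'
```

The octets `\xc3\xa9` are the same bytes as the octal escapes `\303\251`; they go on the wire as raw bytes, not as backslash text.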

Hi Michael,

Well that's embarrassing, I could have sworn that was the first thing I
tried. I knew about the octets, and had tried inputting a\303\251b as
the search string, but was sure I'd also tried the plain old search
string. Thanks!

While I've got you here, I hope you'll answer one more question: what's
the format for searching multiple terms with non-ASCII strings? Is it
possible in one run to find a UTF-8 encoded subject, and a UTF-8 encoded
body?

Thanks again,
Eric

Michael M Slusarz

2014-09-05 02:59:37 UTC

Eric Abrahamsen

2014-09-05 03:44:09 UTC

Post by Michael M Slusarz

Post by Eric Abrahamsen
While I've got you here, I hope you'll answer one more question: what's
the format for searching multiple terms with non-ASCII strings? Is it
possible in one run to find a UTF-8 encoded subject, and a UTF-8 encoded
body?

C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: + OK
C: aéb BODY {4}
S: + OK
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
Even better... if the server supports LITERAL+, you can use
non-synchronizing literals ({4+}), so you don't have to wait for the
server's continuation response before sending each of the 2 literals:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
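
The two exchanges above can be sketched in Python as a function that splits a multi-term search into the chunks the client sends. The name `build_search` and its parameters are hypothetical, chosen only for this illustration. With synchronizing literals the client has to wait for a `+` continuation after every chunk except the last; with LITERAL+ the chunks can simply be concatenated and sent in one write.

```python
def build_search(criteria, literal_plus=False, tag="."):
    """Split a multi-term UTF-8 search into the chunks a client sends.

    criteria: list of (search_key, term) pairs, e.g. [("SUBJECT", "x")].
    Hypothetical helper, for illustration only.
    """
    marker = "+" if literal_plus else ""  # "{4+}" is a non-synchronizing literal
    chunks = []
    line = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, term in criteria:
        payload = term.encode("utf-8")
        # Announce the literal size in octets, then end the line.
        line += f" {key} {{{len(payload)}{marker}}}\r\n".encode("ascii")
        chunks.append(line)
        line = payload                    # next chunk starts with the raw octets
    chunks.append(line + b"\r\n")         # final CRLF terminates the command
    return chunks


terms = [("SUBJECT", "a\u00e9b"), ("BODY", "a\u00e9b")]
for chunk in build_search(terms):
    print(chunk)
# b'. UID SEARCH CHARSET UTF-8 SUBJECT {4}\r\n'
# b'a\xc3\xa9b BODY {4}\r\n'
# b'a\xc3\xa9b\r\n'
print(b"".join(build_search(terms, literal_plus=True)))
```

Whether `{N+}` may be used at all depends on the server advertising LITERAL+ in its CAPABILITY response, so a client would check for that before passing `literal_plus=True`.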

Brilliant, thanks a lot! Not something I would have guessed on my own,
and surprisingly hard to find online -- I'm learning to read the RFCs...

Thanks again,
Eric

Eric Abrahamsen

2014-09-11 08:37:12 UTC

One other question:

I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ASCII strings, am I still benefiting from the speedups
of the search index?

I know that some people who are indexing non-ASCII, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?

Thanks!
Eric

Michael M Slusarz

2014-09-12 07:24:55 UTC

Post by Eric Abrahamsen

I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ASCII strings, am I still benefiting from the speedups
of the search index?
I know that some people who are indexing non-ASCII, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?

I have no insight on Lucene internals.

michael