Discussion:
charset-specific searches, and continuation lines
Eric Abrahamsen
2014-09-04 09:21:50 UTC
Hi there,

I'm looking into improving IMAP search support for the Gnus Emacs mail
client, and trying to add the ability to search non-ascii characters. So
far as I know, I start this invocation with something like:

. UID SEARCH CHARSET UTF-8 TEXT {NNN}

Where NNN is the number of bytes in my search string. Dovecot then
responds with:

+ OK

So... what do I do then? I don't actually know what the next statement
is, to provide the actual search string itself. Googling has proved
unhelpful, as most of the examples online don't actually show this "+
OK" response. Can someone just briefly outline what's meant to happen
next? I've tried including the search string immediately after the
byte-size, separated by various combinations of \n\r, but that always
gives me a "Missing LF after literal size" error.

I'm using the Archlinux dovecot package, which reports version 2.2.13-1.

Thanks!
Eric
Michael M Slusarz
2014-09-04 18:08:46 UTC
Eric Abrahamsen
2014-09-05 02:13:17 UTC
Post by Michael M Slusarz
. UID SEARCH CHARSET UTF-8 TEXT {4}
+OK
aéb[CRLF]
* SEARCH XXX
. OK
Literal length is the number of octets in the string - not the number
of characters - so not sure if that was tripping you up.
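
For anyone scripting this, here is a minimal Python sketch of the exchange above. It assumes the search term is encoded as UTF-8 and that the client waits for the server's "+" continuation between the two pieces. The function name literal_search_chunks is made up for illustration and is not part of Gnus or Dovecot.

```python
def literal_search_chunks(tag: str, key: str, term: str) -> list[bytes]:
    """Split a one-term UID SEARCH into the two pieces sent around the
    server's continuation response (hypothetical helper)."""
    data = term.encode("utf-8")  # the literal is the raw UTF-8 octets
    # First piece: the command line ending in {N}, where N counts octets,
    # not characters, followed by CRLF.
    head = f"{tag} UID SEARCH CHARSET UTF-8 {key} {{{len(data)}}}\r\n"
    # Second piece: the octets themselves plus the CRLF that ends the
    # command; send it only after the server has answered with "+".
    tail = data + b"\r\n"
    return [head.encode("ascii"), tail]


# "a\u00e9b" is aéb: 3 characters but 4 octets in UTF-8.
chunks = literal_search_chunks(".", "TEXT", "a\u00e9b")
```

The literal is sent as plain encoded bytes: no quoting and no backslash escapes.
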
Hi Michael,

Well that's embarrassing, I could have sworn that was the first thing I
tried. I knew about the octets, and had tried inputting a\303\251b as
the search string, but was sure I'd also tried the plain old search
string. Thanks!

While I've got you here, I hope you'll answer one more question: what's
the format for searching multiple terms with non-ascii strings? Is it
possible in one run to find a utf-8 encoded subject, and a utf-8 encoded
body?

Thanks again,
Eric
Michael M Slusarz
2014-09-05 02:59:37 UTC
Eric Abrahamsen
2014-09-05 03:44:09 UTC
Post by Michael M Slusarz
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: +OK
C: aéb BODY {4}
S: +OK
C: aéb
S: * SEARCH XXX
S: . OK
Even better... if the server supports LITERAL+, you can send
non-synchronizing literals ({4+} instead of {4}), so you don't have to
wait for the 2 continuation responses:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
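
The two transcripts above could be generated by something like the following Python sketch. It is only an illustration under stated assumptions: search_chunks and send_chunks are hypothetical names, rfile and wfile stand for binary file-like objects wrapped around the connection, and the wait loop is simplified (a real client would also have to handle untagged responses or a tagged NO/BAD arriving instead of the "+").

```python
def search_chunks(tag, terms, literal_plus=False):
    """Build the pieces of a multi-term UID SEARCH that uses literals.

    terms is a list of (search_key, text) pairs. With literal_plus=True
    the non-synchronizing form {N+} is used, which is only valid if the
    server advertises the LITERAL+ capability.
    """
    marker = "+" if literal_plus else ""
    chunks = []
    line = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, text in terms:
        data = text.encode("utf-8")  # the length counts octets
        line += f" {key} {{{len(data)}{marker}}}\r\n".encode("ascii")
        chunks.append(line)
        line = data  # the literal's octets begin the next piece
    chunks.append(line + b"\r\n")  # final CRLF ends the command
    return chunks


def send_chunks(rfile, wfile, chunks, literal_plus=False):
    """Send the pieces; without LITERAL+ wait for "+" between them."""
    for i, chunk in enumerate(chunks):
        wfile.write(chunk)
        wfile.flush()
        is_last = i == len(chunks) - 1
        if not literal_plus and not is_last:
            reply = rfile.readline()
            if not reply.startswith(b"+"):
                raise RuntimeError(f"expected continuation, got {reply!r}")
```

With literal_plus=True nothing is read between pieces, so the whole command goes out in one round trip.
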
Brilliant, thanks a lot! Not something I would have guessed on my own,
and surprisingly hard to find online -- I'm learning to read the RFCs...

Thanks again,
Eric
Eric Abrahamsen
2014-09-11 08:37:12 UTC
One other question:

I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ascii strings, am I still benefiting from the speedups
of the search index?

I know that some people who are indexing non-ascii, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?

Thanks!
Eric
Michael M Slusarz
2014-09-12 07:24:55 UTC
I have no insight on Lucene internals.

michael
