Mail rules and international character sets
What follows is a cry for help. As regular readers will know, I don't do code. A little background first.

For as long as I can remember, we have been swamped by Russian spam addressed to several accounts here. Some are spam traps so I don't really mind about them - they provide early warning of new networks with trojaned proxies. Others though are addressed to real users and we have just not been able to stop them. Until something occurred to me earlier today...

Most of these spams use charset=windows-1251 or charset="windows-1251" near the top of the MIME body. The ones that don't use koi8-r.

How about a mail rule that quarantines messages with this characteristic? So we created a rule that looks in the message body for these and moves matching messages to quarantine. Bingo. It works.

When body contains Windows-1251 move to database spamtrap.nsf

The thing is (this will please Mr Schwartz) this doesn't rely on one of those mail rule hacks that I am always going on about. It uses completely standard Domino functionality. That's the good part.

The bad part is that it is the most horrible hack imaginable. Suppose a user here is having an email discourse with an external party and the matter of international character sets is raised. As soon as one or other party uses the text Windows-1251 in an email, the rule is triggered.

So what is needed is a programmatic way of extracting character set information from the MIME header and not from the message body.

Any takers?

Category: Domino: Administration
Comments:

1. Chris LeRoy 13/08/2004 18:12:25
Homepage: http://www.brainbent.com


I am not familiar with how you do your antispam. It sounds like you are using a Domino-based solution?

If so, this wouldn't be very difficult to do. I have a LotusScript that could be easily modified for this task.




2. Chris Linfoot 13/08/2004 20:31:57


It needs to work as a server mail rule. I believe rules are stored as simple @functions.




3. Gretchen 14/08/2004 12:17:14
Homepage: http://www.flick.com/~gretchen/


Hmmm. It will be ironic (if you use email notifications for your blog) if this message gets dumped into your spam filter. :)

I'm neither a Lotus Notes admin nor a real programmer, but is there a way to tell your filters to match across newlines or to otherwise insert regexps?

Matching on:

--<anything>
Content-Type: <anything>charset=<nasty charsets>

will give you an extremely high probability of hitting a MIME header without actually having to parse it out.

If you can't match across lines, then just the second line is still pretty indicative, especially if you can tell Lotus Notes 'make sure Content-Type is at the beginning of the line.' I.e.:

^Content-Type:.*charset=Windows-1251

or, for all the example nasty charsets, in extended regexp mode:

^Content-Type:.*charset=(\"?Windows-1251\"?|koi8-r)
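A minimal sketch of that anchored match, written in Python rather than as a Notes rule (Domino mail rules do not take regexps, so this only illustrates the pattern itself). The function name and the sample charset list are assumptions for illustration; the charsets are the two named in the post.

```python
import re

# Charsets to flag: the two named in the post. Adjust to taste.
# MULTILINE makes ^ match at the start of every line; "." does not cross
# newlines, so the charset must sit on the same line as Content-Type.
CHARSET_RE = re.compile(
    r'^Content-Type:.*charset="?(windows-1251|koi8-r)"?',
    re.IGNORECASE | re.MULTILINE,
)


def looks_like_flagged_charset(raw_message: str) -> bool:
    """Return True if any line of the raw message source looks like a
    Content-Type header declaring one of the flagged charsets."""
    return CHARSET_RE.search(raw_message) is not None
```

This still works on raw text, so a body line that itself begins with "Content-Type:" (a quoted header, say) would match. It only narrows the false positive problem; it does not remove it.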

The really strict way to do it would be to parse out the MIME headers, starting from content-type in the header to look for the boundary, and then searching the content-type field after each boundary for either the encoded string or another boundary field (in the case of multipart MIME messages.) I have a good example of a nested multipart Russian spam here (relevant message headers only, but message included in full up to the first chunk of payload):

From <yadda yadda>
MIME-Version: 1.0
Content-Type: multipart/related;
    type="multipart/alternative";
    boundary="----=_NextPart_000_0000_8B729DDD.714F0406"

This is a multi-part message in MIME format.

------=_NextPart_000_0000_8B729DDD.714F0406
Content-Type: multipart/alternative;
    boundary="----=_NextPart_001_0001_C5DC58C2.B16F4A87"


------=_NextPart_001_0001_C5DC58C2.B16F4A87
Content-Type: text/plain; charset=Windows-1251
Content-Transfer-Encoding: 8bit

<snip>
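The strict approach can be sketched outside Domino with Python's standard email package, which follows the boundaries and nested parts for you. This is an illustration under assumptions, not something a Domino mail rule can run: the function names and the FLAGGED set are hypothetical, and it presumes you can get at the raw message source.

```python
from email import message_from_string

# Charsets to flag: the two named in the post (lower case, because
# get_content_charset() returns the charset coerced to lower case).
FLAGGED = {"windows-1251", "koi8-r"}


def part_charsets(raw_message: str) -> set:
    """Collect the charset declared in the Content-Type header of every
    MIME part, including parts nested inside multipart containers."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():  # walks the message and all nested parts
        charset = part.get_content_charset()
        if charset:
            found.add(charset)
    return found


def has_flagged_charset(raw_message: str) -> bool:
    """True if any part declares a flagged charset in its headers."""
    return bool(part_charsets(raw_message) & FLAGGED)
```

Because only header parameters are inspected, a message that merely mentions Windows-1251 in its text is not flagged.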

In my case, if spamprobe thinks it's a decent match for spam, I check the content-type for being html or multipart and chuck it into a 'I haven't seen a false positive in here in ten thousand messages, so I don't have to get nitpicky about verifying this' folder. But I don't get a lot of valid html or multipart to begin with.




4. Nathan T. Freeman 14/08/2004 20:05:02


I'm pretty sure you're out of luck on this one, Chris. You could make several rules to check for exact matches in each one, or set up an OR condition if you can for just one rule. But you can't hack up a formula for it, because there are no @functions for MIME decoding. The BODY contents are inaccessible to @function evaluation, except perhaps for length counts.




5. Chris Linfoot 16/08/2004 09:31:24


Nathan, that's what I thought too, but then I had another brainwave.

If Domino insists on protecting us from the raw MIME source of the message (and it does - the hack of adding rules like "when body contains Windows-1251" only works when the MIME is slightly broken), then why not operate on the decoded text, not the MIME source?

I have created a rule that checks for the presence of one fairly common Cyrillic character in the subject or body and moves matching messages to a trap, and that seems to work.

Problem solved, I think.




6. Chris Linfoot 16/08/2004 17:13:08


Oh and sorry Gretchen - I'm not ignoring you. Can't use regexp in Notes/Domino mail rules - that would solve an awful lot of problems if we could though...




7. Chris LeRoy 16/08/2004 21:41:12
Homepage: http://www.brainbent.com


I wonder how bad the server overhead would be to write an agent that runs against mail as it is deposited to the mailbox that calls the java.util.regex package to perform regex lookups. Or, thinking back to the antispamagent that was in the sandbox a couple years back, which was essentially basic black and whitelisting... could it be improved with this?

Am I overthinking this? Underthinking it perhaps?




8. Gretchen 17/08/2004 07:51:24
Homepage: http://www.flick.com/~gretchen/


Hmm, if you need to do something else like this, can you invoke helper apps or spawn external scripts that can interact with the message? That might be pretty vile on server overhead, though. It's a shame about the no built-in regexps, but it's good that you found out a definite identifier. I love the Cyrillic character trick!




9. 12/12/2004 21:51:51





10. erik 13/05/2005 00:19:28


Can you set the filter with a set of words in English (or whatever language you DO want) that would be in almost any e-mail, so that those messages hit the filter first? Example: "if And OR The OR Is OR You OR Am in body then ---> send to inbox" (or not to the "Russian Spam" folder). Does this make sense?




11. Chris Linfoot 13/05/2005 08:26:26


@Erik - looks risky. The rule we settled on, which traps at least 99.99% of Russian spam (I would say 100% but there may be one exception somewhere), is this:

When subject contains и OR body contains и [action]

Simple really.
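For anyone filtering outside Domino, the same check on decoded text can be sketched in a few lines of Python. The function names are hypothetical. The second function is a generalisation over the whole basic Cyrillic Unicode block (U+0400 to U+04FF) and is an assumption, not the rule described above.

```python
MARKER = "и"  # the single common Cyrillic letter used in the rule above


def contains_marker(subject: str, body: str) -> bool:
    """Mirror of the rule: the marker appears in the subject or the body.
    Note this is case-sensitive, so capital И alone does not match."""
    return MARKER in subject or MARKER in body


def any_cyrillic(text: str) -> bool:
    """Broader variant (an assumption, not the original rule): True if
    any character falls in the basic Cyrillic block U+0400..U+04FF."""
    return any("\u0400" <= ch <= "\u04ff" for ch in text)
```

The single-letter rule works on the decoded text, so an English message that discusses Windows-1251 by name passes straight through.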




12. 08/12/2005 06:45:16





13. Prabha 08/12/2005 06:51:01


I uploaded a Chinese file, and the upload succeeded. After uploading, the file has to be modified with some contents. I opened the file in read mode, but the characters are not displayed properly; instead they are displayed like ??????. I want to modify the files. Can anyone help me, please?

Thanks in advance




14. Chris Linfoot 08/12/2005 08:49:26


You need to load Chinese character set support in your OS. Which character set you need I don't know. It may be Big5 or GB2312.



