Is There Such a Thing as Bayesian Classification for HTTP Requests?

Dwayne Kristjanson
Senior Programmer, Visual Lizard

We use Atomic Secured Linux on most of our servers to lock them down and keep out spammy content, SQL injections, and cross-site scripting attacks. It's not our only layer of security, by any means, but it's definitely one of the most powerful. It's also one of the strictest, which sometimes causes problems. For example, Casinos of Winnipeg is one of our clients. For obvious reasons, their content occasionally triggers rules meant to prevent gambling-related spam. Other clients have had problems embedding videos, since common methods of video embedding can also be used for cross-site scripting attacks. Managing this is largely a matter of disabling rules or categories of rules when they prove problematic, either across all sites or on a site-by-site basis. However, since the rules are constantly being updated to account for the latest attacks, our list of exceptions needs to be constantly updated as well.

At this point, Atomic Secured Linux approaches the problem by creating rules for each specific type of unwanted content. Some of these rules are quite broad, and may block both clearly unwanted content and other content that's possibly permissible. The general approach reminds me of where email spam filtering was ten years ago. Before Paul Graham published A Plan for Spam in 2002, email spam filters were similarly rule-heavy. As a consequence, they were usually either too weak, letting in large amounts of junk, or too strict, banning lots of perfectly acceptable content. Paul Graham's suggestion was to train a Bayesian classifier on two sets of email: messages he'd already flagged as junk and messages he'd left in his inbox. It worked extremely well, especially in comparison to the tools available at the time.
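To make the idea concrete, here is a minimal Python sketch of that kind of classifier. It is loosely modelled on the approach described in A Plan for Spam, not a faithful reimplementation: the function names, the toy training messages, the 0.01/0.99 clamping bounds, the neutral 0.5 score for unseen tokens, and the top_n cutoff are all our own illustrative assumptions.

```python
from collections import Counter


def train(messages):
    """Count, for each token, how many of the messages contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.lower().split()))
    return counts, len(messages)


def token_probability(token, spam, ham):
    """Estimate how strongly one token suggests spam, clamped to [0.01, 0.99]."""
    spam_counts, n_spam = spam
    ham_counts, n_ham = ham
    s = spam_counts[token] / max(n_spam, 1)  # share of spam containing the token
    h = ham_counts[token] / max(n_ham, 1)    # share of good mail containing it
    if s + h == 0:
        return 0.5  # never seen before: treated as neutral (our assumption)
    return min(0.99, max(0.01, s / (s + h)))


def combined_probability(probs):
    """Combine per-token estimates as prod(p) / (prod(p) + prod(1 - p))."""
    spam_product = 1.0
    ham_product = 1.0
    for p in probs:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


def score(message, spam, ham, top_n=15):
    """Return a value between 0 and 1; higher means more spam-like."""
    tokens = set(message.lower().split())
    probs = [token_probability(t, spam, ham) for t in tokens]
    # Keep only the tokens whose estimate is furthest from neutral.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combined_probability(probs[:top_n])


# Toy training data, invented for illustration only.
spam = train(["win big at the casino now", "cheap pills win money now"])
ham = train(["meeting notes for the web standards talk",
             "the muppets comic book arrived"])
```

With this toy data, score("win money now", spam, ham) lands close to 1 and score("web standards meeting", spam, ham) lands close to 0. The point is that nobody wrote a rule about the word "win"; the weights come entirely from the examples the filter was trained on.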

We suspect a similar approach would work for classifying incoming HTTP requests as likely to include spam, cross-site scripting attacks, or SQL injection attacks. However, so far we have not found any products that use the Bayesian approach. We have found a few related projects online. Jon Bulava wrote a proof-of-concept Apache module that would proxy web content and filter out spammy-looking sites. That's close to what we'd like to have, but it works on the client side, filtering content sent from another server. We'd want to look at the POST or GET data sent to the server instead. For SQL injection protection, SQLassie is a database firewall that uses Bayesian filtering to look for malformed queries that are likely the result of SQL injection attacks. This is apparently a response to brittleness in how the GreenSQL database firewall applies its rules. SQLassie hasn't been released as stable yet, but it's definitely worth keeping an eye on.
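As a rough illustration of what we have in mind, the sketch below applies a naive Bayes classifier with Laplace smoothing to the URL-encoded GET or POST data of a request. It is not an existing product or module. The RequestClassifier class, the tokenizer, the "bad" and "ok" labels, and the sample requests are all hypothetical, and a real filter would need far more training data and would probably look at parameter names and headers too.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl

# Words and numbers become one token each; every other symbol is its own token.
TOKEN_RE = re.compile(r"[a-z0-9_]+|[^\sa-z0-9_]")


def tokenize_request(query_or_body):
    """Decode URL-encoded data and tokenize the parameter values (names are ignored)."""
    tokens = []
    for _name, value in parse_qsl(query_or_body, keep_blank_values=True):
        tokens.extend(TOKEN_RE.findall(value.lower()))
    return tokens


class RequestClassifier:
    """Hypothetical two-class naive Bayes filter for request data."""

    def __init__(self):
        self.token_counts = {"bad": Counter(), "ok": Counter()}
        self.request_counts = Counter()

    def train(self, label, query_or_body):
        self.token_counts[label].update(tokenize_request(query_or_body))
        self.request_counts[label] += 1

    def log_odds_bad(self, query_or_body):
        """Return log-score("bad") minus log-score("ok"); positive means suspicious."""
        tokens = tokenize_request(query_or_body)
        vocab = set(self.token_counts["bad"]) | set(self.token_counts["ok"])
        total_requests = sum(self.request_counts.values())
        scores = {}
        for label in ("bad", "ok"):
            counts = self.token_counts[label]
            # Add-one smoothing, with one extra slot reserved for unseen tokens.
            denom = sum(counts.values()) + len(vocab) + 1
            score = math.log((self.request_counts[label] + 1) / (total_requests + 2))
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores["bad"] - scores["ok"]


# Invented sample requests, for illustration only.
classifier = RequestClassifier()
classifier.train("bad", "id=1%27+OR+%271%27%3D%271")
classifier.train("bad", "comment=%3Cscript%3Ealert(1)%3C%2Fscript%3E")
classifier.train("bad", "q=1+UNION+SELECT+password+FROM+users")
classifier.train("ok", "q=winnipeg+casino+hours")
classifier.train("ok", "comment=Great+post+about+web+standards")
classifier.train("ok", "page=2&sort=date")
```

With this toy data, classifier.log_odds_bad("name=x%27+OR+%271%27%3D%271") comes out positive, while classifier.log_odds_bad("q=casino+hours") comes out negative. That second case is the one we care about: a request mentioning a casino is allowed through because legitimate casino-related requests were part of the training set, rather than because someone remembered to disable a gambling rule.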

If you've heard of an Apache module or other tool for scanning incoming HTTP requests and classifying them as either spam or a potential security threat, let us know. In the meantime, we'll keep adjusting the knobs on our security software to find the right balance between "open to attack" and "locked down too tight to do anything".