Like others have said, you might be overfitting your training data here: your model is just memorising the examples you give it and would fail if somebody slightly varied a payload (e.g. by inserting some whitespace).
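A quick way to sanity-check that: take a known-malicious payload and feed the model trivially obfuscated variants of it. A rough sketch (the `whitespace_variants` helper and the payload are made up for illustration; the classifier call is left as a comment since I don't know your API):

```python
import re

def whitespace_variants(payload):
    """Generate a few trivially obfuscated versions of a payload."""
    yield payload.replace(" ", "  ")      # doubled spaces
    yield payload.replace(" ", "\t")      # tabs instead of spaces
    yield re.sub(r"\s+", " ", payload)    # collapsed whitespace
    yield payload.replace(" ", "/**/")    # SQL comment as whitespace

payload = "' OR 1=1 --"
for variant in whitespace_variants(payload):
    print(repr(variant))
    # If model.predict([variant]) disagrees with model.predict([payload]),
    # the model has likely memorised surface forms rather than structure.
```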
Another thing to keep in mind is that an accuracy of 99% doesn't mean much in an unbalanced problem like yours (many more clean queries than malicious ones).
What you should report instead is precision (of the queries labeled malicious, how many are actually malicious?) and recall (of the malicious queries in the dataset, how many did your model label as malicious?).
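To make that concrete, here's a toy example with sklearn (the 990:10 split is made up): a degenerate model that labels everything "clean" still scores 99% accuracy, while its precision and recall on the malicious class are both zero.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 990 + [1] * 10   # 1 = malicious, 0 = clean
y_pred = [0] * 1000             # degenerate model: always predicts "clean"

print(accuracy_score(y_true, y_pred))                    # 0.99
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
```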
Trigrams are a known sweet spot. It's a practical heuristic; I've got no proofs to link, but I wrote a Wikipedia semantic parser and saw similar results in practice.
Going above 3 adds very, very little precision while guzzling space; using 2 shows a drop in precision.
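To see the space cost, here's a toy count of distinct character n-grams in a single query (my own sketch, nothing from the original post). The number of distinct n-grams, and hence feature dimensions, grows quickly with n, while each extra character of context adds less and less:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

query = "SELECT * FROM users WHERE id = 1 OR 1=1"
for n in (2, 3, 4):
    grams = char_ngrams(query, n)
    print(n, len(grams), grams.most_common(3))
```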
Probably has to do with how human-built systems are designed (we make them in our image, and we seem to have a thing for 3s: (subject, verb, object) -> (origin, data, destination), etc.)
Addendum: if you know what PCA is, I'd wager that added n-gram dimensions are nearly linearly dependent on the lower-order ones - i.e. strongly correlated with them (corr(A, B) -> 1) - so they add very little to the data's variability once you start adding dims above 3.
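A rough way to test that wager, assuming sklearn and a toy query set (all names and queries here are mine): vectorise the same queries with 1-3-grams and with 1-5-grams, run PCA on each, and compare how many components you need to cover most of the variance. If the extra 4/5-gram dimensions were largely linear combinations of lower-order ones, both settings would flatten out at a similar component count despite very different raw dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 2",
    "' OR 1=1 --",
    "'; DROP TABLE users; --",
    "SELECT * FROM orders WHERE total > 100",
    "admin'--",
]

for ngram_range in [(1, 3), (1, 5)]:
    # Character n-gram counts as features.
    X = CountVectorizer(analyzer="char", ngram_range=ngram_range) \
        .fit_transform(queries).toarray()
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    # First component count whose cumulative variance reaches 95%.
    k = int(np.searchsorted(cum, 0.95)) + 1
    print(ngram_range, X.shape[1], "dims;", k, "components for 95% variance")
```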