07-16-2009, 10:03 AM | #1 (permalink) |
Fish in the percolator!
Join Date: Dec 2005
Location: Hobbit Land NZ
Posts: 2,870
|
Spam me!
I never thought I'd be saying this, but I'd like you to forward your spam to spamme.repository@gmail.com, in its original state if possible (without > symbols and Fwd: prefixes).
As a personal project, I'm coding a Bayesian spam filter (link here for anyone who's interested). In a nutshell, Bayesian spam filtering involves classifying e-mails as spam or ham (not spam) based on their word content. To illustrate with a simplistic example, words like 'inheritance', 'account' and 'rolex' are spammy words which tend to occur in spam e-mails. The more of these spammy words a newly received e-mail contains, the higher the chance it will be chucked in the spambox.
To build up this database of words and associated probabilistic data, the spam filter needs to undergo a learning phase in which it is fed a large number of e-mails and told whether each one is spam or ham. After the learning phase, it can (theoretically) be entrusted to decide for itself whether an e-mail is spam.
I don't actually receive much spam, and what spam I do receive is mostly of one type, which makes it a statistically biased sample. What I need is a lot of spam of many different types, and that's where I hope you guys can help out.
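For anyone curious what the learn-then-classify idea looks like in practice, here is a minimal Python sketch of a naive Bayes filter. It is only an illustration, not the actual code of the project mentioned above: the names `tokenize`, `NaiveBayesFilter`, `train` and `spam_probability` are made up for this example, it assumes each word contributes independently, and it uses add-one smoothing so unseen words don't zero out the result.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class NaiveBayesFilter:
    """Hypothetical minimal filter: counts how many spam/ham e-mails contain each word."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learning phase: record one e-mail that a human labelled 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | words), looking only at words seen during training."""
        known = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        words = tokenize(text) & known
        total = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            n = self.message_counts[label]
            # Smoothed prior, so an untrained class never yields log(0).
            score = math.log((n + 1) / (total + 2))
            for word in words:
                # Add-one smoothed estimate of P(word appears | label).
                score += math.log((self.word_counts[label][word] + 1) / (n + 2))
            log_score[label] = score
        # Normalise the two log scores into a probability.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)
```

After a few calls such as `train("cheap rolex watches", "spam")` and `train("lunch meeting tomorrow", "ham")`, `spam_probability(...)` returns a value between 0 and 1, and the filter would bin anything above a chosen threshold. With a tiny or one-sided training set the estimates are unreliable, which is exactly why a varied spam collection matters.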
|
07-16-2009, 04:16 PM | #2 (permalink) |
Partying on the inside
Join Date: Mar 2009
Posts: 5,584
|
Will do.
What I've noticed with most of my spam is that the account name and subject are often typed L1k3 th15 in order to elude spam filters. Is that going to be an issue, or is there some way for a program to detect numbers used in place of letters?
|
07-16-2009, 07:05 PM | #4 (permalink) |
Partying on the inside
Join Date: Mar 2009
Posts: 5,584
|
They're so clever.
Gotta love the ones that are all:
From: Maria D. Sanchez
Subject: Hey! I finally found you!!!
_______________________________________________
body: OMG ENLARGE UR PENIS LOL
|
07-16-2009, 11:20 PM | #5 (permalink) |
Fish in the percolator!
Join Date: Dec 2005
Location: Hobbit Land NZ
Posts: 2,870
|
Deliberate misspelling is a common spam technique which most spam filters account for. I imagine that when a filter encounters an unfamiliar word (one not in its database), it cycles through every variant of that word obtained by swapping each character for the letters it could stand for. Only if none of the variants matches a familiar word does it confirm the word as new and add it.
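A rough Python sketch of that guess, purely as an illustration and not a description of how any real filter works: the `LOOKALIKES` table and the function names `candidate_forms` and `resolve` are assumptions invented for this example.

```python
from itertools import product

# Hypothetical table: each disguise character maps to the letters it may stand for.
LOOKALIKES = {
    "0": "o",
    "1": "il",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
}


def candidate_forms(token):
    """Return every spelling obtained by keeping or replacing each character."""
    # For each position, the choices are the character itself plus its lookalikes.
    options = [ch + LOOKALIKES.get(ch, "") for ch in token.lower()]
    return {"".join(chars) for chars in product(*options)}


def resolve(token, known_words):
    """Return a familiar word hiding behind the token, or None if it looks new."""
    matches = candidate_forms(token) & set(known_words)
    # If several familiar words match, pick one deterministically.
    return min(matches) if matches else None
```

With a database containing "like" and "this", `resolve("L1k3", ...)` and `resolve("th15", ...)` map back to those words, while a token with no familiar variant returns `None` and would be added as a new word. Note that the number of variants grows exponentially with the number of disguised characters, so a real filter would probably cap the token length or use a smarter lookup.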
And the spamming technique of having an innocent subject line, possibly some innocent text, and then the spam message following that is insidious for two reasons. The first is that people will often open the e-mail on the strength of the legit-looking subject line. The second is Bayesian poisoning, which fools some spam filters: spammers insert legit-looking paragraphs full of non-spammy words and then follow those with the spam content. Since Bayesian spam filters tend to consider the spamminess of all the words in the e-mail, the legit paragraphs lower its overall spamminess and can allow the spam to slip through the filter. There are simple ways of preventing common Bayesian poisoning though.
A very common spamming technique I see nowadays is image spam. The idea there is that the spam is contained within an image, so text-based spam filters cannot process it. But Gmail uses optical character recognition (as used in scanners) to extract text from those pictures, and it probably treats that text quite suspiciously.
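On the Bayesian poisoning point, one widely described countermeasure (it appears in Paul Graham's essay "A Plan for Spam", and is not necessarily what was meant above) is to score an e-mail using only its most "interesting" tokens, meaning those whose spam probability is farthest from a neutral 0.5. Mildly innocent padding then gets ignored. Below is a minimal Python sketch; the function name `combined_spam_score` and its parameters are made up for this example, and the per-token probabilities are assumed to come from an already trained filter.

```python
import math


def combined_spam_score(token_probs, top_n=15):
    """Combine per-token spam probabilities using only the most decisive tokens."""
    # Clamp so no single token can claim absolute certainty (and log(0) is avoided).
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    # Keep the top_n tokens whose probability is farthest from the neutral 0.5.
    interesting = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    log_spam = sum(math.log(p) for p in interesting)
    log_ham = sum(math.log(1 - p) for p in interesting)
    diff = log_spam - log_ham
    # Numerically stable logistic function of the log-odds.
    if diff >= 0:
        return 1 / (1 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1 + e)
```

For instance, three tokens at 0.99 buried under forty padding tokens at 0.2 still score close to 1 with `top_n=3`, but close to 0 if every token is counted. This only blunts padding made of mildly innocent words; padding chosen from words that are strongly hammy for a particular recipient can still crowd out the spam tokens.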
|
07-23-2009, 07:29 AM | #6 (permalink) |
Fish in the percolator!
Join Date: Dec 2005
Location: Hobbit Land NZ
Posts: 2,870
|
So does anyone else have some delectable spam for me? If you'd like to forward it, the address is spamme.repository@gmail.com