Spam me!

Seltzer · 07-16-2009, 10:03 AM

I never thought I'd be saying this, but I'd like you to forward your spam to spamme.repository@gmail.com, in its original state if possible (without > quote symbols and Fwd: prefixes).

As a personal project, I'm coding a Bayesian spam filter (link here for anyone who's interested). In a nutshell, Bayesian spam filtering involves classifying e-mails as spam or ham (not spam) based on the word content. To illustrate by simplistic example, words like 'inheritance', 'account', 'rolex' are spammy words which tend to occur in spam e-mails. If a newly received e-mail contains a large number of these spammy words, there is a higher chance it will be chucked in the spambox.
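
To make the word-based idea concrete, here is a minimal sketch of how per-word spam probabilities could be combined into a single score, naive Bayes style. The probability table is made up for illustration, and `spam_score` and the neutral value of 0.5 for unknown words are assumptions, not part of the filter described in this post.

```python
import math

# Hypothetical per-word values of P(spam | word); the numbers are invented.
WORD_SPAM_PROB = {
    "inheritance": 0.95,
    "account": 0.80,
    "rolex": 0.99,
    "meeting": 0.10,
    "thanks": 0.15,
}

def spam_score(text, word_probs=WORD_SPAM_PROB, unknown=0.5):
    """Combine per-word probabilities into one spam probability.

    Assumes words are independent (the "naive" part) and that words
    missing from the table are neutral (probability 0.5).
    """
    words = set(text.lower().split())  # each distinct word counts once
    # Sum logs instead of multiplying probabilities to avoid underflow.
    log_spam = sum(math.log(word_probs.get(w, unknown)) for w in words)
    log_ham = sum(math.log(1.0 - word_probs.get(w, unknown)) for w in words)
    # Equivalent to P / (P + Q), where P and Q are the two products.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

print(spam_score("rolex inheritance account"))
print(spam_score("thanks meeting"))
```

A message full of spammy words scores close to 1, one made of hammy words scores close to 0, and a message of only unknown words stays at 0.5.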

In order to build up this database of words and associated probabilistic data, the spam filter needs to undergo a learning phase in which it is fed a large number of e-mails and told whether each one is spam or ham. After the learning phase, it can (theoretically) be entrusted to decide for itself whether an e-mail is spam. I don't actually receive much spam, and what spam I do receive is mostly of one type, which makes it a statistically biased sample. What I need is a lot of spam of many different types, and that's where I hope you guys can help out.
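
As a rough sketch of what that learning phase could look like, the code below counts how many spam and ham e-mails each word appears in and turns those counts into a per-word spam probability. The class name `WordStats`, the whitespace tokenizer, the add-one smoothing, and the assumption of equal spam/ham priors are all illustrative choices rather than a description of the actual filter.

```python
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase, split on whitespace, dedupe."""
    return set(text.lower().split())

class WordStats:
    """Hypothetical word database built up during the learning phase."""

    def __init__(self):
        self.spam_counts = Counter()  # word -> number of spam mails containing it
        self.ham_counts = Counter()   # word -> number of ham mails containing it
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Feed one labelled e-mail into the database."""
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        else:
            self.ham_counts.update(tokenize(text))
            self.n_ham += 1

    def word_spam_prob(self, word):
        """Estimate P(spam | word), assuming spam and ham are equally likely."""
        # Add-one smoothing keeps unseen words away from 0 and 1.
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.n_spam + 2)
        p_word_given_ham = (self.ham_counts[word] + 1) / (self.n_ham + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

stats = WordStats()
stats.learn("cheap rolex watches", is_spam=True)
stats.learn("claim your inheritance rolex", is_spam=True)
stats.learn("lunch meeting tomorrow", is_spam=False)
stats.learn("thanks for the meeting notes", is_spam=False)
print(stats.word_spam_prob("rolex"), stats.word_spam_prob("meeting"))
```

With only one type of spam in the training set, the counts would be dominated by that type's vocabulary, which is the bias problem mentioned above.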

Freebase Dali · 07-16-2009, 04:16 PM

Will do.

What I've noticed with most of my spam is that the account name and subject are often typed L1k3 th15 in order to elude spam filters.
Is that going to be an issue, or would there be some way to have a program determine numbers used in place of letters?

~~The Unfan~~ · 07-16-2009, 05:43 PM

Also sometimes they dodge filters lik.e thi.s.

Freebase Dali · 07-16-2009, 07:05 PM

They're so clever.
Gotta love the ones that are all:

From: Maria D. Sanchez
Subject: Hey! I finally found you!!!
_______________________________________________
body:

OMG ENLARGE UR PENIS LOL

Seltzer · 07-16-2009, 11:20 PM

Deliberate misspelling is a common spam technique which most spam filters account for. I imagine that when filters encounter an unfamiliar word (not in the database), they cycle through all of its possible forms given the possible forms of each letter and check that none of the results match a familiar word before confirming it as a new word and adding it.
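
Here is a minimal sketch of that "cycle through the possible forms" idea, which is only my guess at how a filter might do it. The substitution table `LOOKALIKES`, the function names, and the dot-stripping step (for tricks like lik.e thi.s) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical table of characters spammers swap in for letters.
LOOKALIKES = {
    "1": "il",
    "3": "e",
    "0": "o",
    "5": "s",
    "4": "a",
    "7": "t",
    "@": "a",
    "$": "s",
}

def candidate_forms(token):
    """Return every spelling obtained by keeping or replacing each character."""
    # For each character: the character itself plus any letters it may stand for.
    options = [ch + LOOKALIKES.get(ch, "") for ch in token.lower()]
    return {"".join(chars) for chars in product(*options)}

def resolve(token, known_words):
    """Map an obfuscated token to a familiar word, or keep it as a new word."""
    cleaned = token.lower().replace(".", "")  # undo dots inserted inside words
    matches = candidate_forms(cleaned) & set(known_words)
    # If several familiar words match, pick one deterministically.
    return min(matches) if matches else token.lower()

known = {"like", "this", "rolex"}
print([resolve(t, known) for t in ["L1k3", "th15", "r0l3x", "lik.e", "zzz9"]])
```

The number of candidates grows exponentially with the number of substitutable characters, so a real filter would presumably need a cap on token length or a smarter lookup.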

And the spamming technique of having an innocent subject line, possibly some innocent text, and then the spam message following that is insidious for two reasons. The first is that, having examined the legit-looking subject line, people will often open the e-mail based on that. The second is Bayesian poisoning, which fools some spam filters - spammers will often insert legit-looking paragraphs which contain non-spammy words and then follow those with the spam content. Since Bayesian spam filters tend to consider the spamminess of all words in the e-mail, the legit paragraphs lower the overall spamminess of the e-mail and can allow this spam to slip through the filter. There are simple ways of preventing common Bayesian poisoning though.
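
One commonly described countermeasure (not necessarily the one I had in mind above) is to score only the most "interesting" words, meaning those whose spam probability is furthest from a neutral 0.5, instead of every word in the e-mail. The sketch below assumes per-word probabilities are already available and strictly between 0 and 1; the function name and the cutoff of 15 words are illustrative choices.

```python
import math

def score_most_interesting(probs, keep=15):
    """Combine only the `keep` probabilities furthest from a neutral 0.5.

    `probs` is a list of per-word spam probabilities, each strictly
    between 0 and 1 (clamped beforehand, e.g. to [0.01, 0.99]).
    """
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    log_spam = sum(math.log(p) for p in ranked)
    log_ham = sum(math.log(1.0 - p) for p in ranked)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Five strongly spammy words padded with forty mildly innocent ones.
poisoned_mail = [0.99] * 5 + [0.30] * 40

print(score_most_interesting(poisoned_mail, keep=len(poisoned_mail)))  # all words
print(score_most_interesting(poisoned_mail, keep=15))                  # top 15 only
```

Scoring every word lets the padding drag the result below 0.5, while scoring only the top 15 keeps it well above. This only blunts padding made of mildly innocent words; padding chosen from words the filter considers strongly hammy would still get through.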

A very common spamming technique I see nowadays is image spam. The idea there is that the spam is contained within the image so text-based spam filters cannot process it. But Gmail uses optical character recognition (as used in scanners) to extract text from those pictures - and it probably treats the text quite suspiciously.

Seltzer · 07-23-2009, 07:29 AM

So does anyone else have some delectable spam for me? If you'd like to forward it, the address is spamme.repository@gmail.com
