The script you've been working on above looks like it might do roughly what is wanted here:
http://www.digitalfaq.com/forum/news...ase-email.html
- sort emails by TLD
- sort by FQDN (domain, not subdomain)
- sort by TLD
- main missing aspect is filter out freemails (gmail, hotmail, etc)
Happy butterflies in stomach, elated anticipation there, for that one.
For the spam/disposable de-dupe, this is what is needed, why, where the data comes from, and how it will be used:
These are our current internal lists, which block forum registrations:
Worthless TLDs, 99-100% used by spam or abuse.
Code:
.accountant
.ae
.af
.asia
.bid
.by
.cat
.cc
.cf
.click
.club
.country
.cn
.cx
.date
.dj
.download
.faith
.fun
.ga
.gdn
.gq
.hk
.hm
.id
.info
.ir
.jetzt
.kim
.kr
.la
.life
.link
.live
.loan
.ltd
.me
.men
.ml
.mobi
.mom
.museum
.nf
.ninja
.np
.qq
.online
.ooo
.ovh
.party
.ph
.pw
.pm
.pro
.racing
.red
.reise
.ren
.rocks
.ru
.science
.so
.space
.st
.stream
.su
.tc
.tk
.to
.top
.trade
.ua
.vip
.vn
.wang
.webcam
.win
.work
.world
.ws
.xyz
.zip
Known subdomain use, bad domains. These are wildcarded entries:
Code:
.10mail.org
.1s.fr
.33mail.com
.abrupter.com
.adriaticmail.com
.anonbox.net
.axeprim.eu
.bulc.club
.ceramicsouvenirs.com
.coolyarddecorations.com
.dropmail.me
.dynainbox.com
.e4ward.com
.emailtmp.com
.emific.com
.factorican.com
.goverloe.com
.hopto.org
.ilyushu.com
.instambox.com
.itemxyz.com
.mailcatch.com
.mailexpire.com
.marvsz.com
.mezimages.net
.minemail.in
.minespace.in
.mintemail.com
.mistrioni.com
.otherinbox.com
.piquate.com
.pixymix.com
.thc.lv
.universallightkeys.com
.vondata.com.ar
.yourdomain.com
.x24hr.com
Bad domains:
Code:
@0clickemail.com
@0hcow.com
@0hdear.com
@0hio.net
@0ils.org
@0live.org
@0nce.net
@0wnd.net
@0wnd.org
@10mail.org
@10minut.com
@10minut.com.pl
@10minutemail.co.uk
@10minutemail.com
@10minutemail.de
@10minutemail.eu
@10minutemail.net
@10minutemail.org
@10minutemail.us
@10minutmail.pl
@115mail.net
@123-m.com
@126.com
@139.com
@163.com
@1pad.de
@10minutemail.co.uk
@10minutemail.co.za
@1shivom.com
@20email.eu
@20mail.in
@20mail.it
@20minutemail.com
@21cn.com
@24hinbox.com
@2emea.com
@2odem.com
@2prong.com
@30minutemail.com
@33mail.com
@3d-painting.com
@4warding.com
@4warding.net
@4warding.org
@6ip.us
@6paq.com
@6url.com
@60minutemail.com
@675hosting.com
@675hosting.net
@675hosting.org
@7days-printing.com
@7tags.com
@75hosting.com
@75hosting.net
@75hosting.org
@9ox.net
@99experts.com
etc
We audit registrations, and update this list as needed. But that is a reactive method. To be more proactive, I want to compare the internal domain list to these:
https://gist.github.com/adamloving/4401361
https://github.com/ivolo/disposable-.../wildcard.json
https://gist.github.com/pakoito/adfc...ddress-domains
https://github.com/TicketeStartup/te...l-domains.json
https://www.mogelmail.de/# (bottom of page)
I can prepare a single text file, with all domains, one per line, from all available lists/sources.
Each list is a tad different, but Notepad++ can help there.
Some of the above lists are also no longer updated, so this will be a one-and-done on those lists.
More will be added as needed, and updated lists will be re-processed when the script is re-run.
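A rough sketch of that merge and de-dupe step in Python, assuming the combined file has one entry per line. The function names (normalize, merge_lists) are made up for illustration and are not from the existing script. It lowercases each entry, strips the leading "@" or "." used in our internal lists, strips stray quotes and commas left over from JSON-style lists, and drops anything that does not look like a domain.

```python
def normalize(line):
    """Return a bare lowercase domain from one raw list entry."""
    entry = line.strip().lower()
    # JSON-style lists leave quotes and trailing commas behind.
    entry = entry.strip('",\' ')
    # Internal lists prefix entries with "@" (domain) or "." (wildcard).
    return entry.lstrip("@.")


def merge_lists(lines):
    """De-dupe raw entries from all sources and return them sorted."""
    seen = set()
    for line in lines:
        domain = normalize(line)
        # Skip blanks and leftovers such as "etc" that contain no dot.
        if domain and "." in domain:
            seen.add(domain)
    return sorted(seen)


if __name__ == "__main__":
    raw = ["@0wnd.net", "0WND.NET ", '  "0wnd.org",', ".33mail.com", "", "etc"]
    for domain in merge_lists(raw):
        print(domain)
```

Reading the real file would just be merge_lists(open(path, encoding="utf-8")), where path is whatever the combined text file ends up being called.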
Now look at that first "adamloving" link, specifically these entries at/near the top:
Code:
0-mail.com
0815.ru
0clickemail.com
0wnd.net
0wnd.org
antichef.net
antispam.de
baxomale.ht.cx
It has worthless entries, specifically the .ru and .cx domains (0815.ru and baxomale.ht.cx). The script that processes these should remove/trash those entries during the de-dupe and sort (using the worthless TLD list at top; I'm assuming a back-to-front search will find those).
We tried to de-dupe in Excel and got nowhere: it was overly complex and version-specific.
And it still didn't address all the domains in the list that use worthless TLDs, and I found myself manually removing way too many entries.
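A minimal sketch of the back-to-front TLD check in Python, as one possible way to do it. BAD_TLDS here is only a small sample of the worthless TLD list at the top of this post, and the function names are hypothetical. It takes whatever follows the last dot and tests it against the set.

```python
# Sample only; the real set would be loaded from the full worthless-TLD list.
BAD_TLDS = {"ru", "cx", "tk", "xyz", "top"}


def has_bad_tld(domain, bad_tlds=BAD_TLDS):
    """True if the text after the last dot is a blocked TLD."""
    tld = domain.rsplit(".", 1)[-1].lower()
    return tld in bad_tlds


def drop_bad_tlds(domains, bad_tlds=BAD_TLDS):
    """Keep only the domains whose TLD is not blocked."""
    return [d for d in domains if not has_bad_tld(d, bad_tlds)]


if __name__ == "__main__":
    sample = [
        "0-mail.com", "0815.ru", "0clickemail.com", "0wnd.net",
        "0wnd.org", "antichef.net", "antispam.de", "baxomale.ht.cx",
    ]
    # 0815.ru and baxomale.ht.cx are removed; the rest pass through.
    print(drop_bad_tlds(sample))
```

Because only the final label is compared, baxomale.ht.cx is caught by the .cx entry no matter how many subdomain levels sit in front of it.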
With the sorted, de-duped, and bad-TLD-removed list, as output by the script, we can quickly add back the @ (or can the script do that in the output file?), and it's ready for a quick manual review. After review, it's ready for use in the internal blocklists.
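On adding the @ back in the output: that part should be easy for a script to do. A small Python sketch, assuming the cleaned domains are already in a list. The names sort_key and format_blocklist are made up for illustration. It sorts by TLD first, then by the domain in front of it, and prefixes each line with @.

```python
def sort_key(domain):
    """Compare labels right to left: TLD first, then the name before it."""
    return list(reversed(domain.split(".")))


def format_blocklist(domains):
    """Return the de-duped domains as '@domain' lines, sorted by TLD."""
    ordered = sorted(set(domains), key=sort_key)
    return "\n".join("@" + d for d in ordered)


if __name__ == "__main__":
    print(format_blocklist(["0wnd.org", "0wnd.net", "antispam.de", "0wnd.net"]))
```

The returned text can be written straight to the output file, in the same @domain format as the bad domains list above.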
The only possible sticking point is subdomains, which are wildcarded by the spammers (and thus wildcard blocked), so those should not appear in the sorted domains list. Manual review may be unavoidable, because a simple "two-dots" method won't work due to ccTLDs like domain.co.uk. Something that states: if length >3 for either part after a dot, then subdomain (spammer.spam.com, where "spam"=4); if <3, assume ccTLD (notspam.co.uk, as both =2). I can help program, I can follow the logic, but writing is a weakness.
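The length rule above could be sketched in Python roughly like this. It is only a heuristic under my stated assumptions, not a definitive test, and classify is a hypothetical name. It looks at the last two labels of any name with three or more labels. Cases the rule does not cover, such as a 3-letter label followed by a 2-letter one (vondata.com.ar), are flagged for manual review rather than guessed.

```python
def classify(domain):
    """Label a name as 'domain', 'subdomain', 'cctld', or 'review'."""
    labels = domain.lower().split(".")
    if len(labels) < 3:
        return "domain"  # plain name.tld, nothing to decide
    second, last = labels[-2], labels[-1]
    if len(second) > 3 or len(last) > 3:
        return "subdomain"  # e.g. spammer.spam.com, "spam" = 4
    if len(second) < 3 and len(last) < 3:
        # e.g. notspam.co.uk; one more label in front means a subdomain
        return "cctld" if len(labels) == 3 else "subdomain"
    return "review"  # rule is silent here, e.g. vondata.com.ar


if __name__ == "__main__":
    for name in ["spammer.spam.com", "notspam.co.uk",
                 "subdomain.spammer.co.uk", "vondata.com.ar", "0wnd.net"]:
        print(name, classify(name))
```

Anything that comes back as "review" or "subdomain" can be split into a separate file, so the main sorted list only holds plain domains.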
This is one method used to keep the forum free of spam and "anonymous"/disposable abuse.
On your nested question, it is realistically never more than 3 or 4.
- subdomain.spammer.com
- subdomain.spammer.co.uk
- the "adamloving" list has an entry for "ypmail.webarnak.fr.eu.org", but that's a weird outlier that you'll never see in practice. I'm not concerned about anything so large, as it gets into complex DNS that most spammers won't know how to do.
Does this all make more sense now?
I think what you've done so far is on the right track to zap both of our needs.
I need one of those bowing smileys.