Help wanted: Comparison + de-dupe script/app?
This site needs your help. :old:
I need an "easy button" for some forum admin tasks. The manual methods are just too unwieldy.

The bounty:
I'm not necessarily asking for free help here -- though it would be nice. Perhaps you can return the favor for my past advice that helped you? If not, I can offer:
- free Premium Membership (with benefits growing after new site is launched, price increasing)
- discount on my hardware
- some money
- a favor, to be used at a later date, and that's never a bad thing to have with me
- some tape conversion work, either discounted or free (least preferred, but will do it begrudgingly)
- some combination of the above

So, with that out of the way, onto the help needed...

The reason:
Spam is a problem. Disposable email accounts are a problem. In recent years, lots of anti-spam and anti-disposable APIs have popped up. Most want a high monthly fee. Several are dubious in origin. Most are dubious in reliability. All seem to make wasteful API calls (and thus incur more charges) by verifying known-crap addresses and TLDs. Most/all of these services were born from updated lists found on GitHub. I wouldn't even be surprised if these pay services were using GitHub etc. as their backend data source.

I want to do better, and be less wasteful of both money and resources. I want to get multiple external lists, plus our own internal lists, merge/de-dupe them, then apply the result on a regular basis. We audit email signups at least once per month, and I cringe when I see that spam or disposables were allowed in.

The script(s?) wanted:
I want to be able to dump a list of domains. Example:
Code:
0-mail.com

That will leave .com and others. Another script (or the same script) can de-dupe, and then sort. I need two sorts available: (1) alphabetical, and (2) alphabetical by TLD. This way, bad TLDs can be seen more easily.

Subdomains and compound ccTLDs must be considered. So "spam.163.com" is alphabetical by "163" and its TLD is ".com". I'm flexible on this small area of the script sorting; something just needs to work easily here.

Full email lists are available for this, so you can test the sort with live data.

Questions? Post questions here, along with any clarifications needed. Then we can move to PM/email as needed. I really hope somebody(ies) here has the skills to do any/some/all of this. This is too far outside my skills.
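To make the two sorts concrete, here is a rough Python sketch of the de-dupe and sort steps described above. It is only an illustration under assumptions: the function names are made up for this example, and the small `COMPOUND_SUFFIXES` set is a hypothetical stand-in, nowhere near a complete list of compound ccTLDs.

```python
# Hypothetical, incomplete sample of compound ccTLDs (assumption for this sketch).
COMPOUND_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def split_domain(domain):
    """Return (name, suffix), e.g. 'spam.163.com' -> ('163', 'com')."""
    labels = domain.strip().lower().strip(".").split(".")
    # Treat the last two labels as the suffix only for known compound ccTLDs.
    is_compound = len(labels) > 2 and ".".join(labels[-2:]) in COMPOUND_SUFFIXES
    n = 2 if is_compound else 1
    suffix = ".".join(labels[-n:])
    # The label just left of the suffix is the name to sort by.
    name = labels[-n - 1] if len(labels) > n else ""
    return name, suffix


def dedupe(domains):
    """Return the unique, lower-cased, non-empty entries as a set."""
    return {d.strip().lower() for d in domains if d.strip()}


def sort_alphabetical(domains):
    """Sort by domain name (ignoring subdomains), then by the full entry."""
    return sorted(domains, key=lambda d: (split_domain(d)[0], d))


def sort_by_tld(domains):
    """Sort by suffix first, then by domain name, then by the full entry."""
    return sorted(domains, key=lambda d: (split_domain(d)[1], split_domain(d)[0], d))
```

With this sketch, "spam.163.com" sorts under "163" and "domain.co.uk" sorts under "domain", with ".co.uk" grouped separately from ".com" in the TLD sort. Any compound ccTLD missing from the sample set would be sorted by its last label only.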
I spent a couple of hours on a quick Python script that takes a text file as input, with one domain entry per line, and generates two text files as output. Here's a sample of the test input and output I did, to see if this is what you are envisioning.
Input file
Code:
www.alpha.com
Code:
alien.io
Code:
www.beta.cc

Assuming this is on the right track so far, it seems you also want to give it a second input file that looks like this:
Code:
.xyz

Would this input file only contain entries with a single dot in them, i.e. ".ru", but not ".domain.ru"? If not, what would be the maximum number of "nesting" layers?

If output files for 'bad' domains are wanted, do they need to be added to existing output files? Or would they also be generated from scratch each time, like the 'good' output files in the example mentioned above?
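For what it's worth, the filtering step against that second input file could look roughly like the sketch below. This is not the actual script, just an illustration: the function names are hypothetical, and it assumes the file holds one suffix per line in the ".xyz" style. It checks every trailing run of labels, so nested entries like ".domain.ru" would also match if they ever appear in the list.

```python
def load_bad_suffixes(lines):
    """Normalise entries such as '.xyz' to 'xyz' and skip blank lines."""
    suffixes = set()
    for line in lines:
        entry = line.strip().lower().strip(".")
        if entry:
            suffixes.add(entry)
    return suffixes


def is_blocked(domain, bad_suffixes):
    """True if the domain equals a bad suffix or ends with '.' + that suffix."""
    labels = domain.strip().lower().rstrip(".").split(".")
    # Compare whole labels only, so 'notxyz.com' does not match 'xyz'.
    return any(".".join(labels[i:]) in bad_suffixes for i in range(len(labels)))


def filter_domains(domains, bad_suffixes):
    """Split the domains into (kept, dropped) lists, preserving input order."""
    kept, dropped = [], []
    for domain in domains:
        (dropped if is_blocked(domain, bad_suffixes) else kept).append(domain)
    return kept, dropped
```

Returning the dropped entries separately leaves both options open: they could be written to a fresh 'bad' output file on each run, or appended to an existing one.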
The script you've worked on above looks like it might also partly do what is wanted here: http://www.digitalfaq.com/forum/news...ase-email.html
- sort emails by TLD
- sort by FQDN (domain, not subdomain)
- sort by TLD
- main missing aspect is filtering out freemails (gmail, hotmail, etc)

Happy butterflies in stomach, elated anticipation there, for that one. :)

For the spam/disposable de-dupe, this is what is needed, why, where the data comes from, and how it will be used.

These are our current internal lists, which block forum registrations:

Worthless TLDs, 99-100% used by spam or abuse.
Code:
.accountant
Code:
.10mail.org
Code:
@0clickemail.com

https://gist.github.com/adamloving/4401361
https://github.com/ivolo/disposable-.../wildcard.json
https://gist.github.com/pakoito/adfc...ddress-domains
https://github.com/TicketeStartup/te...l-domains.json
https://www.mogelmail.de/# (bottom of page)

I can prepare a single text file, with all domains, one per line, from all available lists/sources. Each list is a tad different, but Notepad++ can help there. Some of the above lists are also no longer updated, so this will be a one-and-done on those lists. More will be added as needed, and updated lists re-processed when re-running the script.

Now look at that first "adamloving" link, specifically these entries at/near the top:
Code:
0-mail.com

We tried to de-dupe in Excel, and got nowhere; it was overly complex and version specific. And it still didn't address all the worthless-TLD domains in the list, and I found myself manually removing way too many entries.

With the sorted, de-duped, and bad-TLD-removed list, as output by the script, we can quickly add back the @ (or can the script do that in the output file?), and it's ready for a quick manual review. After review, it's ready for use in the internal blocklists.

The only possible sticking point is subdomains, which are wildcarded by the spammers (and thus wildcard blocked), so those should not appear in the sorted domains list. Manual review may be unavoidable, because a simple "two-dots" method won't work due to ccTLDs like domain.co.uk. Maybe something that states: if the length is >3 for either part, then it's a subdomain (spammer.spam.com, where "spam"=4); if <3, assume a ccTLD (notspam.co.uk, as both =2).

I can help program, I can follow the logic, but writing is a weakness.

This is one method used to keep the forum free of spam and "anonymous"/disposable abuse.

On your nested question, it is realistically never more than 3 or 4.
- subdomain.spammer.com
- subdomain.spammer.co.uk
- the "adamloving" list has an entry for "ypmail.webarnak.fr.eu.org", but that's a weird outlier that you'll never see in practice. I'm not concerned about anything so large, as it gets into complex DNS that most spammers won't know how to do.

Does this all make more sense now? I think what you've done so far is on the right track to zap both of our needs. :D

I need one of those bowing smileys.
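To show the length rule of thumb above in code form, here is a rough Python sketch, with loud assumptions. The function names are hypothetical. It assumes subdomains should be collapsed to their parent domain rather than listed. It also assumes a label of exactly 3 characters counts as "not short", since the rule above only covers >3 and <3. The script adds the @ back on output.

```python
def normalize(entry):
    """Lower-case the entry and strip the '@' or '.' prefix used in the lists."""
    return entry.strip().lower().lstrip("@.").rstrip(".")


def registrable_domain(domain):
    """Collapse subdomains using the length rule of thumb (not a real suffix list)."""
    labels = domain.split(".")
    if len(labels) <= 2:
        return domain
    # Two short trailing labels (both under 3 chars) are assumed to be a
    # compound ccTLD such as co.uk, so keep three labels.
    if len(labels[-1]) < 3 and len(labels[-2]) < 3:
        return ".".join(labels[-3:])
    # Otherwise everything left of the last two labels is treated as subdomain.
    return ".".join(labels[-2:])


def build_blocklist(entries):
    """De-dupe, collapse subdomains, sort, and add the leading '@' back."""
    domains = set()
    for entry in entries:
        cleaned = normalize(entry)
        if cleaned:
            domains.add(registrable_domain(cleaned))
    return ["@" + d for d in sorted(domains)]
```

This only follows the heuristic as stated, so it will get some cases wrong. A suffix like com.au (3 and 2 characters) falls outside the "both under 3" test and would be collapsed incorrectly, so the manual review step is still needed.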
Site design, images and content © 2002-2024 The Digital FAQ, www.digitalFAQ.com