Help wanted: Comparison + de-dupe script/app? - digitalFAQ Forum
08-02-2020, 07:58 AM
lordsmurf
Site Staff | Video
Join Date: Dec 2002
Posts: 9,466
Thanked 1,573 Times in 1,373 Posts
This site needs your help.

I need an "easy button" for some forum admin tasks. The manual methods are just too unwieldy.

The bounty:

I'm not necessarily asking for free help here -- though it would be nice. Perhaps you can return the favor for my past advice that helped you?

If not, I can offer:
- free Premium Membership (with benefits growing after new site is launched, price increasing)
- discount on my hardware
- some money
- a favor, to be used at a later date, and that's never a bad thing to have with me
- some tape conversion work, either discounted or free (least preferred, but will do it begrudgingly)
- some combination of the above

So, with that out of the way, onto the help needed...

The reason:

Spam is a problem.
Disposable email accounts are a problem.

In recent years, lots of anti-spam and anti-disposable APIs have popped up. Most want a high monthly fee. Several are dubious in origin. Most are dubious in reliability. All seem to make wasteful API calls (and thus incur more charges) by verifying known-crap addresses and TLDs. Most/all of these services were born from updated lists found on GitHub. I wouldn't even be surprised if these pay services were using GitHub/etc as their backend data source.

I want to do better, and be less wasteful of both money and resources.

I want to get multiple external lists, our own internal lists, merge/de-dupe, then implement on a regular basis. We audit email signups at least once per month, and I cringe when I see that spam or disposables were allowed in.

The script(s?) wanted:

I want to be able to dump a list of domains.
Dump those into a script. It will auto-remove any domains with TLDs that have been blocked wholesale. For example, .ru and .xyz are fully blocked here, because we have near-zero legitimate signups from those TLDs.
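A minimal sketch of that TLD filter, assuming one domain per line of input. The names `BLOCKED_TLDS` and `drop_blocked_tlds` are made up for illustration, and the blocked set holds only the two TLDs named in this post.

```python
# Hypothetical blocklist: only the two TLDs mentioned in the post.
BLOCKED_TLDS = {"ru", "xyz"}

def drop_blocked_tlds(domains, blocked=BLOCKED_TLDS):
    """Return the domains whose final label is not a blocked TLD."""
    kept = []
    for raw in domains:
        domain = raw.strip().lower().rstrip(".")
        if not domain:
            continue  # skip blank lines
        tld = domain.rsplit(".", 1)[-1]  # text after the last dot
        if tld not in blocked:
            kept.append(domain)
    return kept

sample = ["Example.com", "spam.ru", "junk.xyz", "", "mail.example.org"]
print(drop_blocked_tlds(sample))
```

Reading the TLD from the right-hand end means subdomains do not matter at this stage: anything ending in a blocked TLD is dropped, however many labels come before it.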

That will leave .com and others.

Another script (or the same script) can de-dupe, and then sort.

I need two sorts available:
(1) alphabetical, and
(2) alphabetical by TLD. This way, bad TLDs can be seen more easily.

Subdomains and compound ccTLDs must be considered. So "spam.163.com" is alphabetical by "163" and its TLD is ".com". I'm flexible on this small area of the script sorting; something just needs to work easily here.
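The de-dupe and the two sort orders could look roughly like the sketch below. It assumes a small hand-maintained set of compound suffixes (`COMPOUND_SUFFIXES`, incomplete and purely illustrative); all function names are hypothetical.

```python
# Assumed, incomplete set of compound country-code suffixes (illustration only).
COMPOUND_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp", "com.br"}

def split_domain(domain):
    """Split into (subdomain, name, suffix), e.g. ('spam', '163', 'com')."""
    labels = domain.lower().strip(".").split(".")
    # Use a two-label suffix only when it is in the known compound set.
    if len(labels) >= 3 and ".".join(labels[-2:]) in COMPOUND_SUFFIXES:
        suffix_len = 2
    else:
        suffix_len = 1
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]
    name = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, name, suffix

def alpha_key(domain):
    """Sort key for order (1): by domain name, ignoring any subdomain."""
    sub, name, suffix = split_domain(domain)
    return (name, suffix, sub)

def tld_key(domain):
    """Sort key for order (2): by TLD first, then by domain name."""
    sub, name, suffix = split_domain(domain)
    return (suffix, name, sub)

def dedupe_and_sort(domains):
    """Return (alphabetical list, TLD-grouped list) with duplicates removed."""
    unique = {d.strip().lower() for d in domains if d.strip()}
    return sorted(unique, key=alpha_key), sorted(unique, key=tld_key)

sample = ["spam.163.com", "zeta.co.uk", "alpha.net", "Alpha.net", "beta.com"]
by_name, by_tld = dedupe_and_sort(sample)
print(by_name)
print(by_tld)
```

Lowercasing before the de-dupe makes "Alpha.net" and "alpha.net" count as one entry. A suffix that is missing from the set simply falls back to the last label as the TLD.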

Full email lists are available for this, so you can test sort with live data.


Post questions here, any clarifications needed. Then we can move to PM/email as needed.

I really hope somebody(ies) here has the skills to do any/some/all of this. This is too far outside my skills.

- Did my advice help you? Then become a Premium Member and support this site.
- Find television shows, cartoons, DVDs and Blu-ray releases at the TVPast forums.
08-04-2020, 09:52 PM
keaton
Premium Member
Join Date: Jan 2017
Posts: 113
Thanked 45 Times in 32 Posts
I spent a couple of hours writing a quick Python script that takes a text file as input, with one domain entry per line, and generates two text files as output. Here's a sample of the test input and output I used, to see if this is what you are envisioning.

Input file
Output - Alpha sorted
Output - TLD sorted
Currently, these two output files are generated from scratch each time.

Assuming this is on the right track so far, it seems you also want to give it a second input file that looks like this
This input file would be used to prune the list of domains before they are output, as shown above. Another output file could be added which would list all the domains that were pruned during this process. I suppose those could also be output as two separate files using the same sort method described above for the 'good' domains, if desired.

Would this input file only contain entries with a single dot in them, i.e. ".ru", but not ".domain.ru"? If not, how many layers of "nesting" would be the maximum?

If output files for 'bad' domains are wanted, do they need to be appended to existing output files? Or would they also be generated from scratch each time, like the 'good' output files mentioned above?
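For illustration, the pruning step and the nesting question could be handled as sketched below. It assumes each prune entry is a dot-prefixed suffix such as ".ru" or ".domain.ru"; `matches_suffix` and `prune` are hypothetical names, not the script described in this post.

```python
def matches_suffix(domain, entry):
    """True if domain equals the entry or ends with it on a label boundary."""
    entry = entry.strip().lower().lstrip(".")
    domain = domain.strip().lower()
    if not entry:
        return False  # ignore empty prune entries
    return domain == entry or domain.endswith("." + entry)

def prune(domains, entries):
    """Split domains into (kept, pruned) according to the prune entries."""
    kept, pruned = [], []
    for domain in domains:
        hit = any(matches_suffix(domain, entry) for entry in entries)
        (pruned if hit else kept).append(domain)
    return kept, pruned

domains = ["a.ru", "x.domain.ru", "notdomain.ru", "domain.ru.com"]
kept, pruned = prune(domains, [".domain.ru"])
print(kept)
print(pruned)
```

Matching on a label boundary means the depth of nesting in an entry does not matter, and "notdomain.ru" is not caught by ".domain.ru". The `pruned` list is what a separate 'bad' output file would contain.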
08-05-2020, 09:17 AM
lordsmurf
The script you've worked on above looks like it might somewhat do what is wanted here: Help wanted: Database email export sorting script/app?
- sort emails by TLD
- sort by FQDN (domain, not subdomain)
- sort by TLD
- main missing aspect is filter out freemails (gmail, hotmail, etc)
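A rough sketch of the freemail filter combined with a TLD sort, assuming a list of full email addresses. `FREEMAIL_DOMAINS` is a stand-in containing only the two providers named above, and the function names are hypothetical.

```python
# Assumed, incomplete freemail list: only the providers named in the post.
FREEMAIL_DOMAINS = {"gmail.com", "hotmail.com"}

def domain_of(email):
    """Return the lowercase part after the last '@'."""
    return email.rsplit("@", 1)[-1].strip().lower()

def non_freemail_sorted(emails, freemail=FREEMAIL_DOMAINS):
    """Drop freemail addresses, de-dupe, then sort by TLD and domain."""
    kept = {
        e.strip().lower()
        for e in emails
        if "@" in e and domain_of(e) not in freemail
    }

    def key(email):
        domain = domain_of(email)
        tld = domain.rsplit(".", 1)[-1]
        return (tld, domain, email)

    return sorted(kept, key=key)

emails = ["bob@gmail.com", "amy@example.org", "Zed@Example.com", "amy@example.org"]
print(non_freemail_sorted(emails))
```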

Happy butterflies in stomach, elated anticipation there, for that one.

For the spam/disposable de-dupe, this is what is needed, why, where the data comes from, and how it will be used:

These are our current internal lists, which block forum registrations:

Worthless TLDs, 99-100% used by spam or abuse.
Known subdomain use, bad domains. These are wildcarded entries:
Bad domains:
We audit registrations, and update this list as needed. But that is a reactive method. To be more proactive, I want to compare the internal domain list to these:

https://www.mogelmail.de/# (bottom of page)

I can prepare a single text file, with all domains, one per line, from all available lists/sources.
Each list is a tad different, but Notepad++ can help there.
Some of the above lists are also no longer updated, so this will be a one-and-done on those lists.
More will be added as needed, updated lists re-processed when re-running the script.
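As a sketch of how differently formatted lists might be merged into one file, the snippet below normalizes each line before de-duping. The formats it handles (a leading "@", a "*." wildcard prefix, "#" comments) are assumptions about what the source lists contain, and `normalize` and `merge_lists` are hypothetical names.

```python
def normalize(line):
    """Reduce one raw list entry to a bare lowercase domain, or '' to skip it."""
    text = line.split("#", 1)[0].strip().lower()  # drop '#' comments (assumed style)
    if "@" in text:
        text = text.rsplit("@", 1)[-1]  # keep only the part after '@'
    while text.startswith(("*", ".")):
        text = text[1:]  # strip wildcard or leading-dot prefixes
    return text.rstrip(".")

def merge_lists(*sources):
    """Merge any number of iterables of lines into one sorted, de-duped list."""
    merged = set()
    for source in sources:
        for line in source:
            domain = normalize(line)
            if domain:
                merged.add(domain)
    return sorted(merged)

internal = ["@Spam.com", "*.junk.net", "# comment", ""]
external = ["spam.com\n", "user@mail.example.org"]
print(merge_lists(internal, external))
```

Each source can be any iterable of lines, so open text files can be passed in directly once the lists are saved locally.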

Now look at that first "adamloving" link, specifically these entries at/near the top:
It has worthless entries. Specifically the .RU and .CX domains. The script that processes these should remove/trash those entries on the de-dupe and sort (using the worthless TLD list at top, and I'm assuming a back-to-front search algorithm will find those).

We tried to de-dupe in Excel, and got nowhere: it was overly complex and version-specific.
And it still didn't address all the worthless TLD-using domains in the list, and I found myself manually removing way too many entries.

With the sorted, de-duped, and bad-TLD-removed list, as output by the script, we can quickly add back the @ (or can the script do that in output file?), and it's ready for a quick manual review. After review, ready for use in the internal blocklists.
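Adding the @ back at output time is a small step. A minimal sketch, assuming the script already holds the final sorted list; `write_with_at` is a hypothetical name, and an in-memory buffer stands in for the output file.

```python
import io

def write_with_at(domains, out):
    """Write one '@domain' entry per line to a text stream."""
    for domain in domains:
        out.write("@" + domain.lstrip("@") + "\n")

buffer = io.StringIO()  # stands in for an opened output file
write_with_at(["163.com", "@spam.co.uk"], buffer)
print(buffer.getvalue(), end="")
```

Stripping any existing "@" first keeps entries from ending up with a doubled prefix.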

The only possible sticking point is subdomains, which are wildcarded by the spammers (and thus wildcard blocked), so those should not appear in the sorted domains list. Manual review may be unavoidable, because a simple "two-dots" method won't work due to ccTLDs like domain.co.uk. Something that states: if length >3 for either part after a dot, then it's a subdomain (spammer.spam.com, where "spam"=4); if <3, assume ccTLD (notspam.co.uk, as both =2). I can help program and I can follow the logic, but writing code is a weakness.
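One way to sketch that label-length idea is below. It is a variant, not the exact rule: the description above leaves a length of exactly 3 undefined, so this version assumes a compound ccTLD only when the final label has 2 characters and the label before it has 3 or fewer. `split_by_label_length` is a hypothetical name.

```python
def split_by_label_length(domain):
    """Return (subdomain, registered_domain) using a label-length guess."""
    labels = domain.strip().lower().strip(".").split(".")
    if len(labels) <= 2:
        return "", ".".join(labels)  # nothing that could be a subdomain
    # Assumption: a 2-letter final label preceded by a label of 3 or fewer
    # characters (co.uk, com.au) is treated as a compound ccTLD.
    compound = len(labels[-1]) == 2 and len(labels[-2]) <= 3
    keep = 3 if compound else 2
    return ".".join(labels[:-keep]), ".".join(labels[-keep:])

for name in ["spammer.spam.com", "notspam.co.uk", "subdomain.spammer.co.uk", "spam.163.com"]:
    print(name, split_by_label_length(name))
```

This is only a guess based on lengths, so it will misread short names under a two-letter TLD (for example, "mail.abc.de" is treated as having no subdomain). Manual review, or a maintained list of real compound suffixes, would still be needed for those cases.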

This is one method used to keep the forum free of spam and "anonymous"/disposable abuse.

On your nested question, it is realistically never more than 3 or 4.
- subdomain.spammer.com
- subdomain.spammer.co.uk
- the "adamloving" list has an entry for "ypmail.webarnak.fr.eu.org", but that's a weird outlier that you'll never see in practice. I'm not concerned about anything so large, as it gets into complex DNS that most spammers won't know how to do.

Does this all make more sense now?

I think what you've done so far is on the right track to zap both of our needs.
I need one of those bowing smileys.

All times are GMT -5.