kaashif's blog

Programming, with some mathematics on the side

Making a list of the websites of people on nixers.net

2016-05-05

I wanted to make a list of the websites of the people on http://nixers.net, and I decided to do it not by asking people what their sites were called, but by scraping the forum.

I didn't scrape the whole forum, just one topic that I created a few years ago: https://nixers.net/showthread.php?tid=1547

There really isn't much to it: all I had to do was fetch all of the pages of the thread, go through them to retrieve all of the URLs, then do some filtering.

Getting the pages wasn't very difficult. I just used my good old friend cURL:

$ curl "https://nixers.net/showthread.php?tid=1547" > 1.html
$ for i in {2..8}; do curl "https://nixers.net/showthread.php?tid=1547&page=$i" > ${i}.html; done

So now I have all of the pages in the current directory. All I need to do now is some text processing. As always, the first tool in my toolbox when I need to do complicated string matching involving regexes is Perl. Sure enough, there is a Perl module on CPAN for this. It even comes with a script to make running it super easy.

The module is URI::Find and, after installing it from CPAN, the script it provides is urifind. The documentation for the script can be found here.

What I need to do is find all the URLs, remove all of the ones that aren't personal sites, remove all of the duplicates, then store the result in a file. Psh, no problem.

$ urifind -n *.html | \
grep -vE 'nixers.net|github|imgur|openbsd|tumblr' | \
sed -e 's/https/http/' -e 's,/$,,' -e 's,http://,,' -e 's,/.*$,,' | \
sort | uniq > sites

I also snuck in a sed command that strips the scheme, trailing slashes and all of the URI path stuff, leaving us with just the domain names, which is all we really want, anyway.
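To see what the filtering and the sed expressions actually do, here is a small sketch that runs the same grep, sed and sort steps on a few made-up input lines instead of the real pages. The sample URLs (including the paths) and the function name clean are just for illustration, and I'm assuming the input has one URL per line, which is how I'd expect urifind -n to print them.

```shell
# Hypothetical sample input: one URL per line.
sample='https://kaashif.co.uk/about/
http://xero.nu
https://github.com/someone/dotfiles
http://xero.nu/'

# Same steps as the pipeline above, wrapped in a function (the name is made up).
clean() {
  # Drop URLs that are obviously not personal sites.
  grep -vE 'nixers.net|github|imgur|openbsd|tumblr' |
    # https -> http, strip a trailing slash, strip the scheme, strip the path.
    sed -e 's/https/http/' -e 's,/$,,' -e 's,http://,,' -e 's,/.*$,,' |
    # Remove duplicates.
    sort | uniq
}

printf '%s\n' "$sample" | clean
# prints:
# kaashif.co.uk
# xero.nu
```

The github line is dropped by grep, and the two xero.nu variants collapse into one once the trailing slash is gone.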

This gave me a file that I then went through to remove anything the command didn't get rid of. The list I got is reproduced faithfully below:

albertocg.com
andrew.harrison.nu
arcetera.moe
arcetera.party
b4dtr1p.tk
blog.neeasade.net
blog.xero.nu
bugsofberk.net
charliethe.ninja
code.xero.nu
elliottpardee.me
eyenx.ch
fontvir.us
git.b4dtr1p.tk
icetimux.com
jona.io
josm.xyz
kaashif.co.uk
literallyryan.weebly.com
lugm.org
neeasa.de
neeasade.net
nullball.nu
pluviophile.xyz
ports.brianctomlinson.com
pub.iotek.org
punkweb.co
purestench.blogspot.com
qoob.nu
quitter.se
redpanduh.com
rocx.rocks
s0lll0s.me
stenchforums.net
strangequark.tk
thevypr.com
u2620.net
venam.1.ai
wildefyr.net
www.brianctomlinson.com
www.dafont.com
www.letterheadfonts.com
www.unixcri.me
xcelq.org
xero.nu
xero.owns.us

So there you have it, a list of websites you might want to check out. I'll also put this on my about page, in case this post gets buried (by me, in the future).