“Some Statistics” about sizeof(cat)’s feed list

Now, if only there was someone willing to generate some statistics on the contents of the OPML file, like how many Wordpress or Hugo websites, author nationality, what web server software, how many are behind Cancerflare, etc.

In direct response to the Personal feedlist and update post

This took me about 4 hours. The easy part is fingerprinting servers. The hard part is cleaning the resulting data which is always 95% garbage.

This was written for sizeof(cat)’s “guest posting”. It is mirrored to 0x19 for multitirectional trafic flow. Thanks for hosting! Link to the original post

0x19# cat $(which cat) | wc -c 
  149856
0x19# ls -lh /bin/cat
-r-xr-xr-x  1 root  bin  149856 Mar 20 15:16 /bin/cat

Using Lainchan Statistics Script

Refer to my post about the various server software used by the members of the Lainchan Webring for more information.

1. sed, grep, awk the sizeofcat.opml file to get a newline delimited list of urls

Baby’s first shell spaghetti. We are better than copy+pasting.

2. Grab http headers with a modified version of my previous lainchan script

The modified version of the script grabs http headers and checks to see if the http server is sending the Server: header. It outputs .csv and looks like this:

<?php

$f = file_get_contents("sites.txt");

$sites = array_filter(explode("\n", $f));
printf("\"URL\",\"Status\",\"Server\",\"Redirect\"\n");

foreach($sites as $s){
    $status = null;
    $server = null;
    $location = null;

    $h = get_headers($s);

    if(!$h || count($h) === 0){
        $status = "offline";
    } else {
        for($i=0; $i<count($h); $i++){
            if(str_starts_with($h[$i], "HTTP")){
                $status = $h[$i];
            }
            if(str_starts_with($h[$i], "Server:")){
                $server = $h[$i];
            }
            if(str_starts_with($h[$i], "Location:")){
                $location = $h[$i];
            }

        }
    }

    printf("\"%s\",\"%s\",\"%s\",\"%s\"\n", $s, $status, $server, $location);
}


?>

3. Process the csv output from the php script

I could have done more in php but I would rather do it in shell. It’s faster for me to think in shell.

0x19# cat http-headers | tail -n +2 | wc -l
    134
0x19# cat http-headers | tail -n +2 | awk -F',' '{print $3;}' | sed -e s/\"//g -e s/Server:\ //g -e s/[[:digit:]].*$//g | tr -d '/' | sort | uniq -c | sort -r
  34 nginx
  24 cloudflare
  22 Apache
   8 GitHub.com
   5 Netlify
   4 openresty
   3 SucuriCloudproxy
   3 OpenBSD httpd
   3 Caddy
   2 Squarespace
   2 AmazonS3
   1 nginx-rc
   1 neocities
   1 lighttpd
   1 lantern
   1 Vercel
   1 Pepyaka
   1 Microsoft-IIS
   1 GSE
   1 CloudFront
   1 BunnyCDN-SIL

There are 14 “unknown” servers left over. You might be thinking, “hey, this is pretty good already” but not quite. Some of the URLS actually get redirected.

4. Realize http headers do not provide the enough of the information syscall(cat) wanted, goto 1;

New process

I initially thought I could use nmap and it’s OS detection to figure out what operating systems the servers are running. This information is more interesting to me than (^-.-^) because he did not explicitly state that he wanted to know the server operating systems, only the server web stacks. Typically, running nmap against systems you don’t own is frowned upon and considered a declaration of war.

I used chose to use whatweb because it does not probe every port. Instead, it looks what the http server advertises, the contents of html files, and how the http daemon responds.

1. Scan

I told whatweb to output using the sql format and loaded it into mariadb so it would be somewhat parsable.

0x19# whatweb --log-sql-create=dbinit.sql
0x19# whatweb -U="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0" --input-file=sites.txt --log-sql=out.sql
0x19# mysql < dbinit.sql
0x19# mysql < out.sql

2. Process, Clean, Process again

Neither of the sql files loaded into the database without manual modifications. I processed the records in database using a different PHP script that I will not include here because it only gets you halfway to something useful. I did the remainder of the processing using various shell commands to transform garbage into a series of newline delimited key:value pairs. The key is the name of the whatweb module used, the value is the string+version+os fields from the database. I dumped this into a text file which I have included. Emails and IP addresses have been removed.

the text file with data in the key:value format

Using further combinations of grep, awk, cut, sed, sort, and uniq, I tabulated the data into something that I could copy and paste into libreoffice calc. Although it would be fun to do it in R or Gnuplot, spreadsheets are designed to be used by tech illiterate office drones. I was feeling braindead after manipulating garbage so I am no better.

Analysis

These numbers are slightly different than the numbers from the first section because redirects were followed. The total number of servers is 140 (certain http to https redirects and path redirects are logged as if they are separate servers for some reason).

Servers by the country they are hosted in:

The RESERVED servers all had the country code of ZZ which is often used to denote Unknown or unspecified country. The Unknown servers did not present a country code.

Servers by Country

    72 UNITED STATES
    17 RESERVED
    6  GERMANY
    4  UNITED KINGDOM
    4  NETHERLANDS
    4  GREECE
    4  CANADA
    3  FRANCE
    2  SPAIN
    2  EUROPEAN UNION
    1  UKRAINE
    1  SWITZERLAND
    1  SWEDEN
    1  SLOVAKIA (Slovak Republic)
    1  RUSSIAN FEDERATION
    1  POLAND
    1  NORWAY
    1  IRAN (ISLAMIC REPUBLIC OF)
    1  DENMARK
    1  CZECH REPUBLIC
    1  AUSTRIA
    1  AUSTRALIA
    10 Unknown

Servers by the httpd they run

The servers reported different versions or slightly mangled names. I condensed them down into something without version numbers. Cloudflare and related CDNs will usually present and advertise themselves through the Server: http header, pretending to be a real httpd daemon and not software as a service in search of a solution.

Servers by HTTPD

    42 nginx
    24 cloudflare
    23 Apache
    8  GitHub.com
    6  LiteSpeed
    5  Netlify
    4  openresty
    3  Sucuri/Cloudproxy
    3  OpenBSD httpd
    3  Caddy
    3  AmazonS3
    2  Squarespace
    1  neocities
    1  lighttpd
    1  lantern
    1  Vercel
    1  Pepyaka
    1  Microsoft-IIS
    1  GSE
    7  Unknown

Servers by CMS/SSG stack:

Again, all of the records say something slightly different. Sites that reported a wordpress theme as their CMS got condensed into general ‘wordpress’. Hard and soft forks of wordpress were not condensed into general ‘wordpress’. Hugo, Jekyll, and PHP all had a version number attached which I truncated. Some websites reported that they are powered by a random string or character. I left these alone because it is somewhat humorous to realize that (^-.-^) subscribes to not only a wix website but also multiple webservers where the admin actually changed the “powered by” http headers and HTML metadata to garbage.

52 servers have an unknown CMS/SSG which can either mean “the admin turned off the skiddie bait” or that the software is custom. Additionally, server reporting just PHP are either running something custom or leaking php versions because the badmin didn’t rtfm.

servers by stack

    25 Hugo
    25 WordPress
    4  PHP
    5  Jekyll
    3  Hexo
    3  Ghost
    2  blogger
    1  werc
    1  two
    1  the
    1  fossil
    1  debian,dr
    1  a
    1  Write.as
    1  WooCommerce
    1  Wix.com Website Builder
    1  W
    1  Quartz
    1  Org Mode
    1  Newspack
    1  Nefelibata
    1  Love
    1  HTML Tidy for HTML
    1  Express
    1  Elementor
    1  Coffee
    1  Buttondown
    1  AGA
    52 Unknown

JavaScript usage

If a <script> tag was detected on a website, it was counted. This is probably not very accurate but it was interesting nonetheless.

script tags detected

    64 JS
    76 No JS

Servers reporting their OS (public shaming section)

Most servers did not announce their OS. Either this means good defaults or good admins.

Debianoid admins will never learn.

Servers that reported their OS

    8   Debian Linux
    6   Ubuntu Linux
    2   Unix
    1   Linux
    123 Good Sysadmin/defaults