Wouter from Brain Baking inspired me to check the liveness of the outbound links on this site.

I too based my calculations on _posts instead of _site, which makes for simpler link parsing and cleanup. It also excludes internal links to this site, since those are generated on each publish and are thus guaranteed to be up-to-date.

Technically they can still break, but all pages are rebuilt on every publish and the build log would show those as warnings. The exception is the legacy redirect aliases, but those were already covered by Google Search Console (formerly Webmaster Tools).

Show me the Details

_posts % grep -h -rE "https?:" . > ~/links.txt

That gives me 289 raw results from 116 posts (minus this one).

After removing the cruft before and after the links using .*(http.*)[\)>].* with $1 as the replacement, a few placeholder links were left, like in here, here or here.
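I did that cleanup in an editor, but the same replacement also works as a one-liner. A rough sed sketch of it, feeding the grep output into the links.txt the loops below read (that exact file juggling is my assumption):

# keep only the captured URL, same idea as the $1 replacement above
sed -E 's/.*(http.*)[)>].*/\1/' ~/links.txt > links.txt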

First a run without the follow-redirects option, to capture the redirect status codes themselves:

# HEAD request (-I) for each URL; discard the headers, print "<url>: <status code>"
for url in $(cat links.txt)
do
    curl -I -o /dev/null -w "%{url}: %{http_code}" "$url"
    echo "\n"
done > curl.txt
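To turn that file into a status code distribution later on, a quick tally over the last field is enough. A small sketch, assuming the "<url>: <code>" lines written by the -w option above:

# count how often each status code occurs, skipping the blank separator lines
awk -F': ' 'NF > 1 {print $NF}' curl.txt | sort | uniq -c | sort -rn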

While executing this, two domains weren’t even able to resolve or connect: www.saltstack.com, which they forgot to redirect after the VMware acquisition, and nitter.net, for obvious reasons.

And a second run, with redirects followed, to see which links actually show “something” when clicked:

# same loop, but follow redirects (-L) and record where each link ends up
for url in $(cat links.txt)
do
    curl -IL -o /dev/null -w "%{url} -> %{url_effective}: %{http_code}" "$url"
    echo "\n"
done > curl.txt
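The candidates for actual rot are then the lines whose final status code is not a 2xx. Another sketch on top of the same line format:

# list every link that does not end up at a 2xx after following redirects
grep -E ': [0-9]{3}$' curl.txt | grep -Ev ': 2[0-9]{2}$'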

Results

Of the 289 links, 276 remained after deduplication, and I got a response for 274 of them. On top of that, 7 requests came back as 405 (Method Not Allowed) because those servers disallow HEAD, but they worked after switching to GET:

[Bar graph showing the distribution of HTTP status codes: 2 unreachable, 161 working, 57 permanent redirects, 19 found elsewhere, 1 see other, 2 temporary redirects, 9 forbidden, 19 not found and 4 internal server errors]
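Switching those HEAD-hostile requests to GET is just a matter of dropping -I and throwing the body away; a minimal sketch for re-checking a single URL (the surrounding loop stays the same as above):

# plain GET instead of HEAD for servers that answer HEAD with 405; discard the body
curl -sL -o /dev/null -w "%{url} -> %{url_effective}: %{http_code}" "$url"
echo "\n"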

With redirects enabled this makes a total of 230 working links out of 274, leaving 44 broken ones: 44 / 274 ≈ 16.06% link rot.

I notice a lot of broken GitHub and Google Play Store links in there. Even if Manton Reece’s theory holds true, only the repositories and apps that are not explicitly unpublished remain there:

… of all the new web companies, there are only two that will last 100 years, still hosting our stuff at URLs that don’t change: GitHub and Automattic

But still far from the 38% to 66.5% mark.