A blog by Gary Bernhardt, Creator & Destroyer of Software

A Raw View into My Unix Hackery

27 Apr 2010

I just converted my blog from PyBlosxom to Jekyll. After the conversion, I wanted to make sure that no incoming links were broken. I recorded the following screencast of me checking.

This wasn't rehearsed; I didn't even think beforehand about how I'd solve the problem. I was also sort of run down while recording it; I'd been hacking away at this stuff for many hours by then. Still, it gives you a glimpse of what it looks like when I puzzle through a problem at the Unix command line, something that people seem to be interested in.

A Raw View into My Unix Hackery from Gary Bernhardt on Vimeo.

At the end, I come up with a pretty long command. I did some further tweaking after the screencast ended; the final command is below. Although I've formatted it on multiple lines here, keep in mind that this was written at the console without newlines or regard for quality or brevity, so it's not pretty. See the video to learn how such things are born.

cat logs/apache/access_blog* |
grep ' 200 ' |
grep -v '"http://blog.extracheese.org/[^"]' |
grep -v 'Feedfetcher-Google' |
grep -v 'Googlebot' |
grep -v 'my6sense' |
grep -v 'search.msn.com' |
grep -v 'scoutjet' |
grep -v 'betaBot' |
grep -v 'Yahoo! Slurp' |
grep -v 'aggregator:Spinn3r' |
grep -v 'FeedBurner' |
grep -v 'Planet Python' |
grep -v 'FeedBurner/1.0' |
grep -v 80legs.com |
grep -v Yandex |
grep -v '.NET CLR' |
grep -v 'seoprofiler' |
grep -v 'urdland' |
grep -v 'Speedy Spider' |
grep -v 'Ask Jeeves' |
cut -d '"' -f 2 |
cut -d ' ' -f 2 |
awk '{urls[$1]++} END {for (url in urls) print urls[url], url}' |
sort -nr |
grep -v 'index.php' |
grep -v 'widgetType' |
cut -d ' ' -f 2 |
cut -d '?' -f 1 |
while read url; do
    curl "http://localhost:4000$url" 2>&1 |
    grep -i '<h1>not found</h1>' > /dev/null;
    if [ $? -eq 0 ]; then
        echo $url;
    fi;
done |
grep -v '\.css$' |
grep -v '\.js$'
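
For comparison, here is a tidied sketch of the same idea, split into two small shell functions. It is not the command from the screencast: the function names `top_paths` and `find_missing` are made up for illustration, the bot filters are collapsed into a single extended regex, and paths are counted after the query string is stripped rather than before. It assumes Apache combined-format log lines on standard input and a local server whose 404 page contains the same "not found" heading as above.

```shell
# Bot and aggregator user-agent fragments, merged from the grep -v chain above.
BOTS='Feedfetcher-Google|Googlebot|my6sense|search\.msn\.com|scoutjet|betaBot|Yahoo! Slurp|aggregator:Spinn3r|FeedBurner|Planet Python|80legs\.com|Yandex|\.NET CLR|seoprofiler|urdland|Speedy Spider|Ask Jeeves'

# Read log lines on stdin; print requested paths, most requested first.
top_paths() {
    grep ' 200 ' |                                     # successful requests only
    grep -v '"http://blog.extracheese.org/[^"]' |      # drop clicks from within the blog itself
    grep -Ev "$BOTS" |                                 # drop known bots and feed fetchers
    cut -d '"' -f 2 |                                  # request line, e.g. GET /path HTTP/1.1
    cut -d ' ' -f 2 |                                  # just the path
    grep -Ev 'index\.php|widgetType' |                 # drop old PHP and widget URLs
    cut -d '?' -f 1 |                                  # strip the query string
    grep -Ev '\.(css|js)$' |                           # drop static assets
    sort | uniq -c | sort -nr |                        # count and rank
    awk '{print $2}'                                   # keep the path, drop the count
}

# Read paths on stdin; print those the server at base URL $1 reports as missing.
find_missing() {
    base=$1
    while read -r path; do
        if curl -s "$base$path" | grep -qi '<h1>not found</h1>'; then
            printf '%s\n' "$path"
        fi
    done
}
```

With those defined, the whole check would be something like `cat logs/apache/access_blog* | top_paths | find_missing http://localhost:4000`. Filtering the static assets before the `curl` loop, instead of after it as in the original, avoids fetching URLs whose result would be thrown away anyway.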

Happily, I only had to redirect the RSS feed URLs. Everything else that I actually cared about worked the first time!