Scrape the full link set from the site:
wget -r -l4 --spider -D blog.xargs.io http://blog.xargs.io
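If the spidered pages are not kept on disk, the crawled URLs can also be recovered from wget's own log, which prints each fetch on a line beginning with a timestamp. A hedged aside, not part of the original workflow (spidered_urls.log is a hypothetical filename):
wget -r -l4 --spider -D blog.xargs.io http://blog.xargs.io 2>&1 | \
grep '^--' | awk '{print $3}' | sort -u > spidered_urls.log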
Analyze the link set from the site:
tree -J -f blog.xargs.io | grep file | grep -o 'name.*' | \
awk -F":" '{print $2}' | tr -d '",}' | sort -u
Curl each link to see which ones currently work:
# Deal with multiple saved copies of same entry from wget
cat current_links.log | grep -v "\.[[:digit:]]*$" | \
sed 's/blog.xargs.io/http:\/\/blog.xargs.io/g' | \
parallel -- \
"curl -o /dev/null --silent --head --write-out '%{http_code} %{url_effective}\n' {}" | \
sort -u | tail -r > current_links_master.log
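Given the --write-out format, each line of current_links_master.log pairs a status code with the final URL, e.g. 200 http://blog.xargs.io/... (tail -r reverses the sorted list on BSD/macOS; tac is the GNU equivalent). A quick, optional way to eyeball the spread of status codes:
awk '{print $1}' current_links_master.log | sort | uniq -c | sort -rn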
Then run current_links_master.log through a processor to compare your staging site to your production site. Watch for those 404s and make sure your 302s look good.
# Strip the status-code column from the master log, point the URLs at the
# staging copy, and flag anything that is not a 200 or 302
cat current_links_master.log | awk '{print $2}' | \
sed 's/blog.xargs.io/localhost:5000/g' | \
parallel -- \
"curl -o /dev/null --silent --head --write-out '%{http_code} %{url_effective}\n' {}" | \
sort -u | tail -r | grep -Ev "^(200|302)"
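For a closer production-versus-staging comparison than a simple filter, a small processor can re-check each URL against staging and flag any status mismatch. A minimal bash sketch, assuming the master log format above (status code, then URL) and a staging server on localhost:5000:
while read -r prod_code url; do
  stage_code=$(curl -o /dev/null --silent --head --write-out '%{http_code}' \
    "${url/blog.xargs.io/localhost:5000}")
  # Flag anything where staging answers differently than production
  [ "$prod_code" != "$stage_code" ] && echo "MISMATCH prod=$prod_code stage=$stage_code $url"
done < current_links_master.log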
Credit for these scripts: