Fast website link checker in Go

jchw · on April 24, 2018

I find the use of rake to be kind of unorthodox, and yet I don't know what else you'd use in the Go world, other than maybe just Makefiles. Any particular reason to choose Rake? It's probably not easy to get it running on Windows based on my experience playing with Rails on Windows.

Other than that it looks quite useful, and it's definitely something to keep in the tool belt. Bonus points for the subtle Undertale references too :)

beefsack · on April 24, 2018

When I've been working in Go projects which required external helpers of some sort, I've always just written them as separate Go binaries because they're relatively simple and quick to get up and running.

I've started doing this for my Rust projects too because Rust is becoming a really nice language for writing these sorts of things, and having a separate binary as a Rust file in src/bin is convenient.

Often you want to use some logic from the main application in your tasks and using the same language streamlines that.

mjk7841 · on April 24, 2018

I'm also curious about this. It seems that many go developers out there are using Makefiles. Makefiles are a good solution for golang projects in some cases, but I've seen a lot of people really abusing Makefiles and trying to use them for more generic task running.

In a past life, we used invoke [1] for task running. It was incredible but has the same problem as rake: it introduces another language (Python) and more dependencies.

There's a fairly new task runner being developed in go called mage [2], but it didn't seem worth the jump yet to me as it's still pretty immature (I haven't played with it in a few months, though). Did you consider trying that out?

[1] https://github.com/pyinvoke/invoke [2] https://github.com/magefile/mage

anonfunction · on April 24, 2018

I like using make even for basic task running. It’s usually installed where I’m working and is familiar. I don’t want to install rake or mage.

arbie · on April 24, 2018

Make gets a bad rap but I've seen it being used for susbtantially complex workflows. If you wrap your commands in something that can reattach to running processes and use dependencies correctly, it is hard to displace.

MartijnBraam · on April 24, 2018

I've used make as an static site generator, it is actually very neat for that purpose and makes rebuilds quite fast and parallel.

davars · on April 24, 2018

I just started using godo [1] and I like it so far. Implement tasks in Go, bash, or the provided wrappers around os/exec.

Biggest advantage over a generic tool like Make is that you can import your project dependencies (such as your database configuration) into your task.

[1] https://github.com/go-godo/godo

timlyo · on April 24, 2018

Just [0] is a pretty great generic task runner incase anyone is looking for one. It's written in Rust so it doesn't introduce any new languages (single binary install) and it uses a make inspired syntax but aims to avoid some of Make's issues.

[0] https://github.com/casey/just

ENTP · on April 24, 2018

I've used task with much success https://github.com/go-task/task

blowski · on April 24, 2018

What’s wrong with using Make for running generic tasks?

slashdev · on April 24, 2018

I use matr[1], which is lets you define your build functions/tasks as ordinary go functions. There's a complementary library for issuing shell commands.

I like keeping just one language and tooling, but the biggest benefit to me is I'm always forgetting the arcane bits of shell syntax, and using Go saves me from having to search it all the time. Despite being more verbose than bash, I find it saves me time overall.

[1] https://github.com/matr-builder/matr

barsonme · on April 24, 2018

I just write homebrew "makefiles" with Go. The language kinda doubles as a scripting language almost, so I've found it much nicer than makefiles.

reificator · on April 24, 2018

I find it helpful when CI providers run Go tooling by default, except if there's a makefile in which case they run that instead.

poodles00 · on April 24, 2018

I like Mage (https://magefile.org).

komuW · on April 24, 2018

There's mage.

https://github.com/magefile/mage

omeid2 · on April 24, 2018

A comprehensive prompt is an extremely useful and an underappreciated productivity booster.

But please, when making screencast or screenshots, use a simple prompt.

Because the information included in the prompt is often not only irrelevant, it can be actively distracting and makes reading through less pleasant by increasing the required eye-movement and cognitive effort to filter out the useless content.

Small little things makes a difference.

chrismorgan · on April 24, 2018

I wrote a script bash-for-recording for myself a few years back, for invocation by termrec, which set the window size to 80×24, set HOME to a new empty directory, set USER and NAME to dummy values (creating an actual new throwaway user account would be better here, but I was lazy), set TERM to xterm-256color (I think it was), cleared the environment (env -i) and possibly one or two other things, set a deliberately very simple and obvious prompt (which sets the title as well), cleared the screen, then finally started a nice clean bash profile.

I should pull out my old laptop or my backups and resuscitate the script.

The example at http://tty-player.chrismorgan.info/ was generated using that script.

chrismorgan · on April 24, 2018

I remembered a couple of details incorrectly, but this was my bash-for-recording script:

  #!/bin/bash

  # Start a new bash shell with a severely filtered environment and no initfile.
  if [ -z "$_BFR_RUNNING" ]; then
      env -i \
          _BFR_RUNNING=1 \
          PATH="$PATH" \
          LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
          TERM="$TERM" \
          SHELL="$SHELL" \
          USER="$USER" \
          HOME="$HOME/bfr-home" \
          LANG="$LANG" \
          bash --init-file "$0" "$@" <&0
      exit $?
  else
      unset _BFR_RUNNING
  fi

  # What remains of this file is the initfile.

  USER=user
  HOSTNAME=hostname
  PS1='\n\[\033[32;45;1m\]\w\[\033[m\]\$ '
  eval "$(dircolors -b)"
  alias ls='ls --color=auto'

NetOpWibby · on April 24, 2018

This is neat! Thanks for sharing.

jakear · on April 24, 2018

The prompt is so long that you can’t actually see the command being typed on mobile, asciienema just clips the stream.

JasonFruit · on April 24, 2018

You mean `asciinema`. That, my friend, is one unfortunate typo.

jakear · on April 24, 2018

Ahh. That makes more sense. I took it as a bit of (very) crude humor.

gtirloni · on April 24, 2018

Pretty nice tool that Just Works (tm).

Although the concurrency level (512 connections) is a bit too aggressive for most servers. You'll get throttled, blocked or your backend will crash (which isn't too bad in any case, except that might not be what you were after with a link checker).

raviqqe42 · on April 24, 2018

Actually, I totally agree with you. I decided the number based on the default maximum number of open files on Linux because I was not sure about common limits of concurrent connections between clients and HTTP servers. Or, it should probably regulate numbers of requests per second sent to the same hosts. If someone suggests other options, I would adopt it.

anonacct37 · on April 24, 2018

A good default might be based on current browser behavior. Keep in mind http2 might make everything use one connection but allow 100 concurrent requests.

gtirloni · on April 24, 2018

I think your approach is fine for local dev servers.

Maybe just introduce something like an exponential backoff algorithm if you start getting too many 5xx errors or the requests are hanging.

jlgaddis · on April 24, 2018

Both Chrome and Firefox limited the number of connections to a server to six (6), if memory serves. I'm not certain if those limits have changed or if the number is different based on HTTP 1.1 versus HTTP 2.

brobinson · on April 24, 2018

The limit is per _hostname_ not per _server_ (unless things have changed in the last 10 years).

This is why you'll see assets1.foo.com, assets2.foo.com, etc. all pointing to the same IP address(es). Server-side code picks one based on a modulus of a hash of the filename or something similar when rendering the HTML to get additional pipelines in the user's browser. Not sure how or if this is done in SPA.

zdw · on April 24, 2018

How fast is this? Is there even a common benchmark for this sort of thing?

How does it compare to the older python "linkchecker", which was resurrected here (and is quite fast): https://github.com/linkchecker/linkchecker

chrismorgan · on April 24, 2018

I have found linkchecker to be unreasonably slow by default, very fragile if you try to make it any faster (e.g. occasional socket errors with almost any concurrency, regardless of the purported ulimits, in a way that suggested some kind of socket leaking to me at the time), and possessing fairly bad reporting. Also, on Windows, being built with ancient Python meant it didn’t support SNI, so I had to delve into it to figure out a way of turning TLS verification off to make it work pretty much at all, which also hints at its generally poor configurability and surprisingly poor documentation (given the fact that there is actually a moderate amount of it).

I still use linkchecker (because I’ve never completed my Rust-based link checker that I started several years ago), and have extended it at work to support client-side certificates which we use on CI, but I’m generally fairly unimpressed with linkchecker.

fwip · on April 24, 2018

Looks rad. Is there any plan to add authorization headers, so that I can test a site as a particular user?

raviqqe42 · on April 24, 2018

Thank you for your feedback! Can you open an issue on the GitHub repository? I would appreciate it if you add some concrete use cases as then it'll be clear what kinds of options need to be implemented.

bryanrasmussen · on April 24, 2018

I don't know Go, but looking at the code it doesn't look like it handles sites that are rendered on the client, if so it has limited utility in today's web.

OoooooooO · on April 24, 2018

Of course it does not, for that you need a headless browser ( = Chrome Headless).

Neither Python/Go/Java/Rust/C/C++/D/Ruby/C#/F#/... nor any other language (not even Node afaik, but Puppeteer handles that) can handle Javascript rendered sites.

jpsim · on April 24, 2018

Anyone know of an equivalent for Ruby? I’d love to add this as an integration test for jazzy[0].

Separately, it’d be awesome if this also checked that anchor links resolve to id values to validate linking to specific elements on a page.

[0]: https://github.com/realm/jazzy

jamietanna · on April 24, 2018

https://github.com/gjtorikian/html-proofer is what I use

jwilk · on April 24, 2018

Are there any good checkers for URLs in text files?

I wrote https://github.com/jwilk/urlycue , but I'm not quite happy about it, I don't have energy to improve it either; so I'm looking for alternatives.

oneeyedpigeon · on April 24, 2018

I guess one of the bigger challenges when it comes to unstructured data is identifying URLs. Is there a canonical way of identifying a URL embedded in text? Is it an impossible problem?

ethhics · on April 24, 2018

Perhaps I’m missing the question, but there is a regular expression for matching a URI. Remove the leading carat and it can match anywhere in a text.

https://tools.ietf.org/html/rfc3986#appendix-B

Edit: I see it's not quite that simple. However, I still think that with some stricter matching requirements this could work.

jwilk · on April 24, 2018

This regexp lets you parse a valid URI, but it matches also a lot things at are not URIs at all.

The URI language is of course regular, so it would be possible to construct a regexp that matches only URIs. But naively applying such regexp wouldn't work in practice, because many punctuation characters are allowed in URIs. For example, single quotes are allowed, so in this Python code the regexp would match too much:

   homepage = 'http://example.com/'

ethhics · on April 24, 2018

I see—The standard allows most characters we normally use to surround URIs. It sure does look like a difficult problem then, and one that a regexp can't solve.

andyonthewings · on April 24, 2018

I have been using a nodejs equivalent, named broken-link-checker https://github.com/stevenvachon/broken-link-checker

I run it in a CI job which builds a static site. It works pretty well for me.

gruzh · on April 24, 2018

pylinkvalidator seems to be a pretty good python alternative for this project. It's also very fast, scans 700+ URLs in under a minute https://github.com/bartdag/pylinkvalidator

thamizhan2611 · on April 24, 2018

Awesome tool. I have one small request, is it possible to put an architecture diagram connecting different moving parts like parsing, fetcher, daemon etc. IMHO this might be useful for people who are trying to go through the source code to understand how the tool functions. Thanks anyways :)

0xbadcafebee · on April 24, 2018

So, I'll bite. Why user this and not wget, curl, or any other http spider?

pknopf · on April 24, 2018

Is there a good html parser that you can pipe into/out of? Using Go, it would be easier to parse the html, and tell the difference between <p>http://somewhere.com/</p> and <a href="http://somewhere.com/">test</p>

laumars · on April 24, 2018

Personally I don't care about the difference between your two examples because a published URL is still a published URL regardless of whether it is a hyperlink or not.

However where I do care about the difference between an anchor and a paragraph block is with relative links or other URLs without the protocol prefix as those are harder to programmatically guess what is a web URL and what is an example file system path in a technical document (for example). In an ideal world the JavaScript, CSS and any web-APIs (eg JSON returns) would be executed locally to check any modern way of abstracting away URLs (page redirection et al). But that's not to say there isn't a place for a less sophisticated parser (though I would say that as I've also written a link checker similar to the one posted hehehe).

ryanlol · on April 24, 2018

Because sometimes you care about performance?

Although, based on a quick look at the code this thing isn't going to go particularly fast.

0xbadcafebee · on April 24, 2018

Like, extreme performance, or just parallelism? One example of parallelism: xargs -a urls.txt -n 5 -P 20 wget -nv --spider -T 10 -e robots=off. This will run up to 20 processes with 5 URLs each. It's not "efficient" but it's faster than nothing, and you get the whole feature set of Wget.

For more customizeable spidering, Scrapy allows you to customize a spider, and even deploy spider daemons to run in production (https://doc.scrapy.org/en/latest/topics/deploy.html). For an out-of-the-box version, try Spidy (https://github.com/rivermont/spidy). For super serious spidering, try Heritrix (https://webarchive.jira.com/wiki/spaces/Heritrix/overview) or Nutch (https://nutch.apache.org/).

Here's an interesting read on crawling a quarter billion pages in 40 hours: http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil... From my own experience crawling massive dynamic state-driven websites, even if you're trying to just grab a single page, you will eventually want the extra features.