Feature #6891

Monitor external broken links on our website

Added by intrigeri 2014-03-10 12:18:51 . Updated 2019-12-09 17:20:24 .

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
Start date:
2014-03-10
Due date:
% Done:

0%

Feature Branch:
Type of work:
Website
Blueprint:

Starter:
1
Affected tool:
Deliverable for:

Description

It would be great if someone prepared whatever is needed (scripts, cronjob line, email output) to monitor outgoing broken links on the Tails website (e.g. links on our website pointing to a third-party resource that does not exist anymore), and send useful reports of it regularly to some email address.

Ideally, it would be good to cache old results in order to report newly broken links and links that were already broken last time separately (a bit like apticron does).

Once the basics are ready, we will want to turn the whole thing into a Puppet module, and deploy it on our infrastructure, but as a first step, preparing things without Puppet, as long as there is some setup documentation, would be enough.

It’s important to avoid the Not Invented Here syndrome, as we don’t want to maintain a big new chunk of software forever. Most likely, existing tools can be reused extensively. It might even be that Puppet modules to do the whole thing can be found.

Current command line used:

I’m ignoring /contribute for now. Let’s start with more exposed sections of the website and then move on to more internal ones.

linkchecker --file-output=csv/tails.csv --no-warnings --check-extern \
--no-follow-url="https://tails.boum.org/blueprint/.*" \
--no-follow-url="https://tails.boum.org/news/.*" \
--no-follow-url="https://tails.boum.org/security/.*" \
--no-follow-url="https://tails.boum.org/contribute/.*" \
https://tails.boum.org/
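
To illustrate the caching and email reporting described above, here is a minimal sketch of a wrapper around that command. It is only a sketch: the cache directory, report address, availability of a local mail command, and the assumption that linkchecker's CSV output lists the broken URLs in its first, semicolon-separated column are all assumptions, not settled choices.

#!/bin/sh
# Sketch only: keep the previous run's results, report newly broken links
# and already-known ones separately, and mail the report (apticron-style).
set -e
workdir=/var/lib/linkchecker-reports   # assumed cache directory
report_to=reports@example.org          # assumed report address
mkdir -p "$workdir/csv" && cd "$workdir"

linkchecker --file-output=csv/tails.csv --no-warnings --check-extern \
  --no-follow-url="https://tails.boum.org/blueprint/.*" \
  --no-follow-url="https://tails.boum.org/news/.*" \
  --no-follow-url="https://tails.boum.org/security/.*" \
  --no-follow-url="https://tails.boum.org/contribute/.*" \
  https://tails.boum.org/ || true   # linkchecker exits non-zero when it finds broken links

# Reduce the CSV to a sorted, de-duplicated list of broken URLs.
grep -v '^#' csv/tails.csv | cut -d ';' -f 1 | sort -u > current.list
touch previous.list

{
  echo "Newly broken links:"
  comm -13 previous.list current.list
  echo
  echo "Links already broken last time:"
  comm -12 previous.list current.list
} | mail -s "Broken links report for tails.boum.org" "$report_to"

mv current.list previous.list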

:sajolida:


Subtasks


History

#1 Updated by geb 2014-03-29 21:02:20

Hi,

Awstats does it by default. It uses log file parsing and builds HTML reports that include a section on 404 errors, with URLs, number of accesses, and referrers.

Example:
http://noc.actux.eu.org/awstats/actux.eu.org/awstats.actux.eu.org.html
http://noc.actux.eu.org/awstats/actux.eu.org/awstats.actux.eu.org.errors404.html

Best,

#2 Updated by intrigeri 2014-03-29 23:40:56

geb wrote:
> Awstats does it by default.

This is for incoming broken links (e.g. to pages of ours that don’t exist anymore). What we need is to detect outgoing broken links (e.g. links on our website pointing to a third-party resource that does not exist anymore).

#3 Updated by intrigeri 2014-03-29 23:42:04

  • Description updated

Clarified description to avoid more similar confusion in the future.

#4 Updated by spriver 2014-10-17 07:18:22

intrigeri wrote:
> geb wrote:
> > Awstats does it by default.
>
> This is for incoming broken links (e.g. to pages of ours that don’t exist anymore). What we need is to detect outgoing broken links (e.g. links on our website pointing to a third-party resource that does not exist anymore).

I tried around a bit with the --spider mode in wget (with the recursive option activated); it detects errors like 404 or 301. Maybe this would be a start. A simple bash script would be able to do the job.
Shall I go on and create such a script, or should we use another approach?
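
For reference, a wget spider run of the kind described above could look like this (these are standard wget options; the exact invocation used is not recorded):

wget --spider --recursive --level=5 --no-directories --no-verbose \
  --output-file=spider.log https://tails.boum.org/
# spider.log should then contain the per-URL errors, and a recursive
# spider run ends with a summary of the broken links it found.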

#5 Updated by intrigeri 2014-10-17 07:36:50

> I tried around a bit with the --spider mode in wget (with the recursive option
> activated), it detects errors like 404 or 301. Maybe this would be a start. A simple
> bash script would be able to do the job.
> Shall I go on to create such a script or should we use another approach?

Thanks for working on this!

However, this does not address caching old results by itself. Aren’t there more specialized tools (than wget) that satisfy this requirement without any need for us to write and maintain custom code?

#6 Updated by spriver 2014-10-17 10:55:15

intrigeri wrote:
> > I tried around a bit with the --spider mode in wget (with the recursive option
> > activated), it detects errors like 404 or 301. Maybe this would be a start. A simple
> > bash script would be able to do the job.
> > Shall I go on to create such a script or should we use another approach?
>
> Thanks for working on this!
>
> However, this does not address caching old results by itself. Aren’t there more specialized tools (than wget) that satisfy this requirement without any need for us to write and maintain custom code?

How about: http://wummel.github.io/linkchecker/ ?
I am trying out some configuration options. What exactly do you mean by address caching? Can you explain it?

#7 Updated by sajolida 2014-10-18 00:47:52

  • Assignee set to spriver

I searched very quickly for similar tools in the Debian archive and
found three:

  • htcheck - Utility for checking web site for dead/external links
  • linkchecker - check websites and HTML documents for broken links
  • webcheck - website link and structure checker

Could you maybe have a look at at least those three and compare how
they would do at solving our particular problem?

I’m assigning this ticket to you since you started working on it. Feel
free to deassign it if you give up on it at some point. That’s no problem!

#8 Updated by spriver 2014-10-18 06:43:44

  • Assignee deleted (spriver)

I will check them all out (currently checking out linkchecker intensively). What type of errors do we want to gather? Just “404 not found”? Or also ones like “moved permanently”?

#9 Updated by spriver 2014-10-18 06:44:23

  • Assignee set to spriver

#10 Updated by intrigeri 2014-10-18 19:39:09

> What exactly do you mean by address caching? Can you explain it?

See the ticket description, where I have explained it already :)

#11 Updated by sajolida 2014-10-19 11:43:33

I would be interested in both cases, because “moved permanently” pages
are a bit more likely to end up “not found” at some point.

#12 Updated by intrigeri 2014-11-29 11:07:58

Any news?

#13 Updated by spriver 2014-12-27 19:39:50

intrigeri wrote:
> Any news?

I’m still on it, testing out all the tools. Caching is not really common…

#14 Updated by spriver 2015-02-01 19:28:48

Hi,
I tested out htcheck, linkchecker and webcheck. None of them provides caching. How extensive should the caching be? Maybe diffing of the result files would be sufficient?

#15 Updated by BitingBird 2015-02-01 21:00:04

Well, if none of them does the caching, maybe we should give up on that and verify manually whether the pages are really gone.

#16 Updated by intrigeri 2015-02-02 09:25:41

> Maybe diffing of the result files would be sufficient?

Yes, possibly. Let’s try it that way and we’ll see :)

#17 Updated by spriver 2015-02-02 10:14:45

>Yes, possibly. Let’s try it that way and we’ll see :)

I will now have a look at the best output methods of the tools and at how to prevent duplicate links (in my testing there sometimes were some)
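
As an illustration of the diffing idea discussed in the last few comments, a minimal sketch (file names are placeholders; it assumes one broken URL per line in the result files):

sort -u previous-run.txt > previous.sorted
sort -u current-run.txt > current.sorted
comm -13 previous.sorted current.sorted   # newly broken links
comm -12 previous.sorted current.sorted   # links that were already broken last time

Sorting with -u also removes the duplicate entries mentioned above.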

#18 Updated by elouann 2016-03-30 22:37:35

spriver, may I ask whether you made some progress on this?

sajolida wrote:
> I searched very quickly for similar tools in the Debian archive and
> found three:
>
> * htcheck - Utility for checking web site for dead/external links
> * linkchecker - check websites and HTML documents for broken links
> * webcheck - website link and structure checker

webcheck has not been updated since 2010: http://arthurdejong.org/webcheck/

#19 Updated by spriver 2016-04-01 16:26:52

elouann wrote:
> spriver, may I ask you if you did some progress on this?
>

I can work on this again, but feel free to assign this ticket to yourself if you want!

#20 Updated by BitingBird 2016-05-20 20:31:20

It seems that LinkChecker was broken in Debian and is now repaired (http://anarc.at/blog/2016-05-19-free-software-activities-may-2016/). Maybe worth a second look?

#21 Updated by sajolida 2016-11-17 13:21:27

  • Subject changed from Monitor broken links on our website to Monitor external broken links on our website

#22 Updated by Anonymous 2018-01-19 15:54:54

linkchecker is actively maintained in Debian indeed.

#23 Updated by Anonymous 2018-08-19 06:18:31

I actually find it horrible to have to use an entire package to do that kind of thing :( So I looked around and found this: https://www.createdbypete.com/articles/simple-way-to-find-broken-links-with-wget/ It basically crawls a page with wget, can also find broken image links, and logs all the output. That’s the downside: the output then needs to be processed by looking for 404s and 500s, but I guess it can’t be too hard to turn this into a script.
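
For reference, the approach from that article boils down to something like this (standard wget options; the log file name is a placeholder, and see intrigeri’s caveat below about parsing this kind of output):

wget --spider --recursive --page-requisites --output-file=crawl.log https://tails.boum.org/
grep -B2 ' 404 ' crawl.log   # crude filtering; the exact pattern depends on wget's log format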

#24 Updated by intrigeri 2018-08-19 08:21:23

> I actually find it horrible to have to use an entire package to do that kind of thing

I’m curious why.

FWIW I’m more concerned about the NIH syndrome that leads to writing the good old “simple script” that remains simple for about 1 hour and then becomes an abomination once it’s made suitable for the real world, than about using software that’s already written specifically for this purpose.

> https://www.createdbypete.com/articles/simple-way-to-find-broken-links-with-wget/ It’s basically crawling a page with wget, can also find broken image links and logs all the output.

I’m not sure that this checks external links.

> That’s the downside: the output then needs to be processed by looking for 404s and 500s → but I guess this can’t be too hard to turn this into a script.

Parsing non-machine-readable output makes red lights blink in my brain.

Anyway, this being said: I’ll be happy with any solution chosen by whoever decides to implement this, as long as they’re ready to maintain it :)

#25 Updated by Anonymous 2018-08-19 09:15:23

Ack! Maybe I was bitten by NIH :)

#26 Updated by sajolida 2018-09-09 18:25:54

  • Assignee changed from spriver to sajolida
  • Priority changed from Low to Normal
  • Target version set to Tails_3.10.1
  • Type of work changed from Sysadmin to Website

I’m taking this one over after many years of inactivity.

Running linkchecker https://tails.boum.org/ seems to work. I started that on a server and will check the output tomorrow.

#27 Updated by sajolida 2018-09-09 18:26:37

  • blocks Feature #15411: Core work 2018Q2 → 2018Q3: Technical writing added

#28 Updated by sajolida 2018-09-10 14:33:04

Better version:

linkchecker --file-output=csv/tails.csv --no-warnings --check-extern \
--no-follow-url="https://tails.boum.org/blueprint/.*" \
https://tails.boum.org/

#29 Updated by sajolida 2018-09-11 17:58:59

  • blocked by deleted (Feature #15411: Core work 2018Q2 → 2018Q3: Technical writing)

#30 Updated by sajolida 2018-09-11 17:59:04

  • blocks Feature #15941: Core work 2018Q4 → 2019Q2: Technical writing added

#31 Updated by sajolida 2018-09-22 22:21:25

  • blocked by deleted (Feature #15941: Core work 2018Q4 → 2019Q2: Technical writing)

#32 Updated by sajolida 2018-09-25 01:24:52

  • Feature Branch set to web/6891-broken-links

I started fixing a bunch of broken links on web/6891-broken-links. We have a lot, and most of them affect not our documentation but /news, /contribute, and /blueprint. So I won’t do that on our Technical Writing budget (that would be too much work), nor will I do it alone.

#33 Updated by intrigeri 2018-10-09 09:18:12

sajolida wrote:
> I started fixing a bunch of broken links on web/6891-broken-links.

Do you want someone to review & merge this branch?

#34 Updated by sajolida 2018-10-09 16:16:31

  • Assignee deleted (sajolida)
  • QA Check set to Ready for QA

Indeed. I started working on this some weeks ago because I had more time, but that’s not the case anymore, so it would be good to have this reviewed already.

#35 Updated by intrigeri 2018-10-10 10:44:24

  • Status changed from Confirmed to In Progress
  • Assignee set to intrigeri

#36 Updated by intrigeri 2018-10-10 14:20:38

  • Assignee deleted (intrigeri)
  • Target version deleted (Tails_3.10.1)
  • QA Check deleted (Ready for QA)
  • Feature Branch deleted (web/6891-broken-links)

sajolida wrote:
> I started fixing a bunch of broken links on web/6891-broken-links.

Looks good, merging!

> We have a lot and most of them are not affecting our documentation but /news, /contribute, and /blueprint.

I think we should teach whatever broken link tool we use to ignore /blueprint and older blog posts:

  • broken links in older /news entries don’t matter much because it’s pretty hard to find a link pointing to them and they’re of mostly historical interest anyway; but broken links in recent blog posts should be reported (probably except links to nightly.t.b.o and dl.a.b.o).
  • blueprints are work tools for contributors and don’t affect the vast majority of the public of our website

=> and then we can focus first on broken links in sections that matter more :)

> So I won’t do that on our Technical Writing budget (that would be too much work) nor will I do it alone.

Makes sense.
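
As a possible starting point for the filtering suggested above: linkchecker has --no-follow-url=REGEX (check a matching page but do not recurse into it) and --ignore-url=REGEX (only syntax-check matching URLs). A sketch, keeping in mind that excluding only the older /news entries would need some extra logic to generate the patterns:

linkchecker --no-warnings --check-extern \
  --no-follow-url="https://tails.boum.org/blueprint/.*" \
  --no-follow-url="https://tails.boum.org/news/.*" \
  https://tails.boum.org/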

#37 Updated by sajolida 2018-10-10 22:53:32

  • Description updated

#38 Updated by sajolida 2019-12-09 17:08:11

  • Description updated

The next steps regarding this could be:

  1. Fine-tune or confirm the command line given in the description of this ticket. For example, in my first runs, I get a lot of “ConnectionError” errors (instead of 404 and 403). What shall we do with them? Is the output relatively stable: do several runs return a similar list?
  2. Fix as many broken links as possible.
  3. Think about how we can automate this monitoring. For example, if the list becomes stable (not many transient errors), we could automate the runs and send a report to some list every now and then (a sketch of such a cron entry follows below).
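
A hypothetical crontab entry for step 3, assuming the check is wrapped in a script along the lines of the one sketched under the description (the script path and schedule are placeholders, and the script mails its own report):

# m h dom mon dow   command
0 4 * * 1   /usr/local/bin/check-tails-external-links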

Note that this would only check external links. Internal links are treated differently by ikiwiki and reported on https://tails.boum.org/brokenlinks/. It requires a local build of the website, with the brokenlinks plugin activated. See https://tails.boum.org/contribute/build/website/.