When Good Links Go Bad: Link Rot in Ebooks

This is a guest post from Teresa Elsey, an #eprdctn regular.


The February 14 edition of the #eprdctn chat focused on link rot. What follows is just a piece of the wide-ranging conversation; as usual, many thanks to the Twitter ebook developer community and all the participants for bringing their wide range of voices and experiences!

What is link rot?

I’ve written a bit previously about the problem of link rot in ebook publishing. In brief: as the web continually changes and reorganizes itself, the URLs included in ebooks eventually become outdated and stop working. In a single nonfiction title with extensive references, this might affect hundreds of URLs. The disappearance of cited information on the web is, of course, not a problem unique to ebooks, but it is significant to ebook developers because the clickability of ebook links makes the fault much more apparent there than in, for example, the equivalent print edition.

Prevalence and philosophy

Adam Ziegler (of Perma.cc, more on that below) shared some context about the problem, pointing to the article Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. He also shared this glorious example of link rot in legal publishing.

We chatted about whose problem this is to solve. Is it an authorial imperative to cite links only when you commit to archive them? Are print publishers concerned with this problem? And are ebook developers committed to archiving their citations, or do they really just want the pesky retailer tickets informing them of broken links to go away?

A side topic ventured into how style guides suggest handling this problem. I pointed to Chicago’s take on the topic, which is just to mention that it exists. Rebecca Cremona reported that the legal Bluebook specifically encourages the use of archiving services.

How ebook developers encounter the problem

Keith Snyder described what the process looks like to the freelance ebook developer. The major ebook retailers submit complaints about link rot to the publisher, the publisher has to go back to the ebook developer for a fix, and it’s an endless game of whack-a-mole. (In the group’s experience, major ebook retailers are not currently suppressing titles for broken links, just reporting them as errors to the publishers.)

Simon Collinson described creating a series of ebooks from Mike Shatzkin’s blog. The blog posts, dating back to 2011, of course included numerous links; Simon estimates that 30 to 40% of the links in the oldest posts had rotted and about 5% of those in the newest.

How ebook developers solve the problem

workarounds

I’m not saying #eprdctn loves a workaround, but … sometimes we love a workaround. And in the case of a decade-old book with a notes section jam-packed with dead URLs, a deceased author, and annual ebook sales in the single digits, sometimes all you can do is whatever it takes to keep the book on sale.

Keith Snyder shared his pragmatic suggestion of not making URLs in ebooks clickable links (that is, including the text of the URL, but not the <a href=""> tag). Of link archiving, he says, “I don’t think it’s realistic for most smaller clients, who tend not to have the single-book budgets, the overall production volume and the discounts that come with it, the expertise, or the staff.”
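For anyone curious what that workaround could look like in practice, here is a minimal sketch in Python. It assumes the ebook's XHTML uses double-quoted href attributes and that only external (http or https) links should be unlinked; the function name and the regular expression are my own illustration, not anything Keith shared.

```python
import re

# Assumption: hrefs are double-quoted, as in typical EPUB XHTML.
# Only external links (http:// or https://) match; internal links
# such as href="#note1" are left untouched.
EXTERNAL_LINK = re.compile(
    r'<a\b[^>]*\bhref="https?://[^"]*"[^>]*>(.*?)</a>',
    flags=re.IGNORECASE | re.DOTALL,
)


def unlink_external_urls(xhtml: str) -> str:
    """Replace each external <a> element with its inner text.

    The visible URL text stays in the book; it just stops being clickable.
    """
    return EXTERNAL_LINK.sub(r"\1", xhtml)


sample = (
    '<p>See <a href="http://example.com/page">http://example.com/page</a> '
    'and <a href="#n1">note 1</a>.</p>'
)
print(unlink_external_urls(sample))
```

A regular expression is a blunt tool for markup, so for anything beyond a quick batch fix a real XML parser would be the safer choice.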

Others mentioned encouraging authors to link only to reliable, stable sources. I wrote a paragraph for my company’s author guide suggesting that authors be thoughtful about the problem of link rot when deciding whether to use URLs extensively in their citations.

And really getting to the root: Simon Collinson pointed us to the classic Tim Berners-Lee article “Cool URIs don’t change.”

redirects

Ken Jones mentioned using a service like bit.ly or Rebrandly or a custom landing page to control and redirect all the URLs in your ebook. To do so, you would point all the links in an ebook to either your own webpage or to the service, which would redirect to the appropriate location. As that location changed, you could update those redirects for a seamless update experience for readers. A Book Apart was mentioned as a publisher that does this (looks like http://bkaprt.com/ncl/03-01/).
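The core of a redirect setup like this is just a lookup table that lives outside the ebook. The sketch below is a hypothetical illustration, not how bit.ly, Rebrandly, or A Book Apart actually implement it: the paths, the example.com targets, and the resolve function are all made up for the example.

```python
# Hypothetical redirect table: the stable path printed in the ebook
# maps to wherever the cited page currently lives.
REDIRECTS = {
    "/mybook/01-01/": "https://example.com/original-article",
    "/mybook/01-02/": "https://example.com/another-source",
}


def resolve(path, table=REDIRECTS):
    """Return (status, headers) the way a tiny redirector might answer a request."""
    target = table.get(path)
    if target is None:
        return 404, {}
    # 302 (temporary) is used here on the assumption that the target may
    # change again, so clients should not cache the redirect permanently.
    return 302, {"Location": target}


# When a cited page moves, only the table changes; the ebook file does not.
REDIRECTS["/mybook/01-01/"] = "https://example.com/moved-article"
print(resolve("/mybook/01-01/"))
```

The same table could sit behind a web server, a static-site redirect file, or a link shortener's dashboard; the point is that the ebook only ever contains the left-hand side.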

The DOI (digital object identifier) is perhaps the most official approach to this problem. A DOI (for example, doi:10.1038/nphys1170) is a unique identifier that can be assigned to a publication such as a journal article, government document, or book chapter. It is meant to permanently enable a reader to locate the publication, generally via a URL created from the DOI (for example, http://dx.doi.org/10.1038/nphys1170). This system, of course, depends on the third-party publishers of your cited material assigning DOIs and keeping their locations up to date; in other words, DOIs may be nice to use but are not implementable on the ebook developer end.
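If your citations already carry DOIs, turning them into links is mechanical. This is a small sketch under a couple of assumptions: the helper name is mine, and the resolver address is the one used in the example above, passed as a parameter so it can be swapped for whatever resolver address you prefer.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Build a resolver URL from a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are not URL-safe, keeping the slash
    # that separates the DOI prefix from its suffix.
    return resolver + quote(doi, safe="/")


print(doi_to_url("doi:10.1038/nphys1170"))
```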

Redirects do not directly address the problem of link rot, but they make it external to the ebook. If your goal is to prevent having to make changes in the ebook file itself (that is, to pass the responsibility for managing link rot to someone other than the ebook developer), options like this may work for you.

archiving

The web has been working on the problem of its own changeability for some time, and the long-term solution is to save an archive of older versions of itself.

So one solution for ebook developers, when their links rot (or in anticipation of their doing so), is to point to an archive of the page rather than the page itself. Simon’s solution for his Shatzkin blog project was to replace the rotted links with links to the Internet Archive when possible (with a general note that he had done so). He discovered that this could be done by simply prefixing his URLs with “http://web.archive.org/web/” (no trailing period).
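The prefixing trick is easy to script. The sketch below is my own illustration rather than Simon's actual process: the function names are hypothetical, the optional timestamp (in the archive's YYYYMMDDhhmmss style) is an assumption on my part, and the href rewriting assumes double-quoted attributes.

```python
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(url, timestamp=""):
    """Prefix a URL so it points at the Internet Archive's copy.

    Assumption: an optional timestamp such as "20170214000000" asks for
    the capture closest to that moment.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{url}"
    return WAYBACK_PREFIX + url


def archive_rotted(xhtml, rotted_urls):
    """Rewrite href attributes for known-dead URLs only, leaving live links alone."""
    for url in rotted_urls:
        xhtml = xhtml.replace(f'href="{url}"', f'href="{wayback_url(url)}"')
    return xhtml


page = (
    '<a href="http://example.com/old">old</a> '
    '<a href="http://example.com/live">live</a>'
)
print(archive_rotted(page, {"http://example.com/old"}))
```

This only helps if the archive actually holds a capture of the page, so it is worth spot-checking the rewritten links before shipping the file.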

Adam pointed us to http://timetravel.mementoweb.org/, a service to help you find existing archives of webpages, whether on Internet Archive, Perma, or elsewhere.

If an archive of the page you wish to cite does not exist, the next-level solution is to create your own. This could ideally be a process located with authors; at the time they are consulting a source, they could archive the source as it appears at that moment, and then use the link to the archived page as the citation in their manuscript. (A girl can dream?)

Perma.cc is one such service (currently free to academic institutions and for a fee for commercial users). We were joined by some of the Perma.cc team (to talk about their product and for general link rot banter).

Apparently our chatting sent up the linkrot bat signal, because Webrecorder (another online archiving service) and the Wayback Machine (Internet Archive) also dropped by.

Tool proposal

One thing I love about #eprdctn and ebook developers is their willingness to share their expertise and tools with the whole community. Ken innocently asked whether ebook editors could use some tools for link checking.

Chatters noted that epubcheck, FlightDeck, and InDesign provide simple URL validation only (checking that strings designated as URLs are conceivably valid URLs, but not that the pages they point to actually exist). Developers speculated about what form the tool they would want would take: a Sigil plugin, a pandoc filter, a pull request to epubcheck …
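The difference between validating a URL and checking that it resolves can be shown in a few lines of Python. This is only a sketch of the distinction, not the tool the group specced or any existing checker; the function names, the HEAD-request approach, and the status labels are my assumptions.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def is_wellformed(url):
    """Syntactic validation only: does this look like an http(s) URL?"""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_status(url, timeout=10):
    """Ask the server about the URL; return the HTTP status, or None if unreachable."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "linkcheck-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error such as 404
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, and so on


def check_link(url, fetch=fetch_status):
    """Classify a link as malformed, unreachable, broken, or ok."""
    if not is_wellformed(url):
        return "malformed"
    status = fetch(url)
    if status is None:
        return "unreachable"
    return "ok" if status < 400 else f"broken ({status})"


# A URL missing its colon fails validation without any network request.
print(check_link("http//apple.com"))
```

Some servers reject HEAD requests or block unfamiliar user agents, so a real checker would likely need a GET fallback and some tolerance for false alarms. The fetch parameter is there so the classification logic can be exercised with canned responses instead of live requests.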

And then they started speccing out the tool before our eyes.

India Amos came by a bit later to note that Calibre already has a link checker.

And Raffaele Messuti volunteered a solution within two days.

Rust Never Sleeps

Link rot is a complication we should all be aware of, and should be educating our clients and employers about, no question. Several terrific long-term solutions were offered by the community. Give some of them a whirl and report back in ten years about how your links survived, okay?


#Eprdctn chat is weekly on Wednesdays at 11am Eastern time; see (and suggest!) upcoming topics here.

10 Responses to “When Good Links Go Bad: Link Rot in Ebooks”

  1. Luc Prévost says:

    Thank you for the great article !

    I have started using Zotero to archive the links I propose to my readers.
    I will try to incorporate some of your suggestions to find a usable solution…

  2. Paul Marriner says:

    Very useful article. BTW, I don’t think the Calibre link check is anything other than a URL validation. I just checked a book with dozens of external links and the “all good” came back in no more than a second or two. I eagerly await a robust drag-and-drop type app (Windows).

  3. Teresa Elsey says:

    Thanks for this info!

  4. Bruce says:

    Again a great article and timely reminder.

    Another ‘Rot’ to consider is QR Code Rot in printed content. Some publishers include QR codes in children’s books, magazines, etc. that when scanned link to video and/or audio content on various internet services including YouTube. Most of the housekeeping examples cited in the article could be applied to this issue as well.

  5. Doug Smith says:

    We created our own URL shortener for the books we publish. Popular shorteners have gone away and caused problems before so I don’t think we can count on services like bit.ly always being around.

    When a link we used changes or disappears then we can simply update it to a new one in the shortener and all existing books still work.

    This discussion has me thinking that I should also build a link checker to go with our system. Then we could automatically and proactively check all of the links we’ve used in books rather than waiting for customers to report problems.

  6. David Kudler says:

    Great article! Thanks as always.

    This has been driving me nuts lately as I’ve been updating some backlist titles. Who knew big companies like Audible would change their URL syntax? ^.^

    BTW, Sigil has a very helpful URL-checking plugin: https://www.mobileread.com/forums/showthread.php?t=264848

    It actually does check that the URLs are current. It does give occasional false positives.

  7. Paul Marriner says:

    Teresa, I stand corrected. The Calibre link checker does its job. I checked another book and it reports the same broken links as the Sigil plugin mentioned by David Kudler.

  8. […] When Good Links Go Bad: Link Rot in Ebooks: as Teresa Elsey writes, “As the web continually changes and reorganizes itself, the URLs included in ebooks eventually become outdated and stop working.” One solution for ebook developers? “… When their links rot, … point to an archive of the page rather than the page itself.” (Archived link.) […]

  9. Excellent article that I just had a chance to read for the first time. I had no idea the problem of link rot was so bad.

    Correction for the #eprdctn “chatters” cited early on: InDesign *will* flag links that lead to a 404 Not Found page (rotten links) as well as malformed links (like “http//apple.com” … missing the colon). Both get the dreaded Red Dot of Shame in the Hyperlinks panel. (Good links get a Green Dot of Goodness.)

    Though both get a tooltip that reads “URL is not available” when you hover over the dot, clicking a malformed URL will result in nothing happening, while clicking a 404 dot results in your being taken to the 404 page. Clicking on a green dot brings you to the (good) URL as well.