When Good Links Go Bad: Link Rot in Ebooks

Laura Brady February 21, 2018 #eprdctn Hour, EPUB, guest posts, SECRETS, Tools 10 Comments


This is a guest post from Teresa Elsey, an #eprdctn regular.

The February 14 edition of the #eprdctn chat focused on link rot. What follows is just a piece of the whole wide-ranging conversation; as usual, many thanks to the Twitter ebook developer community and all the participants for bringing their wide range of voices and experiences!

What is link rot?

Link rot (or linkrot) is the process by which hyperlinks on individual websites or the Internet in general point to web pages, servers or other resources that have become permanently unavailable. #eprdctn https://t.co/x61130JGf9

— Ken Jones (@CircularKen) February 14, 2018

#eprdctn Yes, and its evil cousin — reference rot — occurs when the pages may still be available but the content on the page changes

— Adam Ziegler (@abziegler) February 14, 2018

I’ve written previously a bit about the problem of link rot in ebook publishing. But concisely: as the web continually changes and reorganizes itself, the URLs included in ebooks eventually become outdated and stop working. In a single nonfiction title with extensive references, this might encompass hundreds of URLs. The disappearance of cited information on the web is, of course, not a problem unique to ebooks, but it is significant to ebook developers because the clickability of ebook links makes the fault much more apparent there than in, for example, the equivalent print edition.

Prevalence and philosophy

Some of the things we know about link rot: e.g. 50% of all links in Supreme Court opinions are rotten; 70% of all links in articles from the legal field; 1 in 5 STM articles — it’s bad 🙁 Here’s a great paper on the problem … https://t.co/vEpLXwu36k

— Adam Ziegler (@abziegler) February 14, 2018

Adam Ziegler (of Perma.cc, more on that below) shared some context about the problem, pointing to the article Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. He also shared this glorious example of link rot in legal publishing.

We chatted about whose problem this is to solve. Is it an authorial imperative to cite links only when you commit to archive them? Are print publishers concerned with this problem? And are ebook developers committed to archiving their citations, or do they really just want the pesky retailer tickets informing them of broken links to go away?

#eprdctn It raises the question for how responsible should ebook designers be for making sure links still work? If not our responsibility, then whose? And if this is not held accountable in print, why so in digital?

— Katy M. Cola (@katy_mcola) February 14, 2018

A side topic ventured into how style guides suggest handling this problem. I pointed to Chicago’s take on the topic, which is just to mention that it exists. Rebecca Cremona reported that the legal Bluebook specifically encourages the use of archiving services.

“For this reason, it is important to choose the version of the URL that is most likely to continue to point to the source cited. For DOIs, see 14.8. For other options, see 14.9, 14.10, 14.11.” (2/2) #eprdctn

— Teresa Elsey (@teresaelsey) February 14, 2018

The Bluebook, used in Law, is more specific: “Archiving of Internet sources is encouraged, but only when a reliable archival tool is available. For citations to Internet sources, append the archive URL to the full citation in brackets”

— Rebecca Lynn Cremona (@RebeccaCremona) February 14, 2018

How ebook developers encounter the problem

Amazon pesters my clients about broken external links, most of which are the result of link rot. I don’t have a good answer for them besides “let me know which ones, and what to change them to.” I’ve considered recommending that links in, eg, endnotes not be made active. #eprdctn

— Keith 🅢nyder (@noteon) February 14, 2018

Keith Snyder described what the process looks like to the freelance ebook developer. The major ebook retailers submit complaints about link rot to the publisher, the publisher has to go back to the ebook developer for a fix, and it’s an endless game of whack-a-mole. (In the group’s experience, major ebook retailers are not currently suppressing titles for broken links, just reporting them as errors to the publishers.)

So I’ve dealt with link rot across various projects, but the most interesting and challenging was turning @MikeShatzkin‘s blog into a series of ebooks. I proofread and tested links in blog posts going back to 2011. #eprdctn

— Simon Collinson (@Simon_Collinson) February 14, 2018

Simon Collinson described creating a series of ebooks from Mike Shatzkin’s blog. The blog posts, dating back to 2011, of course included numerous links; Simon estimates that 30 to 40% of the links in the oldest posts had rotted and about 5% of those in the newest.

How ebook developers solve the problem

workarounds

I’m not saying #eprdctn loves a workaround, but … sometimes we love a workaround. And in the case of a decade-old book with a notes section jam-packed with dead URLs, a deceased author, and annual ebook sales in the single digits, there’s nothing that can be done except to do what you have to do to keep the book on sale.

Keith Snyder shared his pragmatic suggestion of not making URLs in ebooks clickable links (that is, including the text of the URL, but not the <a href=""> tag). Of link archiving, he says, “I don’t think it’s realistic for most smaller clients, who tend not to have the single-book budgets, the overall production volume and the discounts that come with it, the expertise, or the staff.”
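
As a rough illustration of that workaround, here is a minimal Python sketch that strips the <a> wrapper from external links in a chunk of XHTML while leaving the visible URL text and any internal links alone. The function name and the regular expression are assumptions made for this example, not anything Keith described, and a regex is a shortcut that assumes simple, non-nested anchors; a real workflow would probably use a proper XML parser.

```python
import re

# Matches an <a> element whose href starts with http:// or https://.
# Assumption: anchors are not nested and the markup is reasonably clean.
EXTERNAL_LINK = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*["\']https?://[^"\']*["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def unlink_external(xhtml: str) -> str:
    """Replace each external <a> element with its inner content."""
    return EXTERNAL_LINK.sub(r"\1", xhtml)


sample = (
    '<p>See <a href="http://example.com/a">http://example.com/a</a> '
    'and <a href="#note1">note 1</a>.</p>'
)
print(unlink_external(sample))
```

The internal link to #note1 is left intact, since only hrefs beginning with http:// or https:// are matched.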

Others mentioned encouraging authors to link only to reliable, stable sources. I wrote a paragraph for my company’s author guide suggesting that authors be thoughtful about the problem of link rot when deciding whether to use URLs extensively in their citations.

And really getting to the root: Simon Collinson pointed us to the classic Tim Berners-Lee article “Cool URIs don’t change.”

redirects

Ken Jones mentioned using a service like bit.ly or Rebrandly or a custom landing page to control and redirect all the URLs in your ebook. To do so, you would point all the links in an ebook to either your own webpage or to the service, which would redirect to the appropriate location. As that location changed, you could update those redirects for a seamless update experience for readers. A Book Apart was mentioned as a publisher that does this (looks like http://bkaprt.com/ncl/03-01/).
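
The core of a custom redirect setup like this can be very small. The sketch below is only an illustration under stated assumptions: the table contents, the example paths, and the resolve function are all hypothetical, and a real deployment would sit behind a web server or a hosted service.

```python
# Hypothetical redirect table: short path printed in the ebook -> current destination.
REDIRECTS = {
    "/mybook/03-01": "https://example.com/articles/current-location",
}


def resolve(path, table=REDIRECTS):
    """Return (status, headers) for a requested short path."""
    target = table.get(path.rstrip("/") or "/")
    if target is None:
        return 404, {}
    # 302 (temporary) rather than 301, so that clients are less likely to
    # cache a destination that may need to change again later.
    return 302, {"Location": target}


print(resolve("/mybook/03-01"))
```

When a destination moves, only the table entry changes; the link inside the ebook stays the same.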

The DOI (digital object identifier) is perhaps the most official approach to this problem. A DOI (for example, doi:10.1038/nphys1170) is a unique identifier that can be assigned to a publication such as a journal article, government document, or book chapter. It is meant to permanently enable a reader to locate the publication, generally via a URL created from the DOI (for example, http://dx.doi.org/10.1038/nphys1170). This system, of course, depends on the third-party publishers of your cited material assigning DOIs and keeping their locations up to date; in other words, DOIs may be nice to use but are not implementable on the ebook developer end.
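
Turning a DOI into a resolver URL is mechanical, as this small Python sketch suggests. The function name is made up for the example, and the default resolver, https://doi.org/, is an assumption on my part (it is the form currently recommended for DOI links, while the article's example uses the older dx.doi.org host); pass a different resolver if you prefer.

```python
from urllib.parse import quote


def doi_to_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Build a resolver URL from a DOI, with or without a 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode characters that are not safe in a URL path; keep slashes.
    return resolver + quote(doi, safe="/")


print(doi_to_url("doi:10.1038/nphys1170"))
```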

Redirects do not directly address the problem of link rot, but they make it external to the ebook. If your goal is to prevent having to make changes in the ebook file itself (that is, to pass the responsibility for managing link rot to someone other than the ebook developer), options like this may work for you.

archiving

The web has been working on the problem of its own changeability for some time, and the long-term solution is to save an archive of older versions of itself.

So one solution for ebook developers, when their links rot (or in anticipation of their doing so), is to point to an archive of the page rather than the page itself. Simon’s solution for his Shatzkin blog project was to replace the rotted links with links to the Internet Archive when possible (with a general note that he had done so). He discovered that this could be done simply by prefixing his URLs with “http://web.archive.org/web/”.
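
The prefixing trick is easy to script. This is a minimal sketch, not Simon's actual process: the function name is hypothetical, and the optional timestamp (in the YYYYMMDDhhmmss form that Wayback Machine URLs use) is an addition for cases where you want a capture near a particular date. Whether a capture exists for a given page still has to be checked by visiting the resulting URL.

```python
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(original_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine URL for original_url.

    With no timestamp, the bare prefix form is used. With a timestamp
    such as "20180214000000", the URL asks for a capture near that date.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"
    return WAYBACK_PREFIX + original_url


print(wayback_url("http://example.com/old-page"))
print(wayback_url("http://example.com/old-page", "20180214000000"))
```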

Adam pointed us to http://timetravel.mementoweb.org/, a service to help you find existing archives of webpages, whether on Internet Archive, Perma, or elsewhere.

If an archive of the page you wish to cite does not exist, the next-level solution is to create your own. This could ideally be a process located with authors; at the time they are consulting a source, they could archive the source as it appears at that moment, and then use the link to the archived page as the citation in their manuscript. (A girl can dream?)

Perma.cc is one such service (currently free to academic institutions and for a fee for commercial users). We were joined by some of the Perma.cc team (to talk about their product and for general link rot banter).

How https://t.co/dIGA1pbZlO works #eprdctn — you give Perma a URL you want to preserve for citation; Perma goes to the website and preserves the page; Perma gives you a link (like this: https://t.co/vn0oj4wsaR) you use to direct your readers to the archived page https://t.co/SxUVaXuo4I

— Adam Ziegler (@abziegler) February 14, 2018

Apparently our chatting sent up the linkrot bat signal, because Webrecorder (another online archiving service) and the Wayback Machine (Internet Archive) also dropped by.

Webrecorder is free to use, and you can download all content to view locally with our desktop app https://t.co/S9PFRDohhO Also works well with https://t.co/NvOwIv5AlD and any other web archive data (WARC format). If a service no longer up, you can still have a copy of the archive

— Webrecorder (@webrecorder_io) February 14, 2018

I mange the Wayback Machine at the @internetarchive We also run https://t.co/yrzkv4jRbM I am very interested in the topic of link rot in epubs (.mobi and .epub files, etc.) and have some ideas to make things better. Please email me at mark@archive.org Thank you!

— Mark Graham (@MarkGraham) February 14, 2018

Tool proposal

One thing I love about #eprdctn and ebook developers is their willingness to share their expertise and tools with the whole community. Ken innocently asked whether ebook editors could use some tools for link checking.

Is there a tool that auto checks URLs in ebooks? If not would you like one?… #eprdctn

— Ken Jones (@CircularKen) February 14, 2018

Chatters noted that epubcheck, FlightDeck, and InDesign provide simple URL validation only (checking that strings designated as URLs are conceivably valid URLs, but not that they actually exist). Developers speculated about what form the tool they would want would take: a Sigil plugin, a pandoc filter, a pull request to epubcheck …
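
To make the distinction concrete, here is a small Python sketch of what a purely syntactic check looks like. It is an illustration only, with a made-up function name, and is not how any of the tools named above actually work. The point is that a well-formed URL passes even if the page behind it disappeared years ago.

```python
from urllib.parse import urlsplit


def looks_like_url(candidate: str) -> bool:
    """Syntactic check only: http(s) scheme plus a host. No network access."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(looks_like_url("http://example.com/page-deleted-long-ago"))  # well formed
print(looks_like_url("http//example.com"))  # missing colon
```

Catching a rotten link requires an actual HTTP request and a look at the response status.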

And then they started speccing out the tool before our eyes.

I feel like it’s just:
1. Unpack epub
2. XSLT out the links
3. Get status codes with curl
4. Profit! #eprdctn

— Simon Collinson (@Simon_Collinson) February 14, 2018
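
Simon's four steps map onto a short script fairly directly. The sketch below is one possible reading of them in Python, swapping in the standard library's zipfile, html.parser, and urllib in place of XSLT and curl; every function and class name here is hypothetical, and it is not the tool that anyone in the chat actually built.

```python
import http.client
import urllib.error
import urllib.request
import zipfile
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects external http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(epub_file):
    """Steps 1 and 2: open the EPUB (a zip) and pull links from its (X)HTML files."""
    links = set()
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = LinkCollector()  # fresh parser per file
                collector.feed(archive.read(name).decode("utf-8", errors="replace"))
                collector.close()
                links.update(collector.links)
    return sorted(links)


def status_of(url, timeout=10):
    """Step 3: return the HTTP status code, or None if no response arrived."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "linkcheck-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 or 410
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # DNS failure, timeout, refused connection, etc.


def check_epub(epub_file):
    """Map every external link in the EPUB to its status code (or None)."""
    return {url: status_of(url) for url in extract_links(epub_file)}
```

Calling check_epub("book.epub") (a placeholder filename) would return a dictionary of URLs and status codes. Some servers reject HEAD requests or block unfamiliar user agents, so a non-200 result from a sketch like this is a prompt to look at the link by hand, not proof that it has rotted.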

India Amos came by a bit later to note that Calibre already has a link checker.

FYI, the Calibre e-book editor—which I’ve always assumed to be a fork of Sigil?—already has a link checker: pic.twitter.com/5722LLEMIW

— India (@indiamos) February 14, 2018

And Raffaele Messuti volunteered a solution within two days.

i’ve started doing this, is a command line tool to extract links from an epub and check the http status https://t.co/DzESiCU1cO
is still very raw and incomplete (maybe bugged). you can download it from here https://t.co/uWyFTgjKZW
i have some improvement ideas, will write soon

— Raffaele Messuti (@atomotic) February 16, 2018

Rust Never Sleeps

Link rot is a complication we should all be aware of and be educating our clients and employers about, no question. Several terrific, long-term solutions were offered by the community. Give some of them a whirl and report back in ten years about how your links survived, okay?

#Eprdctn chat is weekly on Wednesdays at 11am Eastern time; see (and suggest!) upcoming topics here.


10 Responses to “When Good Links Go Bad: Link Rot in Ebooks”

Luc Prévost says:

February 21, 2018 at 9:38 am

Thank you for the great article !

I have started using Zotero to archive the links I propose to my readers.
I will try to incorporate some of your suggestions to find a usable solution…
Raffaele Messuti (@atomotic) says:

February 21, 2018 at 12:11 pm

I would also like to suggest these links:

Epub linkrot
https://literarymachin.es/epub-linkrot/

Hyperlinks in your files? How to get them out using tikalinkextract
http://openpreservation.org/blog/2017/10/21/hyperlinks-in-your-files-how-to-get-them-out-using-tikalinkextract/
Paul Marriner says:

February 21, 2018 at 2:56 pm

Very useful article. BTW, I don’t think the Calibre link check is anything other than a URL validation. I just checked a book with dozens of external links and the “all good” came back in no more than a second or two. I eagerly await a robust drag-and-drop type app (Windows).
Teresa Elsey says:

February 22, 2018 at 7:49 pm

Thanks for this info!
Bruce says:

February 23, 2018 at 12:31 am

Again a great article and timely reminder.

Another ‘Rot’ to consider is QR Code Rot in printed content. Some publishers include QR codes in children’s books, magazines, etc. that when scanned link to video and/or audio content on various internet services including YouTube. Most of the housekeeping examples cited in the article could be applied to this issue as well.
Doug Smith says:

February 26, 2018 at 8:18 pm

We created our own URL shortener for the books we publish. Popular shorteners have gone away and caused problems before so I don’t think we can count on services like bit.ly always being around.

When a link we used changes or disappears then we can simply update it to a new one in the shortener and all existing books still work.

This discussion has me thinking that I should also build a link checker to go with our system. Then we could automatically and proactively check all of the links we’ve used in books rather than waiting for customers to report problems.
David Kudler says:

February 27, 2018 at 7:48 pm

Great article! Thanks as always.

This has been driving me nuts lately as I’ve been updating some backlist titles. Who knew big companies like Audible would change their URL syntax? ^.^

BTW, Sigil has a very helpful URL-checking plugin: https://www.mobileread.com/forums/showthread.php?t=264848

It actually does check that the URLs are current. It does give occasional false positives
Paul Marriner says:

February 28, 2018 at 2:25 pm

Teresa, I stand corrected. The Calibre link checker does its job. I checked another book and it reports the same broken links as the Sigil plugin mentioned by David Kudler.
Web Archiving Roundup: March 5, 2018 | Web Archiving Section says:

March 5, 2018 at 9:55 am

[…] When Good Links Go Bad: Link Rot in Ebooks: as Teresa Elsey writes, “As the web continually changes and reorganizes itself, the URLs included in ebooks eventually become outdated and stop working.” One solution for ebook developers? “… When their links rot, … point to an archive of the page rather than the page itself.” (Archived link.) […]
Anne-Marie Concepcion says:

March 8, 2018 at 10:45 am

Excellent article that I just had a chance to read for the first time. I had no idea the problem of link rot was so bad.

Correction for the #eprdctn “chatters” cited early on; InDesign *will* flag links that lead to a 404 Not Found page (rotten links) as well as malformed links (like “http//apple.com” … missing the colon). Both get the dreaded Red Dot of Shame in the Hyperlinks panel, (Good links get a Green Dot of Goodness.)

Though both get a tooltip that reads “URL is not available” when you hover over the dot, clicking a malformed URL will result in nothing happening, while clicking a 404 dot results in your being taken to the 404 page. Clicking on a green dot brings you to the (good) URL as well.