This is a guest post from Franco A. Alvarado, a freelance ebook developer and a project manager at a Boston-based academic publisher.
Maybe you’ve seen this script before:
import re

i = 0

def replace(m):
    global i
    i += 1
    return str(i)

file = open('toc.ncx', 'rb').read()
file = re.sub(r'(?<=<navPoint id="navpoint)(\d+)', replace, file)
i = 0
file = re.sub(r'(?<= playOrder=")(\d+)', replace, file)
open('toc.ncx', 'wb').write(file)
It might also be the case that you’ve never actually opened up the file (called, on my computer at least, renumber.py), but you’ve used it. If you haven’t used it, here is what it does: it renumbers your NCX. Ancient, I know, but perhaps you’ve wondered what else Python can do.
You just enter in
python script.py
and you’ve done something. Hopefully something you meant to do.
Specifically, I’ve used it to clean up EPUBs generated from a CMS that (long story short) did not have all the metadata it needed. Rather than manually entering all the metadata (including the ISBN and author name), dropping in the cover, and updating the NCX and NAV, I saved myself about an hour per title by writing a script that takes one second to run. Sure, that sounds like a very specific problem, but I bet you can think of things you do repetitively and, more importantly, things that you might mess up somehow. I’ve also used Python to automate sending emails to authors. Time-saving is really secondary to the amount of fine-grained control you get from running a script.
Let’s walk through how to run renumber.py, because a lot of people know it and it’s very short. The other script is about 120 lines long (but it is available on GitHub).
So, first make a folder. (Well, really first, install Python.) Put your offending misnumbered NCX into this folder. Open your favorite text editor (I recommend Sublime Text 3) and make a file called renumber.py.
I’ve rewritten the script again below, with some comments.
import re  # Import the regular expressions module. This lets you use the familiar regex syntax

i = 0  # Set up a variable called i and assign it a value of 0. This is a number.

# Below we will define a function
def replace(m):
    global i  # Call that i variable we defined above
    i += 1  # Add one to it
    return str(i)  # And return that new number as a string
The difference between the number 1 and the string “1” is that the number can be added, subtracted, multiplied, and all that, whereas “1” acts as text rather than a functional number.
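To make the number-versus-string distinction concrete, here is a small sketch. The variable names count and label are just for illustration and are not part of renumber.py.

```python
count = 1     # an integer: arithmetic works on it
label = "1"   # a string that happens to contain a digit

print(count + 1)            # 2  (addition)
print(label + "1")          # 11 (the two strings are joined, not added)
print(str(count) + label)   # 11 (str() turns the number into text first)
print(int(label) + count)   # 2  (int() turns the text back into a number)
```

This is why replace() ends with str(i): the function does its counting with a number but has to hand text back to the regex replacement.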
file = open('toc.ncx', 'rb').read()  # this creates another variable called file and assigns it the value of that ncx file in the folder.
file = re.sub(r'(?<=<navPoint id="navpoint)(\d+)', replace, file)  # this redefines that variable as a regular expression search and replace.
So this part above is really key. The regex here is searching for one or more digits that come right after the text <navPoint id="navpoint:
(?<=<navPoint id="navpoint)(\d+). The (?<=...) part is a lookbehind: that text has to be there, but it is not part of the match, so only the digits get replaced. The replacement, however, is (and this is key) the function that we defined earlier. This is not like normal regex where the replacing text is just another string of letters and \1, \2, \3.
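A function as the replacement can be seen in isolation with a toy example. The function name double and the sample text are made up for illustration; the only thing borrowed from the script is the idea of passing a function to re.sub. Python calls the function once per match and uses whatever string it returns.

```python
import re

def double(match):
    # match.group(1) is the text captured by (\d+)
    return str(int(match.group(1)) * 2)

text = 'id="navpoint3" id="navpoint10"'
result = re.sub(r'(?<=id="navpoint)(\d+)', double, text)
print(result)  # id="navpoint6" id="navpoint20"
```

The replace() function in renumber.py ignores the match it is given (the m argument) and returns the next counter value instead, which is what makes the renumbering work.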
With our i starting at 0, the first time the function runs it sets i to i + 1, which is 0 + 1 = 1. So 1 is our first replacement. Then, when the regex continues to the next match of (\d+), it calls the function again, which sets i to 1 + 1 = 2. And on and on.
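Here is the counting in action on a few made-up lines rather than a real file. The sample string is an assumption for illustration, and it is plain text rather than the contents of toc.ncx.

```python
import re

i = 0

def replace(m):
    global i
    i += 1
    return str(i)

# Three navPoints with gaps and a repeat in their numbering
sample = (
    '<navPoint id="navpoint7" playOrder="7">\n'
    '<navPoint id="navpoint7" playOrder="9">\n'
    '<navPoint id="navpoint12" playOrder="12">\n'
)

renumbered = re.sub(r'(?<=<navPoint id="navpoint)(\d+)', replace, sample)
print(renumbered)
```

After this runs, the ids read navpoint1, navpoint2, navpoint3 and i is 3. The playOrder values are still untouched, which is why the script resets i to 0 and makes a second pass.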
For the rest of it, we just do the same for the playOrder numbers, overwrite the file, and save it.
i = 0
file = re.sub(r'(?<= playOrder=")(\d+)', replace, file)
open('toc.ncx', 'wb').write(file)
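One caveat if you are on Python 3: opening the file with 'rb' gives you bytes, and re.sub raises a TypeError when a string pattern is used on bytes. Below is a hedged sketch of one way to adapt the script, not the original author's version. The function names renumber and renumber_file are hypothetical, the UTF-8 encoding is an assumption about your NCX, and itertools.count stands in for the global i.

```python
import itertools
import re

PATTERNS = (
    r'(?<=<navPoint id="navpoint)(\d+)',
    r'(?<= playOrder=")(\d+)',
)

def renumber(text):
    """Renumber navPoint ids and playOrder values, each starting from 1."""
    for pattern in PATTERNS:
        counter = itertools.count(1)  # fresh counter per pattern, like resetting i = 0
        text = re.sub(pattern, lambda m: str(next(counter)), text)
    return text

def renumber_file(path="toc.ncx"):
    # Assumes the NCX is UTF-8; newline="" leaves line endings untouched
    with open(path, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(renumber(text))
```

Calling renumber_file() in the folder that holds toc.ncx would overwrite the file in place, just like the original, so try it on a copy first.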
And that is it! This might be a bit broad and I don’t know this script as well since I did not write it, but I hope that demystifies Python a bit and encourages some of you to explore how it works.
Header image: https://www.flickr.com/photos/ejorpin/6763569379/in/photostream/
Thanks for that post. I must try it, because some years ago I tried to write something similar: I unzipped the EPUB and made some regex changes, but I was unable to zip the EPUB back up again.
Yeah, that’s one of the surprises of e-production. The mimetype file has to be the first one in the EPUB, and it has to stay uncompressed. Here’s a question about that with some answers: https://stackoverflow.com/questions/27799692/epub3-how-to-add-the-mimetype-at-first-in-archive. In the answer by PM 2Ring, the line that writes the mimetype file is newz.write(fname, fname, zipfile.ZIP_STORED).
The other Python script I mentioned (https://github.com/francofaa/RomanceEPUBCleanup/) does zip back up a valid EPUB file. This chunk of the script zips the EPUB file back up and perhaps you can use it:
import os, zipfile

def build_epub(epub_name, dir):
    dir_length = len(dir.rstrip(os.sep)) + 1
    with zipfile.ZipFile(epub_name, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for dirname, subdirs, files in os.walk(dir):
            for filename in files:
                if not filename.startswith('.'):
                    path = os.path.join(dirname, filename)
                    entry = path[dir_length:]
                    zf.write(path, entry)  # assumed final step: the pasted chunk ends before the write call
I did grab this chunk from Stack Overflow and did not have to edit it too much to make it fit my goals. That is, I believe it stands alone and can be adapted to another script without depending on what else I did in the rest of it. I added the if not filename.startswith('.') check to omit any hidden files (on macOS), since they were causing an error during the zipping process. Hope that is helpful.
Yikes. With appropriate indents of course. This link will take you to the chunk I am referring to
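For the mimetype requirement mentioned earlier in this thread, here is a minimal sketch of how the two ideas might be combined: write the mimetype entry first and uncompressed, then add everything else compressed. This is not the code from the linked repository or from the Stack Overflow answer; the function name build_epub_mimetype_first and the source_dir parameter are hypothetical, and it assumes a file named mimetype sits at the top of the unzipped EPUB folder.

```python
import os
import zipfile

def build_epub_mimetype_first(epub_name, source_dir):
    """Zip source_dir into epub_name with an uncompressed mimetype entry first."""
    with zipfile.ZipFile(epub_name, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype file goes in first and is stored, not deflated
        zf.write(os.path.join(source_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for dirname, subdirs, files in os.walk(source_dir):
            for filename in sorted(files):
                path = os.path.join(dirname, filename)
                entry = os.path.relpath(path, source_dir)
                # Skip hidden files and the mimetype we already wrote
                if filename.startswith(".") or entry == "mimetype":
                    continue
                zf.write(path, entry)
```

Keep the output file outside source_dir (for example, build_epub_mimetype_first("book.epub", "book")), so the archive does not try to include itself.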