Automating your regexes with Sublime Text

This is a guest post from Simon Collinson (@Simon_Collinson), a digital editor at Canelo Digital Publishing, where he corrects all the OCR.


If your workflow is anything like mine, you have a toolkit of regular expressions you use to clean up text and markup. Maybe they’re in your head, maybe they’re written down in a document somewhere, maybe you write them afresh every time (because remembering regex syntax is super fun, right?).

I’m here to tell you that there’s a better way. Of course, if you’re Dave Cramer and you use BBEdit, you already know this. As Dave helpfully pointed out the other day, saving regexes is baked into BBEdit, in a handy drop-down in the search box:

(It won’t help you if you’re trying to type them on a French keyboard, however…)

So if you’re the kind of person who uses BBEdit, or some other inferior text editor, you can stop reading now. But what if you use the correct editor, Sublime Text? In that case, RegReplace is your saviour. RegReplace is a free package which allows you to define common regular expressions – which it calls rules – and run them as commands from Sublime’s Command Palette.

So how does it work? If you’ve used Sublime’s Command Palette before, go ahead and install the package. If not, have no fear! It’s a simple three-step process.

Step One: Install RegReplace

If you haven’t tried Sublime’s package manager, Package Control, before, you might be surprised by how easy it is to use. Just press Command + Shift + P (Ctrl + Shift + P on Windows and Linux) and the Command Palette will drop down from the top of the window:

The Sublime Text Command Palette. It is your friend

Type ‘install package’, choose ‘Package Control: Install Package’ and hit enter, then type ‘RegReplace’ and the package will install itself.

Step Two: Define your rules

This is the fun part! Grab all your regexes, open up RegReplace’s package settings, find the file called ‘Rules – User’ and whack them in a JSON object called ‘replacements’.

Yo dawg, I heard you liked articles…

Beware: if you’re used to the regex syntax in Sublime’s own search panel, you might get confused. RegReplace doesn’t use Sublime’s search engine, but rather re, the regular expression module in Python’s standard library, whose syntax you’ll be familiar with if you’ve written regular expressions in other contexts (e.g. InDesign). Because the rules live inside JSON strings, you’ll need to escape every backslash with a second backslash, and decide whether you want each regex to be case-sensitive or not.
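To see what the escaping actually does, here is a minimal sketch in plain Python (not RegReplace's own code). It parses one rule written as JSON, shows that the doubled backslashes collapse to single ones, and compiles the result with re. The helper compile_rule is hypothetical, and mapping "case": false to re.IGNORECASE is my assumption about how that option behaves.

```python
import json
import re

# A rule as it would appear in a settings file: every regex backslash is
# doubled, because JSON itself uses the backslash as an escape character.
rule_json = r'''
{
    "find": "\\.\\s?\\.\\s?\\.",
    "replace": "…",
    "case": true
}
'''

def compile_rule(rule):
    """Hypothetical helper: compile a rule's "find" pattern with re.

    Assumption: "case": false means case-insensitive matching.
    """
    flags = 0 if rule.get("case", True) else re.IGNORECASE
    return re.compile(rule["find"], flags)

rule = json.loads(rule_json)
pattern = compile_rule(rule)

print(rule["find"])  # the pattern re actually sees, with single backslashes
cleaned = pattern.sub(rule["replace"], "Wait. . . what...")
print(cleaned)
```

The same pattern typed straight into Python would be written as the raw string r"\.\s?\.\s?\.", which is a handy way to test a regex before doubling its backslashes for the settings file.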

Here are a few of my rules, which I use for cleaning up OCR text:

{
    "replacements":
    {
        "correct_ellipses":
        {
            "find": "\\.\\s?\\.\\s?\\.", // spaces inside ellipsis are optional
            "name": "correct_ellipses",
            "replace": "…"
        },
        "fi_ligature":
        {
            "find": "ﬁ", // the single-character ligature (U+FB01), not the letters f and i
            "greedy": true,
            "replace": "fi"
        },
        "stops_to_commas":
        {
            "case": true, // case-sensitive, so a full stop before a capital letter is left alone
            "find": "([a-z])\\. ([a-z])",
            "greedy": true,
            "replace": "\\1, \\2"
        },
        "incorrect_para_breaks":
        {
            "case": true,
            "find": "\n\n([a-z])", // paragraphs starting with a lowercase letter are unlikely
            "greedy": true,
            "replace": "\n\\1"
        },
        "educate_quotes_in_contractions":
        {
            "case": true,
            "find": "([A-Za-z])[‘']([a-z])",
            "greedy": true,
            "replace": "\\1’\\2"
        }
    }
}

As you can see, it’s possible (and good practice!) to comment more complex regexes so your colleagues can understand them in future, because as we all know, regex syntax is crazy, and doubly so where backslashes have to be escaped. It’s also a good idea to try and give your regexes reasonably descriptive names – mine need a little more work.
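One way to check rules like the ones above before saving them is to run them over a scrap of sample text in plain Python. This is only a sketch of what I understand the rules to do, not RegReplace's own code: apply_rule is a hypothetical helper, the sample sentence is invented, and I am assuming that "greedy": true means "replace every match".

```python
import re

# The same patterns as in the settings file, written as Python raw strings,
# so the backslashes are not doubled here.
rules = {
    "correct_ellipses": {"find": r"\.\s?\.\s?\.", "replace": "…"},
    "educate_quotes_in_contractions": {
        "find": r"([A-Za-z])[‘']([a-z])",
        "replace": r"\1’\2",
    },
    "incorrect_para_breaks": {"find": r"\n\n([a-z])", "replace": r"\n\1"},
}

def apply_rule(rule, text):
    """Hypothetical helper: replace every match of rule["find"] in text."""
    return re.sub(rule["find"], rule["replace"], text)

sample = "I don't know. . .\n\nand it isn‘t clear"  # invented OCR-style input
result = sample
for name, rule in rules.items():
    result = apply_rule(rule, result)
    print(f"after {name}: {result!r}")
```

Printing the text after each rule makes it easy to spot which rule is responsible when the output is not what you expected.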

It’s worth noting that there’s also a built-in rule definition tool, which some people may prefer to use. I found it gave me too many options, but if you’re comfortable with Python you might find it easier than writing a bunch of JSON by hand.

Step Three: Define your commands

Finally, you need to define the commands which invoke the rules, and once again this is done through the Package Settings menu. One of the things that makes Sublime great is that you can define commands to do just about anything. However, if – like me – you haven’t previously defined many Sublime commands, you may not realise that here the JSON object needs to be enclosed in square brackets: the commands file holds a list (a JSON array) of command objects, even when there is only one. I kid you not, figuring this out accounted for 95% of the time I spent setting up RegReplace.

Here’s my command which invokes all of the rules above, plus a few more:

[
// OCR cleanup
    {
        "caption": "RegReplace: OCR cleanup",
        "command": "reg_replace",
        "args": {
            "replacements": [
                // letter misreads
                "fi_ligature",
                "fl_ligature",
                "double_l",
                "fully",

                // punctuation
                "correct_ellipses",
                "stops_to_commas",
                "educate_quotes_in_contractions",
                "spaced_hyphens_to_ens",

                // paragraphs
                "incorrect_para_breaks"
            ]
        }
    }
]
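A command can only run rules that exist, so a typo in a rule name is easy to miss. The following sketch (plain Python, run outside Sublime) compares a command's list against the defined rule names. The helper missing_rules is hypothetical, and both data structures are trimmed-down copies of the examples in this post rather than anything read from the real settings files.

```python
# Rule names defined in the rules file (trimmed to the ones shown in this post).
defined_rules = {
    "correct_ellipses",
    "fi_ligature",
    "stops_to_commas",
    "incorrect_para_breaks",
    "educate_quotes_in_contractions",
}

# One command, in the same shape as an entry in the commands file.
command = {
    "caption": "RegReplace: OCR cleanup",
    "command": "reg_replace",
    "args": {
        "replacements": [
            "fi_ligature",
            "correct_ellipses",
            "stops_to_comma",  # deliberate typo for the demonstration
            "incorrect_para_breaks",
        ]
    },
}

def missing_rules(command, defined):
    """Hypothetical helper: list referenced rule names that are not defined."""
    return [name for name in command["args"]["replacements"] if name not in defined]

print(missing_rules(command, defined_rules))
```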

Obviously, the relationship between commands and rules can be 1:1 or 1:many, although I suspect it’s a bad idea to string too many together in a row. You should also think carefully about what order to run the rules in – I try to move from letter-reading errors, to punctuation, then words and finally paragraphs. As with rules, you can and should comment everything.
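To illustrate why order matters, here is a small sketch in plain Python using two of the rules above. It assumes the ligature rule searches for the single-character ligature ﬁ (U+FB01), and run_rules is a hypothetical stand-in for what a command does when it works through its list in sequence.

```python
import re

# (find, replace) pairs; both patterns are matched case-sensitively in this sketch.
rules = {
    # Assumption: the OCR produced the one-character ligature U+FB01.
    "fi_ligature": ("\ufb01", "fi"),
    "stops_to_commas": (r"([a-z])\. ([a-z])", r"\1, \2"),
}

def run_rules(names, text):
    """Hypothetical stand-in for a command: apply the named rules in order."""
    for name in names:
        find, replace = rules[name]
        text = re.sub(find, replace, text)
    return text

sample = "a long day. \ufb01nally home"  # invented OCR-style input

letters_first = run_rules(["fi_ligature", "stops_to_commas"], sample)
punctuation_first = run_rules(["stops_to_commas", "fi_ligature"], sample)

print(letters_first)
print(punctuation_first)
```

When the punctuation rule runs first, the ligature is not yet a plain letter, so [a-z] cannot match it and the stray full stop survives. Fixing letters first gives the later rules clean text to work on.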

Once you’ve defined a command, you can call it by opening the Command Palette – Command + Shift + P – and typing any part of the caption, as defined in your commands file.

Step Four: Profit!

And that’s it! If you’re using the world’s greatest (and Australian-made!) text editor, as you should be, hopefully this helps you save some time – I’d love to hear what works for you.

3 Responses to “Automating your regexes with Sublime Text”

  1. Aaron Troia says:

    Great post Simon! I’ve been using RegReplace for a while now and it works really well! It can take a little time to get used to the syntax and to get your regexes set up (depending on how many you need; I set up quite a number of rules for different projects), but once you do it’s definitely worth it.

  2. Naomi says:

    Team BBEdit here: RegReplace replicates BBEdit’s Text Factories, not the saved search functionality.

    Although the ability to save a series of static regex patterns is great, the key difference is that BBEdit’s save functionality loads your pattern into the native Find/Replace dialog, allowing you to tweak it before running.

    (The inability to load pre-defined patterns into Sublime find/replace is the only reason I don’t use it. I haven’t found a package that does what I want. I should look into writing something myself, but I’m just not that good yet.)
