https://www.swiftbysundell.com/articles/string-parsing-in-swift/

social.png

Almost every program on the planet has to deal with strings one way or another, since text is so fundamental to how we both communicate and represent various forms of data. But handling and parsing strings in a way that’s both robust and efficient can, at times, be really difficult. While some strings come in a very strict and computer-friendly format, such as JSON or XML, other strings can be much more chaotic.

This week, let’s take a look at various ways to parse and extract information from such strings, and how different techniques and APIs will yield a different set of trade-offs.

My favorite continuous integration service. Automatically build, test and distribute your app on every single commit that you make. Bitrise is fast, rock solid, and takes just a few minutes to set up. Try it for free, and see how it can improve your team’s workflow.

String and its friends

In some ways, Swift has developed a reputation for being a bit difficult to work with when it comes to string parsing. While it’s true that Swift’s String implementation doesn’t offer the same amount of convenience as many other languages do (for example, you can’t simply randomly access a given character using an integer, like string[7]), it does make it easier to write correct string parsing code.

Because while it’s nice to be able to randomly access any given character in a string based on its perceived position, the multi-lingual (or perhaps emoji-lingual?) world that we live in today makes such APIs very error prone, since the way a character is represented in a text UI is in many cases very different from how it’s actually stored within a string value.

In Swift, a string is composed of a collection of Character values, stored using the UTF-8 encoding. That means that if we iterate over a string (for example using a for loop), each element will be a Character — which might be a letter, an emoji, or some other form of character. To identify groups of characters (such as letters or numbers), we can use a CharacterSet, which can be passed to several different APIs on String and its related types.

Tokenizing usernames

Let’s say that we’re working on an app that lets several different users collaborate on a document, and that we want to implement a feature that lets users mention other people using a Twitter-like @mention syntax.

Identifying what users that were mentioned within a given string is a task that’s actually quite similar to what the Swift compiler needs to do when identifying the various parts within a string of code — a process known as lexing or tokenizing — only that our implementation will be orders of magnitude simpler, since we only need to look for one kind of token.

An initial implementation might look something like this — a computed property on String, in which we split the string up based on @ characters, drop the first element (since it’ll be the text before the first @-sign), and then compactMap over the result — identifying strings of non-empty letter-only characters:

extension String {
    var mentionedUsernames: [String] {
        let parts = split(separator: "@").dropFirst()

        // Character sets may be inverted to identify all
        // characters that are *not* a member of the set.
        let delimiterSet = CharacterSet.letters.inverted

        return parts.compactMap { part in
            // Here we grab the first sequence of letters right
            // after the @-sign, and check that it’s non-empty.
            let name = part.components(separatedBy: delimiterSet)[0]
            return name.isEmpty ? nil : name
        }
    }
}

The above implementation is quite simple, and makes use of some really nice Swift features — such as modifying collections, using compactMap to discard nil values, and so on. But it does have one problem — it requires three iterations, one to split the string based on @ characters, one to iterate over all those parts, and then one to split each part based on non-letter characters.

While each iteration is smaller than the one before it (so our algorithm’s complexity is not quite O(3N)), multiple iterations will most often result in some form of bottleneck as the input dataset grows. That could become a problem in our case, since we’re planning on applying this algorithm to documents of any size (maybe some users will work on a book together using our app, who knows?), so let’s see if we can do something to optimize it.

Rather than splitting our string up into components, and then iterating over those components, let’s instead make a single pass through our string — by iterating over its characters. While this’ll require a bit more manual parsing code, it’ll enable us to reduce our algorithm into one single iteration — like this:

extension String {
    var mentionedUsernames: [String] {
        // Setting up our state, which is any partial name that we’re
        // currently parsing, and an array of all names found.
        var partialName: String?
        var names = [String]()

        // A nested parsing function, that we’ll apply to each
        // character within the string.
        func parse(_ character: Character) {
            if var name = partialName {
                guard character.isLetter else {
                    // If we encounter a non-letter character
                    // while parsing a name, it means that the
                    // name is finished, and we can add it to
                    // our array (if non-empty):
                    if !name.isEmpty {
                        names.append(name)
                    }

                    // Reset our state, and parse the character
                    // again, since it might be an @-sign.
                    partialName = nil
                    return parse(character)
                }

                name.append(character)
                partialName = name
            } else if character == "@" {
                // Set an empty state, to signal to our above
                // code that it’s time to start parsing a name.
                partialName = ""
            }
        }

        // Apply our parsing function to each character
        forEach(parse)

        // Once we’ve reached the end, we’ll make sure to
        // capture any name that was previously found.
        if let lastName = partialName, !lastName.isEmpty {
            names.append(lastName)
        }

        return names
    }
}

Note that the above isLetter API on Character was added in Swift 5.

While the above implementation is much more complex than our initial one — it is about twice as fast (on average), which presents us with our first clear trade-off: do we opt for a simpler solution at the potential cost of performance — or do we choose to maintain a more complex, and also more efficient, algorithm?