Objective-C era

NSRange has a very simple API. Finding a range, replacing, splitting, or chopping a string are just a few of the tricks it offers off the bat for simple string manipulation. Objective-C and its idiomatic NSRange API look like this:

public struct _NSRange {
    public var location: Int
    public var length: Int
}
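To see that simplicity in action, here is a minimal find-and-replace sketch (the string and names are just illustrative):

import Foundation

let greeting: NSString = "Hello, World"
let range = greeting.range(of: "World")   // NSRange(location: 7, length: 5)
if range.location != NSNotFound {
    // Replace the found range with new text.
    greeting.replacingCharacters(in: range, with: "Swift")   // "Hello, Swift"
}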

Let's say we wanted to extract just the name from a string we got.

let a: NSString = "name: Bj P. Kandel"
let name = a.substring(from: ("name: " as NSString).length)   // "Bj P. Kandel"

Swift 3 era

let aSwift = "name: Bj P. Kandel"
let nameSwift = aSwift.substring(from: <String.Index>)
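A working Swift 3 version might look like the sketch below, building the index by offsetting from startIndex; we will unpack what String.Index actually is in a moment.

let start = aSwift.index(aSwift.startIndex, offsetBy: "name: ".characters.count)
let nameSwift3 = aSwift.substring(from: start)   // "Bj P. Kandel"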

So what is the mess with String.Index?

NSString (and its NSRange) is not Unicode-aware. Swift intends to have great support for Unicode. But do we care?

If you like emoji, then you do.

let emojiOBJC: NSString = "🤓"
emojiOBJC.length   // 2

You can see the emoji is actually 1 character to you and your user. But NSString doesn't correspond to that natural understanding: it thinks it's 2 characters. And if we substring the emoji in the middle, we get the familiar replacement symbol for an unknown representation.

emojiOBJC.substring(from: 1)   // �

Why?

NSString uses UTF-16, which encodes a character into memory as 16-bit code units. When reading, each 16-bit unit is treated as 1 character, so finding the length of a string means counting 16-bit units. Straightforward. Shall we think a bit more?
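We can see this directly by dumping the UTF-16 code units: the single 🤓 is stored as a surrogate pair of two 16-bit units, which is exactly what length counted.

Array("🤓".utf16)   // [55358, 56595] — a surrogate pair (0xD83E, 0xDD13)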

16 bits == 2^16 possibilities == 65536 distinct characters that can be represented uniquely

However, there are roughly 6,500 spoken languages in the world today, and about 2,000 of those have fewer than 1,000 speakers. The most popular language in the world is Mandarin Chinese. These 16 bits cannot hold all the characters from all of those languages, plus emoji.

That's why Swift Strings were made more Unicode-correct. Unicode is not limited to 16 bits (or 64 bits) per character. It doesn't matter to the user whether a 😘 takes 32 bits or 128 bits (just an example) in memory; for you, me, and every other developer, it's 1 character after all.

Hence, counting X bits of memory to find the number of characters went POOF! length no longer made sense.

let swiftyEmoji = emojiOBJC as String
swiftyEmoji.characters.count   // 1
swiftyEmoji.utf16.count        // 2 :: like the Obj-C length

If you think Swift treats every character as 32 bits of memory, that's wrong too. We don't care how Swift stores them; the interface it provides is what we care about.

"go".characters.count   // 2

For us developers, "go" is a 2-character String, and so is "🍻👯". Swift manages the details for us. String.characters provides the most Unicode-aware interface. However, feel free to visit the UTF16 and UTF8 views; remember, those are just a VIEW into the String.

"go".utf16.count        // 2
"go".utf8.count         // 2
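As a quick sketch with the two-emoji string from above, each view reports a different count, and only characters matches what a user perceives:

let cheers = "🍻👯"
cheers.characters.count       // 2 — user-perceived characters
cheers.unicodeScalars.count   // 2 — one Unicode scalar per emoji
cheers.utf16.count            // 4 — each emoji is a surrogate pair
cheers.utf8.count             // 8 — each emoji takes 4 bytes in UTF-8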

Okay, let's move on to substringing some Swifty Strings. And with Swift came the Swifty Range.

Swifty String Manipulation

public struct Range<Bound : Comparable> {
    public let lowerBound: Bound
    public let upperBound: Bound
    public init(uncheckedBounds bounds: (lower: Bound, upper: Bound))
}

So the substring operation becomes:

"Mr. X".substring(from: <String.Index>)

As we discussed, we cannot treat a Swift String as an array of fixed-length bits, like a C char* where each char is 8 bits. Thus, to avoid chopping our emoji in half, we need a Unicode-safe way to point into a string. Swift provides String.Index. You cannot get an Int out of a String.Index.
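Putting it together, here is a minimal Swift 3 sketch (the sample string is illustrative): we either offset an index from startIndex, or let range(of:) hand us a Range<String.Index>, so we never land in the middle of an emoji.

import Foundation

let line = "name: Bj P. Kandel 🤓"

// Offset by Characters, not bytes: "name: " is 6 user-perceived characters.
let start = line.index(line.startIndex, offsetBy: 6)
line.substring(from: start)        // "Bj P. Kandel 🤓"

// Or let Foundation find the Range<String.Index> for us.
if let found = line.range(of: "Bj P. Kandel") {
    line.substring(with: found)    // "Bj P. Kandel"
}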


Some Observations

Should we use the simple NSRange and NSString, or go through the Swifty pain to be Unicode-aware?

This all depends on your use case. If you are sure there are no special characters or emoji involved in the text, then UTF-16 suffices for English-language content. However, the more cryptic and global your content is, the more precaution is needed. After all, Swift does all the heavy lifting; you just don't get an Int for lowerBound and upperBound. Why not deal with it?
