Objective-C era
NSRange has a very simple API. Finding a range, replacing, splitting and chopping are a few of the tricks it offers right off the bat for simple string manipulation. The NSRange that Objective-C code uses idiomatically looks like this when imported into Swift:
public struct _NSRange {
public var location: Int
public var length: Int
}
Let's say we want to extract just the name from a JSON string we received.
let a: NSString = "name: Bj P. Kandel"
let name = a.substring(from: ("name: " as NSString).length)
Swift 3 era
let aSwift = "name: Bj P. Kandel"
let nameSwift = aSwift.substring(from: <String.Index>)
So what is the mess with String.Index?
NSString (and its NSRange) is not Unicode-aware in the way users perceive text: it counts 16-bit units, not characters. Swift intends to have great support for Unicode. But do we care?
If you like Emoji then you do.
let emojiOBJC: NSString = "🤓"
emojiOBJC.length //2
You can see the emoji is actually 1 character for you and for the user. But NSString doesn't match that natural understanding: it thinks it is 2 characters. If we were to substring the emoji, we would get the familiar replacement symbol for an unknown character.
emojiOBJC.substring(from: 1) //�
Why?
NSString uses UTF-16, which encodes text into memory in 16-bit units. When reading, each 16 bits of memory is treated as 1 character, so finding the length of a string means counting the 16-bit units. Straightforward. But let's think a bit more.
16 bits == 2^16 possibilities == 65536 distinct characters that can be represented uniquely
However, there are roughly 6,500 spoken languages in the world today (although about 2,000 of them have fewer than 1,000 speakers), and the most popular one is Mandarin Chinese. 16 bits cannot represent all the characters from all of those languages plus emojis.
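To see where the count of 2 comes from, it can help to look at the raw UTF-16 code units of the emoji. This is only a minimal sketch, assuming Swift 4 or later and nothing beyond the standard library; the variable names are made up for illustration.

```swift
// 🤓 is the single Unicode scalar U+1F913, which lies above U+FFFF
// and therefore does not fit in one 16-bit code unit.
let nerd = "🤓"

// UTF-16 stores it as a surrogate pair: two 16-bit code units.
let units = Array(nerd.utf16)
print(units.map { String($0, radix: 16) })   // ["d83e", "dd13"]

// NSString.length counts these code units, which is where the 2 comes from.
print(nerd.utf16.count)          // 2
print(nerd.unicodeScalars.count) // 1
```

Cutting the string between those two units leaves half a surrogate pair behind, which is why the substring shows up as �.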
That's why Swift String was made more Unicode-correct. Unicode does not limit 1 character to 16 bits, or to 64 bits. It doesn't matter to the user whether a 😘 takes 32 bits or 128 bits (just an example). For you, me and the other developers, it's 1 character after all.
Hence, counting X-bit chunks of memory to find the number of characters went PUFF! Length stopped making sense.
let swiftyEmoji = (emojiOBJC as String)
(emojiOBJC as String).characters.count // 1
(emojiOBJC as String).utf16.count // 2 :: like the Objc length
If you think Swift treats all characters as 32 bits of memory, that's wrong. We don't care how it stores them. The interface that Swift provides is what we care about.
"go".characters.count //2
For us developers, "go" is a 2-character String. So is "🍻👯". Swift manages the details for us. String.characters provides the most Unicode-aware interface. However, feel free to visit the UTF-16 and UTF-8 views. Remember, those are just VIEWS onto the String.
"go".utf16.count //2
"go".utf8.count //2
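The views only agree for plain ASCII text like "go". Here is a small sketch of how they drift apart, assuming Swift 4 or later (where count on the String itself plays the role of characters.count). The describe function is a hypothetical helper written for this example, not a standard API.

```swift
// Hypothetical helper: the length of a string as seen through each view.
func describe(_ s: String) -> (characters: Int, scalars: Int, utf16: Int, utf8: Int) {
    return (s.count, s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

print(describe("go"))        // (characters: 2, scalars: 2, utf16: 2, utf8: 2)
print(describe("🍻👯"))      // (characters: 2, scalars: 2, utf16: 4, utf8: 8)

// "e" followed by a combining acute accent is one Character made of two scalars.
print(describe("e\u{301}"))  // (characters: 1, scalars: 2, utf16: 2, utf8: 3)
```

Only the Character count matches what a reader would call the length in all three cases.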
Okay, let's move on and substring some Swifty Strings. Enter the Swifty Range.
Swifty String Manipulation
public struct Range<Bound : Comparable> {
public let lowerBound: Bound
public let upperBound: Bound
public init(uncheckedBounds bounds: (lower: Bound, upper: Bound))
}
So substring operation becomes:
"Mr. X".substring(from: <String.Index>)
Like we discussed, we cannot just treat a Swift String as an array of fixed-length bit chunks, like a C char* where each char is 8 bits. Thus, to not chop our emoji, we need a Unicode-safe way. Swift provides String.Index. You cannot get an Int from a String.Index directly.
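As a rough sketch of what working with String.Index looks like in practice, the name-extraction example from earlier can be written with an index computed from an offset. This assumes Swift 4 or later, where a String can be sliced with a range subscript; in Swift 3 the slicing step would be substring(from:). The variable names are made up for illustration.

```swift
let record = "name: Bj P. Kandel"

// Walk forward from startIndex by the number of Characters in the prefix.
let start = record.index(record.startIndex, offsetBy: "name: ".count)

// Slice from that index to the end and copy the slice into a new String.
let name = String(record[start...])
print(name)   // Bj P. Kandel

// An index has no Int value of its own, but its distance from startIndex
// (counted in Characters) can be computed when an Int is really needed.
let offset = record.distance(from: record.startIndex, to: start)
print(offset) // 6
```

The index is always produced by the string it will be used on, which is what keeps the offset aligned with Character boundaries.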
The details
- This will be updated in the coming days.
Some Observations
- In Swift 3 there is no public API that turns Range<Bound> to NSRange, primarily because lowerBound and upperBound are not Int and hence not convertible to Int.
- It is however trivial to make a Range from an NSRange, although you can't make a range without specifying whose Range it is. In our case it's a String.
public extension NSRange {
    func toRange(forString: String) -> Range<String.Index> {
        // Note: offsetBy counts Characters, while NSRange counts UTF-16 units;
        // the two agree only when every Character is a single UTF-16 unit.
        let lowerIndex = forString.index(forString.startIndex, offsetBy: location)
        let upperIndex = forString.index(forString.startIndex, offsetBy: location + length)
        return Range(uncheckedBounds: (lowerIndex, upperIndex))
    }
}
- As above, you need to specify whose Range it is in order to create the startIndex and endIndex.
- A Range created for one string shouldn't be used to manipulate another string directly. In essence, a range is tightly owned by its owner and applicable to that owner only.
let a = "this iz it"
let aor = a.range(of: "is")! // the "is" inside "this", at offsets 2..<4
var another = "th iz it"
another.replaceSubrange(aor, with: "IS") // OUTPUT = "thISz it"
- If we wanted to offset by +1 and use the same range, it can be done like this:
func rangeFrom(range: Range<String.Index>, forString: String, offset: Int) -> Range<String.Index> {
    // Shift both bounds by the same number of Characters within forString.
    let lowerIndex = forString.index(range.lowerBound, offsetBy: offset)
    let upperIndex = forString.index(range.upperBound, offsetBy: offset)
    return Range(uncheckedBounds: (lowerIndex, upperIndex))
}
- Here the range passed in was calculated from another string:
let nr = rangeFrom(range: aor, forString: another, offset: 1)
another.replaceSubrange(nr, with: "IS") // OUTPUT = "th IS it"
- The above rangeFrom function will produce errors such as these when the computed lower or upper bound falls outside the range of the entire forString:
//fatal error: cannot decrement invalid index
//fatal error: cannot increment beyond endIndex
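One way to avoid those crashes is to let the string check the bounds while offsetting. The sketch below is a hypothetical variant of rangeFrom (the name shiftedRange and its optional return type are assumptions made for illustration) built on the standard index(_:offsetBy:limitedBy:), which returns nil instead of trapping. It assumes the range was created from the same string it is applied to.

```swift
// Hypothetical helper: shifts both bounds of `range` by `offset` Characters
// inside `string`, or returns nil if either bound would leave the string.
func shiftedRange(_ range: Range<String.Index>, in string: String, by offset: Int) -> Range<String.Index>? {
    // When moving backwards the limit is startIndex, otherwise endIndex.
    let limit = offset < 0 ? string.startIndex : string.endIndex
    guard let lower = string.index(range.lowerBound, offsetBy: offset, limitedBy: limit),
          let upper = string.index(range.upperBound, offsetBy: offset, limitedBy: limit) else {
        return nil
    }
    return lower..<upper
}

let text = "th iz it"
let th = text.startIndex..<text.index(text.startIndex, offsetBy: 2) // covers "th"

if let moved = shiftedRange(th, in: text, by: 3) {
    print(text[moved]) // iz
}
print(shiftedRange(th, in: text, by: 100) == nil) // true
```

Returning an optional pushes the out-of-bounds case to the caller, who can decide whether to skip the edit or clamp the range.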
Should we use simple NSRange and NSString or go through the Swifty pain to be Unicode-aware?
- No support for 😎💔
- No support for Image Literals and other literals Apple will add on with time.
- Only makes sense when you reason in UTF-16
- When there is an emoji like 💔, it will be counted as 2 characters if you use the NSRange API, and it gets worse if you try to insert a space in between those 2 characters.
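For comparison, here is a sketch of how the two worlds measure the same emoji. It assumes Swift 4 or later, where Foundation provides the NSRange(_:in:) and Range(_:in:) initializers for converting between the two kinds of range; the sample string is made up.

```swift
import Foundation

let message = "I 💔 U"

// Swift side: the emoji is one Character, so its range spans one Character.
let heart = message.range(of: "💔")!
print(message[heart].count)             // 1

// NSRange side: the same span measured in UTF-16 code units.
let nsHeart = NSRange(heart, in: message)
print(nsHeart.location, nsHeart.length) // 2 2

// Converting back yields the original Swift range.
let roundTrip = Range(nsHeart, in: message)
print(roundTrip == heart)               // true
```

An NSRange of length 1 at location 2 or 3 would point at half of the emoji, which is the situation the last bullet warns about.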
This all depends on your use case. If you are sure there are no special characters or emojis involved in the text, then UTF-16 suffices for English text. However, the more cryptic and global your content is, the more precaution is needed. After all, Swift does all the heavy lifting; you just don't get the Int for lowerBound and upperBound. Why not deal with it?