
UTF32 to UTF16 Surrogate Pairs

10/11/22

Recently I had to make an Icon component that could accept UTF32 characters to display as SVG glyphs. It seemed pretty simple at first: I already had an array of Unicode symbols I would plug in, and my component would present them in an ADA-friendly way.

Then I noticed the list contained several character codes outside the UTF16 range, and just as I feared, these got garbled when I tried to run them through JavaScript. JavaScript was built long before emojis and the need for UTF32, and to this day its strings only support UTF16 (sort of).
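You can see that (sort of) for yourself in any console. Here is a quick sketch, using the grinning face emoji U+1F600 as an assumed test character: a single emoji reports a length of two, because strings are measured in UTF16 code units rather than characters.

[JavaScript]
// A single emoji is stored as two UTF16 code units
"😀".length;                      // 2
"😀".charCodeAt(0).toString(16);  // "d83d" (high surrogate)
"😀".charCodeAt(1).toString(16);  // "de00" (low surrogate)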

Eventually, as UTF32 became more popular, the Unicode maintainers came up with a solution. There were several planes, or ranges of higher character codes, that weren't in use. While these could never support a one-to-one mapping to UTF32, they decided to encode the supplementary characters using pairs of 16-bit code units. After some searching I finally found a table of these surrogate ranges.
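For reference, these are the reserved ranges, with values taken from the Unicode standard (the constant names below are just mine). Each range holds 1024 values, so a pair covers a little over a million supplementary code points.

[JavaScript]
// Reserved surrogate ranges (values from the Unicode standard)
const HIGH_SURROGATE_MIN = 0xd800; // first half of a pair
const HIGH_SURROGATE_MAX = 0xdbff;
const LOW_SURROGATE_MIN  = 0xdc00; // second half of a pair
const LOW_SURROGATE_MAX  = 0xdfff;
// Each range holds 0x400 (1024) values, so a pair can address
// 1024 * 1024 = 1,048,576 supplementary code points.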

Now, I figured surely there is a built-in solution for supporting UTF32 in modern JavaScript, and there is. The native String object has a static method called fromCodePoint, which is supported by all major modern browsers.
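A quick comparison in the console, again assuming U+1F600 as the example, shows the difference: fromCodePoint takes the full value, while fromCharCode truncates each argument to 16 bits, which is exactly the kind of garbling I was seeing.

[JavaScript]
// fromCodePoint understands full code points above 0xFFFF
String.fromCodePoint(0x1f600);  // "😀"
// fromCharCode truncates each argument to 16 bits, so the same
// value comes out as a completely different character
String.fromCharCode(0x1f600);   // "\uf600" - not the emoji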

Sure enough, I was able to plug in my UTF32 characters and get back the two UTF16 characters that make up my glyphs. Cool, but it left me curious. Before fromCodePoint, the only way to use UTF32 characters in JavaScript was fromCharCode. So how did the math work to split a UTF32 value into something fromCharCode could use?
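For the curious, fromCharCode does produce the right glyph once you hand it the two halves yourself. The pair below is the standard UTF16 encoding of U+1F600; where those two numbers come from is the rest of this post.

[JavaScript]
// fromCharCode is happy as long as the code point is already split
// into its high and low surrogate halves
String.fromCharCode(0xd83d, 0xde00);                                    // "😀"
String.fromCharCode(0xd83d, 0xde00) === String.fromCodePoint(0x1f600);  // true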

After much digging I found an example of the algorithm on the Unicode Consortium website. The only issue was that it was not well documented as to why certain operations were being performed. I needed to know more.

After much searching, trial, and error, I came across a Wikipedia article that helped me understand just why all of this math was necessary and how it works to split up the bits. I ended up writing a well-commented example function for converting UTF32 code points to UTF16-friendly characters. While this function seems to work well, I would recommend the Mozilla polyfill if you need something like this in production.

[JavaScript]
/*
 * @summary: Take a Unicode character code. If it is in the 16-bit range do
 * nothing. If it is in the 32-bit range, convert it to a JavaScript-friendly
 * UTF16 surrogate pair.
 * @param c: A character code (number) to convert to a JavaScript-friendly character.
 */
function asUTF16(c) {
  // Minimum value for the high surrogate (first 10 bits)
  const surh_base = 0xd800;
  // Minimum value for the low surrogate (last 10 bits)
  const surl_base = 0xdc00;
  // Row size for surrogate characters (2^10)
  const sur_int = 0x400;
  // The supplementary offset to remove before the math operations
  const xbit = 0x10000;
  // The stripped-down value for calculating the surrogates
  let c_base = c - xbit;
  // First value determined by the left 10 bits plus the high surrogate base
  let c1 = Math.floor(c_base / sur_int) + surh_base;
  // If c1 falls below the surrogate range, c was already a 16-bit character,
  // so return it unchanged
  if (c1 < surh_base) return String.fromCharCode(c);
  // Second value determined by the right 10 bits plus the low surrogate base
  let c2 = (c_base % sur_int) + surl_base;
  // Return our 32-bit character as a surrogate pair
  return String.fromCharCode(c1, c2);
}
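A few quick calls, with expected outputs assuming the function above, show it handling both ranges and agreeing with the built-in:

[JavaScript]
// A 16-bit character passes straight through
asUTF16(0x41);     // "A"
// A 32-bit character comes back as a surrogate pair
asUTF16(0x1f600);  // "😀"
// Matches the built-in conversion
asUTF16(0x1f600) === String.fromCodePoint(0x1f600);  // true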
