Unicode Escape Sequences and Normalization in JavaScript
In the previous blog, we learned about the character set in JavaScript. A character set is like a toolkit of letters and symbols that forms a universal language for computers. Each character has a unique number, and because JavaScript is case-sensitive, using the right capitalization is key. The character set is the backbone that lets text work seamlessly across languages in web development. If you want to know more about it, visit the Character Set in JavaScript blog. In this blog, we will go through two related features: Unicode escape sequences and normalization.
What is a Unicode Escape Sequence?
Unicode escape sequences in JavaScript are a way to represent characters using only ASCII syntax. An escape sequence begins with \u followed by exactly four hexadecimal digits, which give the character's code point (strictly speaking, one UTF-16 code unit, so characters above U+FFFF need two such escapes). For example, the Unicode escape sequence \u00E9 represents the character é. Developers use these sequences for characters that are hard to type directly, or when working with characters from different languages and scripts.
Unicode escapes may also appear in comments, but because comments are ignored, the escapes there are treated as plain ASCII text and are not interpreted as Unicode. The most common form of Unicode escape is \uXXXX, where XXXX is the character's hexadecimal code point. For example, the character A is represented by the escape sequence \u0041.
Here are the escape sequences most often used in JavaScript string literals, including the two Unicode forms:
Sequence | Character represented |
---|---|
\0 | The NUL character (\u0000) |
\b | Backspace (\u0008) |
\t | Horizontal tab (\u0009) |
\n | Newline (\u000A) |
\v | Vertical tab (\u000B) |
\f | Form feed (\u000C) |
\r | Carriage return (\u000D) |
\" | Double quote (\u0022) |
\' | Apostrophe or single quote (\u0027) |
\\ | Backslash (\u005C) |
\xXX | The Latin-1 character specified by the two hexadecimal digits XX |
\uXXXX | The Unicode character specified by the four hexadecimal digits XXXX |
Let's take a look at the example of Unicode Escape Sequences
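Here is a minimal sketch of these escapes inside string literals. The variable names are only illustrative, and the \u{...} form is an ES2015 addition that is not listed in the table.

```javascript
// Each \uXXXX escape names one UTF-16 code unit by its hexadecimal value.
const letterA = "\u0041";        // same string as "A"
const cafe = "caf\u00E9";        // same string as "café"

// \xXX covers the Latin-1 range with two hexadecimal digits.
const copyrightSign = "\xA9";    // same string as "©"

// Characters above U+FFFF need a surrogate pair or the ES2015 \u{...} form.
const grinning = "\u{1F600}";
const grinningPair = "\uD83D\uDE00";

console.log(letterA);                   // A
console.log(cafe);                      // café
console.log(copyrightSign);             // ©
console.log(grinning === grinningPair); // true
console.log(grinning.length);           // 2 (two UTF-16 code units)
```

The escaped and literal spellings produce identical strings, so which one you use is purely a matter of source readability.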
Normalization
In JavaScript, normalization is the process of transforming text that contains diverse Unicode characters into a standardized form. Some characters have several Unicode representations, which can cause problems when comparing or processing strings. Normalization makes equivalent strings use the same representation, which helps prevent unexpected behavior and inconsistencies.
String.normalize()
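Before delving into the normalize() method, let's explore why it's necessary. Unicode assigns a unique numerical value, called a code point, to each character. For example, the letter A has the code point U+0041.
However, some characters can be represented by more than one sequence of code points. For instance, É exists as the single precomposed code point U+00C9, yet it can also be written as two code points: U+0045 (E) followed by U+0301 (combining acute accent).
In JavaScript, the same two representations can be written as the string literals '\u00C9' (call it str1) and '\u0045\u0301' (call it str2).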
Although both strings look identical on screen, they contain different code points. Consequently, comparing them directly yields false: the length of str1 is 1, while the length of str2 is 2. This is because str2 is made up of \u0045\u0301, where \u0045 is the first code unit and \u0301 is the second. To address this, we use the normalize() method.
String.prototype.normalize() in JavaScript lets you normalize strings using the Unicode normalization forms. There are four forms: NFD, NFC, NFKD, and NFKC. If no form is passed, NFC is used.
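The comparison described above can be sketched as follows. The names str1 and str2 come from the text, and the call to normalize() with no argument relies on its default form, NFC.

```javascript
const str1 = "\u00C9";        // É as one precomposed code point
const str2 = "\u0045\u0301";  // E followed by a combining acute accent

console.log(str1 === str2);   // false
console.log(str1.length);     // 1
console.log(str2.length);     // 2

// Normalizing both sides to the same form makes the comparison succeed.
console.log(str1.normalize() === str2.normalize()); // true
```

Normalizing both operands, not just one, is the safer habit, because you rarely know which representation incoming text uses.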
Normalization Form D (NFD)
Normalization Form D (NFD) is one of the Unicode normalization forms available for JavaScript strings. NFD breaks characters with diacritical marks down into their basic pieces: a base character followed by one or more combining marks. This is especially handy when you want to compare or process strings with the diacritics (accents) and the base characters treated as separate entities. To apply NFD to a JavaScript string, call the normalize() method with 'NFD'. For example:
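Below is a minimal sketch of NFD applied to the word Wélcomé. The helper toHex is a hypothetical name introduced here only to make the code units visible.

```javascript
// "Wélcomé" written with the precomposed é (\u00e9).
const word = "W\u00e9lcom\u00e9";
const decomposed = word.normalize("NFD");

// Hypothetical helper: list each code point as four hexadecimal digits.
const toHex = (s) =>
  Array.from(s, (ch) => ch.codePointAt(0).toString(16).padStart(4, "0")).join(" ");

console.log(word.length);        // 7
console.log(decomposed.length);  // 9
console.log(toHex(decomposed));  // 0057 0065 0301 006c 0063 006f 006d 0065 0301
```

The decomposed string still renders as Wélcomé. Only the underlying sequence changes.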
In the above example, the normalize('NFD') method call decomposes the character é (\u00e9) into its constituent parts (\u0065\u0301). The result still displays as Wélcomé, but each é is now stored as the letter e followed by a combining acute accent.
Normalization Form C (NFC)
In JavaScript, Normalization Form C (NFC) is a Unicode normalization form that composes a base character and its combining diacritics into a single precomposed character wherever one exists. When working with strings that contain diacritics (accents) and their base characters, this form is useful for guaranteeing consistency and compatibility. For example:
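Here is a minimal sketch of NFC, using héllò typed in its decomposed form as the input. The variable names are illustrative.

```javascript
// "héllò" spelled with base letters plus combining accents.
const decomposedHello = "he\u0301llo\u0300";
const composed = decomposedHello.normalize("NFC");

console.log(decomposedHello.length);          // 7
console.log(composed.length);                 // 5
console.log(composed === "h\u00e9ll\u00f2");  // true
```

NFC is the form most text on the web already uses, which is one reason it is the default for normalize().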
In the above example, the normalize('NFC') method call composes each accented character into its precomposed form. The result is héllò, where é (\u0065\u0301) is now represented as a single combined character (\u00e9).
Normalization Form KD (NFKD)
In JavaScript, Normalization Form KD (NFKD) is a Unicode normalization form that performs compatibility decomposition. This means that it decomposes not only characters with diacritics (accents), but also characters that are regarded as compatible with other characters despite having a different Unicode representation, such as ligatures. For example:
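Below is a minimal sketch of NFKD. It assumes the ligature ﬁ (U+FB01) as the compatibility character, and the sample word is made up for illustration.

```javascript
const ligature = "\uFB01"; // the single-character ligature "ﬁ"

// Canonical decomposition (NFD) leaves a compatibility character untouched.
console.log(ligature.normalize("NFD") === ligature); // true

// Compatibility decomposition (NFKD) splits it into two plain letters.
console.log(ligature.normalize("NFKD"));             // fi
console.log(ligature.normalize("NFKD").length);      // 2

// Accents are decomposed as well, just as in NFD.
const sample = "\uFB01anc\u00e9";                    // "ﬁancé"
const nfkd = sample.normalize("NFKD");
console.log(nfkd === "fiance\u0301");                // true
```

Because the ligature is replaced by ordinary letters, NFKD changes what the text contains and not just how it is encoded, so it suits searching and indexing better than storage.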
In the above example, the normalize('NFKD') method call decomposes the ligature character ﬁ (\uFB01) into the two plain letters f and i. Any characters with diacritics are also decomposed, as in NFD.
Normalization Form KC (NFKC)
In JavaScript, Normalization Form KC (NFKC) is a Unicode normalization form that applies compatibility decomposition and then composes the result, as NFC does. When working with strings that include diacritics (accents) and their base characters, as well as compatibility characters, this form is useful for maintaining compatibility and consistency. For example:
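Here is a minimal sketch of NFKC, with NFC shown next to it for contrast. The input string and variable names are illustrative.

```javascript
// The ligature "ﬁ", then "ance" with a combining acute accent on the e.
const input = "\uFB01ance\u0301";

const nfc = input.normalize("NFC");
const nfkc = input.normalize("NFKC");

// NFC composes the accent but keeps the ligature.
console.log(nfc === "\uFB01anc\u00e9"); // true

// NFKC also replaces the ligature with plain letters.
console.log(nfkc === "fianc\u00e9");    // true
console.log(nfkc);                      // fiancé

// Other compatibility characters fold the same way, e.g. fullwidth "Ａ".
console.log("\uFF21".normalize("NFKC") === "A"); // true
```

NFKC is a common choice when comparing user input such as usernames or search terms, where visually similar variants should count as the same text.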
Unicode escape sequences and normalization are essential tools in JavaScript for managing diverse characters and keeping text consistent. Escape sequences let you write characters from any language or script using plain ASCII source code, while normalization standardizes how equivalent text is represented so that comparisons behave predictably. Using both makes JavaScript code more robust across languages and contributes to a more inclusive web. In the next blog we will take a look into Unicode