Unicode Character Set in JavaScript
In the Unicode Escape Sequences blog, we learned what Unicode escape sequences and normalization are. Both are important tools in JavaScript for handling characters from different writing systems and keeping text consistent. Unicode escape sequences let you write characters from various languages, while normalization gives equivalent text a uniform representation, reducing inconsistencies.
Using these tools makes JavaScript code more robust, improves compatibility across languages, and supports a more inclusive web. In this blog, we will go through the concept of the Unicode character set.
What is a Unicode Character Set?
Before we dive into Unicode, let's explore how data is stored. Data is stored in bits, which are 0s and 1s. For instance, to represent the number 28, we use the binary form 11100. To store text, each character must likewise be mapped to a number. A common starting point is ASCII, or American Standard Code for Information Interchange, which assigns numbers from 0 to 127 to basic Western characters, supporting 128 characters in total. A letter such as b is part of ASCII, but characters like 👍 (thumbs up) or é are not. For example, the string HeAlo is encoded as follows:
String | H | e | A | l | o |
---|---|---|---|---|---|
ASCII value | 72 | 101 | 65 | 108 | 111 |
Binary | 0100 1000 | 0110 0101 | 0100 0001 | 0110 1100 | 0110 1111 |
ASCII itself needs only 7 bits per character, but each character is normally stored in 8 bits, or 1 byte; this is known as ASCII encoding. But what if we want to display non-Western characters or other writing systems, like Chinese or Arabic? That's where Unicode comes in. Unicode defines well over a hundred thousand characters covering more than 150 scripts, along with symbols and emoji.
To display characters or symbols in Unicode, we use code points, which can be combined. For instance, d is represented by a single code point, 100 (U+0064). The letter é can be represented by a single code point, 233 (U+00E9), or by two code points: e, 101 (U+0065), followed by the combining acute accent, 769 (U+0301).
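To make the mapping in the table concrete, here is a minimal JavaScript sketch that turns each character of a string into its decimal code and 8-bit binary form. The helper name `toAsciiRows` is made up for this illustration; `charCodeAt`, `toString(2)`, and `padStart` are standard built-in methods.

```javascript
// Hypothetical helper: map each character to its decimal code
// and an 8-bit, zero-padded binary string.
function toAsciiRows(text) {
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    return {
      char: ch,
      decimal: code,
      binary: code.toString(2).padStart(8, "0"),
    };
  });
}

for (const row of toAsciiRows("HeAlo")) {
  console.log(`${row.char} ${row.decimal} ${row.binary}`);
}
// H 72 01001000
// e 101 01100101
// A 65 01000001
// l 108 01101100
// o 111 01101111
```

This sketch is only meaningful for characters in the 0 to 127 range; anything outside ASCII needs the code point handling that Unicode provides.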
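The two ways of writing é can be checked directly in JavaScript. The snippet below is a small sketch, and `codePointsOf` is a helper name invented for this example; `codePointAt` and `normalize` are standard string methods.

```javascript
// Hypothetical helper: list the decimal code points of a string.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

const precomposed = "\u00E9"; // é as one code point (233)
const decomposed = "e\u0301"; // e (101) + combining acute accent (769)

console.log(codePointsOf("d")); // [ 100 ]
console.log(codePointsOf(precomposed)); // [ 233 ]
console.log(codePointsOf(decomposed)); // [ 101, 769 ]

// The two strings look identical on screen but are not equal...
console.log(precomposed === decomposed); // false

// ...until both are normalized to the same form.
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```

Normalizing both sides before comparing is one common way to avoid surprises when text comes from different sources.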
ASCII Table
In the section above, we learned about ASCII. Let's look at each character's decimal, hexadecimal, and binary representation:
Decimal | Hexadecimal | Binary | Char | Description |
---|---|---|---|---|
0 | 00 | 00000000 | NUL | Null |
1 | 01 | 00000001 | SOH | Start of Heading |
2 | 02 | 00000010 | STX | Start of Text |
3 | 03 | 00000011 | ETX | End of Text |
4 | 04 | 00000100 | EOT | End of Transmission |
5 | 05 | 00000101 | ENQ | Enquiry |
6 | 06 | 00000110 | ACK | Acknowledge |
7 | 07 | 00000111 | BEL | Bell |
8 | 08 | 00001000 | BS | Backspace |
9 | 09 | 00001001 | HT | Horizontal Tab |
10 | 0A | 00001010 | LF | Line Feed |
11 | 0B | 00001011 | VT | Vertical Tab |
12 | 0C | 00001100 | FF | Form Feed |
13 | 0D | 00001101 | CR | Carriage Return |
14 | 0E | 00001110 | SO | Shift Out |
15 | 0F | 00001111 | SI | Shift In |
16 | 10 | 00010000 | DLE | Data Link Escape |
17 | 11 | 00010001 | DC1 | Device Control 1 |
18 | 12 | 00010010 | DC2 | Device Control 2 |
19 | 13 | 00010011 | DC3 | Device Control 3 |
20 | 14 | 00010100 | DC4 | Device Control 4 |
21 | 15 | 00010101 | NAK | Negative Acknowledge |
22 | 16 | 00010110 | SYN | Synchronize |
23 | 17 | 00010111 | ETB | End of Transmission Block |
24 | 18 | 00011000 | CAN | Cancel |
25 | 19 | 00011001 | EM | End of Medium |
26 | 1A | 00011010 | SUB | Substitute |
27 | 1B | 00011011 | ESC | Escape |
28 | 1C | 00011100 | FS | File Separator |
29 | 1D | 00011101 | GS | Group Separator |
30 | 1E | 00011110 | RS | Record Separator |
31 | 1F | 00011111 | US | Unit Separator |
32 | 20 | 00100000 | space | Space |
33 | 21 | 00100001 | ! | exclamation mark |
34 | 22 | 00100010 | " | double quote |
35 | 23 | 00100011 | # | number |
36 | 24 | 00100100 | $ | dollar |
37 | 25 | 00100101 | % | percent |
38 | 26 | 00100110 | & | ampersand |
39 | 27 | 00100111 | ' | single quote |
40 | 28 | 00101000 | ( | left parenthesis |
41 | 29 | 00101001 | ) | right parenthesis |
42 | 2A | 00101010 | * | asterisk |
43 | 2B | 00101011 | + | plus |
44 | 2C | 00101100 | , | comma |
45 | 2D | 00101101 | - | minus |
46 | 2E | 00101110 | . | period |
47 | 2F | 00101111 | / | slash |
48 | 30 | 00110000 | 0 | zero |
49 | 31 | 00110001 | 1 | one |
50 | 32 | 00110010 | 2 | two |
51 | 33 | 00110011 | 3 | three |
52 | 34 | 00110100 | 4 | four |
53 | 35 | 00110101 | 5 | five |
54 | 36 | 00110110 | 6 | six |
55 | 37 | 00110111 | 7 | seven |
56 | 38 | 00111000 | 8 | eight |
57 | 39 | 00111001 | 9 | nine |
58 | 3A | 00111010 | : | colon |
59 | 3B | 00111011 | ; | semicolon |
60 | 3C | 00111100 | < | less than |
61 | 3D | 00111101 | = | equals sign |
62 | 3E | 00111110 | > | greater than |
63 | 3F | 00111111 | ? | question mark |
64 | 40 | 01000000 | @ | at sign |
65 | 41 | 01000001 | A | |
66 | 42 | 01000010 | B | |
67 | 43 | 01000011 | C | |
68 | 44 | 01000100 | D | |
69 | 45 | 01000101 | E | |
70 | 46 | 01000110 | F | |
71 | 47 | 01000111 | G | |
72 | 48 | 01001000 | H | |
73 | 49 | 01001001 | I | |
74 | 4A | 01001010 | J | |
75 | 4B | 01001011 | K | |
76 | 4C | 01001100 | L | |
77 | 4D | 01001101 | M | |
78 | 4E | 01001110 | N | |
79 | 4F | 01001111 | O | |
80 | 50 | 01010000 | P | |
81 | 51 | 01010001 | Q | |
82 | 52 | 01010010 | R | |
83 | 53 | 01010011 | S | |
84 | 54 | 01010100 | T | |
85 | 55 | 01010101 | U | |
86 | 56 | 01010110 | V | |
87 | 57 | 01010111 | W | |
88 | 58 | 01011000 | X | |
89 | 59 | 01011001 | Y | |
90 | 5A | 01011010 | Z | |
91 | 5B | 01011011 | [ | left square bracket |
92 | 5C | 01011100 | \ | backslash |
93 | 5D | 01011101 | ] | right square bracket |
94 | 5E | 01011110 | ^ | caret / circumflex |
95 | 5F | 01011111 | _ | underscore |
96 | 60 | 01100000 | ` | grave accent |
97 | 61 | 01100001 | a | |
98 | 62 | 01100010 | b | |
99 | 63 | 01100011 | c | |
100 | 64 | 01100100 | d | |
101 | 65 | 01100101 | e | |
102 | 66 | 01100110 | f | |
103 | 67 | 01100111 | g | |
104 | 68 | 01101000 | h | |
105 | 69 | 01101001 | i | |
106 | 6A | 01101010 | j | |
107 | 6B | 01101011 | k | |
108 | 6C | 01101100 | l | |
109 | 6D | 01101101 | m | |
110 | 6E | 01101110 | n | |
111 | 6F | 01101111 | o | |
112 | 70 | 01110000 | p | |
113 | 71 | 01110001 | q | |
114 | 72 | 01110010 | r | |
115 | 73 | 01110011 | s | |
116 | 74 | 01110100 | t | |
117 | 75 | 01110101 | u | |
118 | 76 | 01110110 | v | |
119 | 77 | 01110111 | w | |
120 | 78 | 01111000 | x | |
121 | 79 | 01111001 | y | |
122 | 7A | 01111010 | z | |
123 | 7B | 01111011 | { | left curly bracket |
124 | 7C | 01111100 | \| | vertical bar |
125 | 7D | 01111101 | } | right curly bracket |
126 | 7E | 01111110 | ~ | tilde |
127 | 7F | 01111111 | DEL | delete |
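Rows like the ones in this table do not need to be memorized, because they can be derived from the decimal code alone. The following is a rough sketch, with `asciiRow` being a name chosen only for this example:

```javascript
// Hypothetical helper: build one table row (decimal | hex | binary | char)
// from a decimal ASCII code.
function asciiRow(code) {
  const hex = code.toString(16).toUpperCase().padStart(2, "0");
  const binary = code.toString(2).padStart(8, "0");
  const char = String.fromCharCode(code);
  return `${code} | ${hex} | ${binary} | ${char}`;
}

// Print the rows for the uppercase letters A to F.
for (let code = 65; code <= 70; code++) {
  console.log(asciiRow(code));
}
// 65 | 41 | 01000001 | A
// 66 | 42 | 01000010 | B
// 67 | 43 | 01000011 | C
// 68 | 44 | 01000100 | D
// 69 | 45 | 01000101 | E
// 70 | 46 | 01000110 | F
```

Control characters (0 to 31 and 127) have no visible glyph, so the sketch is best used on the printable range 32 to 126.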
Latin-1 Table
Let’s look at the Latin-1 table, which shows the decimal and hexadecimal values of its characters. Codes 33 to 126 are the same printable characters as in the ASCII table above, so they are omitted here:
Decimal | Hexadecimal | Description |
---|---|---|
0 | 00 | null |
1 | 01 | start of heading |
2 | 02 | start of text |
3 | 03 | end of text |
4 | 04 | end of transmission |
5 | 05 | enquiry |
6 | 06 | acknowledge |
7 | 07 | bell |
8 | 08 | backspace |
9 | 09 | character tabulation |
10 | 0A | line feed |
11 | 0B | line tabulation |
12 | 0C | form feed |
13 | 0D | carriage return |
14 | 0E | shift out |
15 | 0F | shift in |
16 | 10 | datalink escape |
17 | 11 | device control one |
18 | 12 | device control two |
19 | 13 | device control three |
20 | 14 | device control four |
21 | 15 | negative acknowledge |
22 | 16 | synchronous idle |
23 | 17 | end of transmission block |
24 | 18 | cancel |
25 | 19 | end of medium |
26 | 1A | substitute |
27 | 1B | escape |
28 | 1C | file separator |
29 | 1D | group separator |
30 | 1E | record separator |
31 | 1F | unit separator |
32 | 20 | space |
127 | 7F | delete |
128 | 80 | padding character |
129 | 81 | high octet preset |
130 | 82 | break permitted here |
131 | 83 | no break here |
132 | 84 | index |
133 | 85 | next line |
134 | 86 | start of selected area |
135 | 87 | end of selected area |
136 | 88 | character tabulation set |
137 | 89 | character tabulation with justification |
138 | 8A | line tabulation set |
139 | 8B | partial line forward |
140 | 8C | partial line backward |
141 | 8D | reverse line feed |
142 | 8E | single shift two |
143 | 8F | single shift three |
144 | 90 | device control string |
145 | 91 | private use one |
146 | 92 | private use two |
147 | 93 | set transmit state |
148 | 94 | cancel character |
149 | 95 | message waiting |
150 | 96 | start of guarded area |
151 | 97 | end of guarded area |
152 | 98 | start of string |
153 | 99 | single graphic character introducer |
154 | 9A | single character introducer |
155 | 9B | control sequence introducer |
156 | 9C | string terminator |
157 | 9D | operating system command |
158 | 9E | privacy message |
159 | 9F | application program command |
160 | A0 | non-breaking space |
161 | A1 | inverted exclamation mark |
162 | A2 | cent sign |
163 | A3 | pound sterling sign |
164 | A4 | currency sign |
165 | A5 | yen sign |
166 | A6 | broken bar |
167 | A7 | section sign |
168 | A8 | diaeresis (umlaut) |
169 | A9 | copyright sign |
170 | AA | feminine ordinal |
171 | AB | left angle quote |
172 | AC | not sign |
173 | AD | soft hyphen |
174 | AE | registered sign |
175 | AF | macron |
176 | B0 | degree sign |
177 | B1 | plus-minus sign |
178 | B2 | superscript two |
179 | B3 | superscript three |
180 | B4 | acute accent |
181 | B5 | micro sign |
182 | B6 | paragraph sign (pilcrow) |
183 | B7 | middle dot |
184 | B8 | cedilla |
185 | B9 | superscript one |
186 | BA | masculine ordinal |
187 | BB | right angle quote |
188 | BC | one-fourth fraction |
189 | BD | one-half fraction |
190 | BE | three-quarter fraction |
191 | BF | inverted question mark |
192 | C0 | capital a with grave accent |
193 | C1 | capital a with acute accent |
194 | C2 | capital a with circumflex |
195 | C3 | capital a with tilde |
196 | C4 | capital a with diaeresis |
197 | C5 | capital a with ring |
198 | C6 | capital ae ligature |
199 | C7 | capital c with cedilla |
200 | C8 | capital e with grave accent |
201 | C9 | capital e with acute accent |
202 | CA | capital e with circumflex |
203 | CB | capital e with diaeresis |
204 | CC | capital i with grave accent |
205 | CD | capital i with acute accent |
206 | CE | capital i with circumflex |
207 | CF | capital i with diaeresis |
208 | D0 | capital eth |
209 | D1 | capital n with tilde |
210 | D2 | capital o with grave accent |
211 | D3 | capital o with acute accent |
212 | D4 | capital o with circumflex |
213 | D5 | capital o with tilde |
214 | D6 | capital o with diaeresis |
215 | D7 | multiplication sign |
216 | D8 | capital o with slash |
217 | D9 | capital u with grave accent |
218 | DA | capital u with acute accent |
219 | DB | capital u with circumflex |
220 | DC | capital u with diaeresis |
221 | DD | capital y with acute accent |
222 | DE | capital thorn |
223 | DF | small sharp s |
224 | E0 | small a with grave accent |
225 | E1 | small a with acute accent |
226 | E2 | small a with circumflex |
227 | E3 | small a with tilde |
228 | E4 | small a with diaeresis |
229 | E5 | small a with ring |
230 | E6 | small ae ligature |
231 | E7 | small c with cedilla |
232 | E8 | small e with grave accent |
233 | E9 | small e with acute accent |
234 | EA | small e with circumflex |
235 | EB | small e with diaeresis |
236 | EC | small i with grave accent |
237 | ED | small i with acute accent |
238 | EE | small i with circumflex |
239 | EF | small i with diaeresis |
240 | F0 | small eth |
241 | F1 | small n with tilde |
242 | F2 | small o with grave accent |
243 | F3 | small o with acute accent |
244 | F4 | small o with circumflex |
245 | F5 | small o with tilde |
246 | F6 | small o with diaeresis |
247 | F7 | division sign |
248 | F8 | small o with slash |
249 | F9 | small u with grave accent |
250 | FA | small u with acute accent |
251 | FB | small u with circumflex |
252 | FC | small u with diaeresis |
253 | FD | small y with acute accent |
254 | FE | small thorn |
255 | FF | small y with diaeresis |
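The first 256 Unicode code points line up with the Latin-1 table, so the values above can be used directly in JavaScript. Here is a small sketch; `isLatin1` is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: true when every code point fits in the
// Latin-1 range (0 to 255).
function isLatin1(text) {
  return Array.from(text).every((ch) => ch.codePointAt(0) <= 0xff);
}

// Build characters from their decimal or hexadecimal values in the table.
console.log(String.fromCharCode(233)); // é
console.log(String.fromCharCode(0xa9)); // ©
console.log(String.fromCharCode(0xe7)); // ç

console.log(isLatin1("caf\u00E9")); // true
console.log(isLatin1("\u20AC")); // false (the euro sign is outside Latin-1)
```

A check like this can be handy before passing text to an API that only accepts Latin-1, although which APIs need it depends on your environment.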
Unicode helps websites work well with different languages, making the online experience more connected and accessible. In coding, knowing and using Unicode is essential for creating websites that speak a global language.