ASCII (American Standard Code for Information Interchange) is a character encoding standard for electronic communication. It covers only English letters, digits (0-9), punctuation and special characters (such as !, @, #), and control codes; it does not support characters from other languages.
ASCII is a 7-bit code, so its values run from 0 to 127 (2^7 = 128 values, the largest being 2^7 - 1 = 127).
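As a quick illustration of the 7-bit range, here is a minimal Python sketch. The helper name is_ascii_code is made up for this example and is not part of any standard.

```python
# Every ASCII character maps to an integer in the range 0..127.
def is_ascii_code(value: int) -> bool:
    """Hypothetical helper: True if value fits in 7 bits (0..127)."""
    return 0 <= value <= 2**7 - 1

codes = [ord(ch) for ch in "A0!"]
print(codes)                                  # [65, 48, 33]
print(all(is_ascii_code(c) for c in codes))   # True

# A character outside ASCII cannot be encoded with the 'ascii' codec.
try:
    "é".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                              # False
```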
ASCII was primarily used in early computers and simple text files. The range from 128 to 255 is often referred to as the "extended ASCII" range. This range is not defined by the original ASCII standard but was used by various extended ASCII encodings. Different systems and languages used different extended ASCII sets, which could lead to compatibility issues. For example, IBM's Code Page 437, ISO 8859-1 (Latin-1), and Windows-1252 are all extended ASCII sets but assign different characters to values in the 128-255 range. These extended characters (128-255) are often called "upper-128 characters" or "top 128 characters" because they occupy the upper half of the values an 8-bit byte can hold (2^8 = 256, 256/2 = 128).
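The compatibility problem can be seen by decoding the same bytes under each of the three encodings named above. This is a small sketch using Python's built-in codecs; the byte values 0x80 and 0xE9 are chosen here only for illustration.

```python
# One ASCII byte (0x41 = 'A') followed by two bytes from the 128-255 range.
raw = bytes([0x41, 0x80, 0xE9])

# Decode the identical bytes with three different "extended ASCII" tables.
decoded = {name: raw.decode(name) for name in ("cp437", "latin-1", "cp1252")}

for name, text in decoded.items():
    # repr() makes invisible control characters such as U+0080 visible.
    print(f"{name:8} {text!r}")
```

The ASCII byte decodes to 'A' everywhere, while 0x80 and 0xE9 come out as different characters depending on the table.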
A byte is 8 bits. Why, then, is the phrase "8-bit byte" used for emphasis? Originally, the term "byte" did not have a standardised size. Early computers used different byte sizes, including 6, 7, 8, 9, or more bits per byte. Over time, the 8-bit byte became the standard for most modern computer systems. Today, the term "8-bit byte" is sometimes used to be explicit and avoid confusion, especially in technical contexts or when dealing with older systems or documentation that might refer to different byte sizes.
ANSI character sets extend ASCII by using the values 128-255 to include additional characters; they are often called Windows code pages. In this sense, "ANSI" refers to a family of 8-bit character encodings used primarily on Microsoft Windows. Windows-1252 is one of the most common ANSI character sets, used for Western European languages.
Original equipment manufacturer (OEM) character sets were developed for specific hardware platforms, particularly early personal computers. IBM's Code Page 437 is one example.
Double-Byte Character Sets (DBCS) use either 1 or 2 bytes to represent a character.
DBCS is primarily tailored to languages with large character sets, such as Traditional Chinese (Big5) and Japanese (Shift-JIS). DBCS was widely used before Unicode became the standard for text encoding.
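The "1 or 2 bytes" behaviour can be checked with Python's built-in shift_jis and big5 codecs. This is only a sketch; the sample characters are chosen for illustration.

```python
# (character, codec) pairs: an ASCII letter and a CJK character for each codec.
samples = [("A", "shift_jis"), ("日", "shift_jis"), ("A", "big5"), ("中", "big5")]

lengths = {}
for ch, codec in samples:
    encoded = ch.encode(codec)
    lengths[(ch, codec)] = len(encoded)
    # Show the encoded bytes in hex together with the byte count.
    print(f"{ch!r} in {codec}: {encoded.hex()} ({len(encoded)} byte(s))")
```

The ASCII letter takes a single byte in both encodings, while each CJK character takes two.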
Unicode is a universal character encoding standard designed to represent text and symbols from all the world’s writing systems. It is used across modern computer systems and software to ensure consistent representation.
Code Points: Unicode assigns a unique number (code point) to every character. A code point is written as U+ followed by hexadecimal digits; the U+ means “Unicode”. For example, the code point for 'A' is U+0041, for '中' (a Chinese character) it is U+4E2D, and '�' is U+FFFD, the replacement character. "Hello" corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F.
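The U+ notation can be reproduced in Python with ord(), which returns a character's code point as an integer. The function name code_points below is an illustrative choice, not a standard API.

```python
def code_points(text: str) -> list[str]:
    """Format each character's code point as U+XXXX (at least 4 hex digits)."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(" ".join(code_points("Hello")))  # U+0048 U+0065 U+006C U+006C U+006F
print(code_points("中"))                # ['U+4E2D']

# chr() goes the other way: from a code point back to the character.
print(chr(0x4E2D))                     # 中
```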
Code Point Range: Representing every Unicode code point takes 21 bits, which could in principle address 2^21 = 2,097,152 values (0 to 2^21 - 1). However, Unicode limits the range to U+0000 through U+10FFFF, about half of the 21-bit space (1,114,112 code points); this is the largest range that UTF-16 surrogate pairs can address. The range is divided into 17 planes, each containing 65,536 code points:
Plane 0: Basic Multilingual Plane (BMP) (U+0000 to U+FFFF)
Plane 1 - Plane 16: Supplementary Planes (U+10000 to U+10FFFF)
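The arithmetic behind these figures can be verified directly. This is a short sketch; the constant names are chosen for this example.

```python
import sys

PLANES = 17
POINTS_PER_PLANE = 2**16        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF

total = PLANES * POINTS_PER_PLANE
print(total)                         # 1114112
print(total == MAX_CODE_POINT + 1)   # True: U+0000..U+10FFFF inclusive
print(MAX_CODE_POINT.bit_length())   # 21 bits are needed for the largest value
print(2**21)                         # 2097152 values fit in 21 bits

# Python's own upper limit for chr() matches the Unicode maximum.
print(sys.maxunicode == MAX_CODE_POINT)  # True
```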
Supplementary Planes refer to additional planes of Unicode code points beyond the Basic Multilingual Plane (BMP). These planes are used for less common characters, historic scripts, emoji, and other specialised symbols.
Plane 1: Supplementary Multilingual Plane (SMP)
Plane 2: Supplementary Ideographic Plane (SIP)
Plane 3: Tertiary Ideographic Plane (TIP)
Planes 4 - 13: currently unassigned
Plane 14: Supplementary Special-purpose Plane (SSP)
Plane 15: Supplementary Private Use Area-A (SPUA-A)
Plane 16: Supplementary Private Use Area-B (SPUA-B)
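Because each plane holds 2^16 code points, the plane number of a character is simply its code point shifted right by 16 bits. The helper plane_of below is a hypothetical name used only for this sketch.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane (0..16) that a single character belongs to."""
    return ord(ch) >> 16   # equivalent to ord(ch) // 65536

# 'A' and '中' are in the BMP; U+1F600 (an emoji) is in the SMP;
# U+20000 (a CJK ideograph) is in the SIP.
for ch in ("A", "中", "\U0001F600", "\U00020000"):
    print(f"U+{ord(ch):04X} -> plane {plane_of(ch)}")
```

This prints plane 0 for the first two characters, plane 1 for U+1F600, and plane 2 for U+20000.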