UTF-8
UTF-8 is a variable-length character encoding standard used for electronic communication, defined by the Unicode Standard. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[1]
| Standard | Unicode Standard |
|---|---|
| Classification | Unicode Transformation Format, extended ASCII, variable-length encoding |
| Extends | US-ASCII |
| Transforms / Encodes | ISO/IEC 10646 (Unicode) |
| Preceded by | UTF-1 |
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
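As a minimal illustration of the scheme just described (the function name `encode_utf8` is hypothetical, not from the standard's text), the following Python sketch shows how the code space divides into one- to four-byte ranges, and how the single-byte range coincides exactly with ASCII:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point (a scalar value, i.e. 0..0x10FFFF
    excluding surrogates) as one to four UTF-8 bytes."""
    if cp < 0x80:                       # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    elif cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if 0xD800 <= cp <= 0xDFFF:      # surrogates are not valid scalar values
            raise ValueError("surrogate code point")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x110000:                 # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

# The first 128 code points map to single bytes identical to ASCII:
assert encode_utf8(ord("A")) == b"A"
assert encode_utf8(ord("é")) == "é".encode("utf-8")   # 2 bytes: b'\xc3\xa9'
assert encode_utf8(ord("€")) == b"\xe2\x82\xac"       # 3 bytes
assert encode_utf8(0x1F600) == "😀".encode("utf-8")    # 4 bytes
```

Because each range selects the shortest encoding that fits, this construction also never produces the "overlong" byte sequences that the standard forbids.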
UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility which lacked some features, including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.[2][3] This led to its adoption by X/Open as its specification for FSS-UTF,[4] which was first officially presented at USENIX in January 1993[5] and subsequently adopted by the Internet Engineering Task Force (IETF) in RFC 2277 (BCP 18)[6] for future internet standards work, replacing single-byte character sets such as Latin-1 in older RFCs.
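Self-synchronization, the property UTF-1 lacked, means that continuation bytes (all of the form 10xxxxxx) are unambiguously distinguishable from leading bytes, so a decoder dropped at an arbitrary offset in a stream can find the nearest character boundary by scanning back at most three bytes. A minimal sketch of this, with a hypothetical helper name `char_start`:

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the UTF-8 character
    containing buf[i], by skipping backward over continuation bytes."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:   # 0b10xxxxxx: continuation byte
        i -= 1
    return i

data = "a€b".encode("utf-8")        # b'a\xe2\x82\xacb'
assert char_start(data, 2) == 1     # byte 2 is mid-character; '€' starts at index 1
assert char_start(data, 4) == 4     # byte 4 ('b') is already a character start
```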
UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 97.8% of all web pages as of 2023, and up to 100.0% of pages for some languages.[7] Virtually all countries and languages show 95.0% or greater UTF-8 use on the web.