UTF-8 is a variable-length character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode (or Universal Coded Character Set) Transformation Format 8-bit.[1]

Standard: Unicode Standard
Classification: Unicode Transformation Format, extended ASCII, variable-length encoding
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Preceded by: UTF-1

UTF-8 is capable of encoding all 1,112,064[lower-alpha 1] valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
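The one-to-four-byte behaviour and the ASCII compatibility described above can be checked with a short Python sketch. The sample characters and the helper name `utf8_length` are illustrative choices, not part of any standard; the count of 1,112,064 is reproduced on the assumption that it equals the full code space (U+0000 to U+10FFFF) minus the 2,048 surrogate code points.

```python
# Illustrative sample characters, one for each possible encoded length.
SAMPLES = [
    "\u0041",      # LATIN CAPITAL LETTER A (ASCII range)
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # EURO SIGN
    "\U0001f600",  # GRINNING FACE (outside the Basic Multilingual Plane)
]


def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character occupies."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ')}")

# ASCII text produces the same bytes whether encoded as ASCII or as UTF-8.
ascii_text = "plain ASCII text"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Code space 0x0000..0x10FFFF minus the surrogate range 0xD800..0xDFFF.
VALID_CODE_POINTS = 0x110000 - (0xDFFF - 0xD800 + 1)
print(VALID_CODE_POINTS)
```

Code points with lower values come out shorter here: the ASCII letter takes one byte, while the emoji, which lies above U+FFFF, takes four.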

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility that lacked some features, including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.[2][3] This led to its adoption by X/Open as its specification for FSS-UTF,[4] which was first officially presented at USENIX in January 1993.[5] It was subsequently adopted by the Internet Engineering Task Force (IETF) in RFC 2277 (BCP 18)[6] for future internet standards work, replacing single-byte character sets such as Latin-1 in older RFCs.

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 97.8% of all web pages, and up to 100.0% for many languages, as of 2023.[7] In virtually all countries and languages, UTF-8 accounts for 95.0% or more of the encodings used on the web.

