Frequently Asked Questions
UTF-8, UTF-16, UTF-32 & BOM
General questions, relating to UTF or Encoding Form

In its first version, from 1991 to 1995, Unicode was a 16-bit encoding, but starting with Unicode 2.0 (July, 1996), the Unicode Standard has encoded characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.

Q: Can Unicode text be represented in more than one way?

There are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. They are all able to represent all of Unicode, but they differ, for example, in the number of bits for their constituent code units. There are also compression transformations such as the one described in UTS #6: A Standard Compression Scheme for Unicode (SCSU).

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term "UCS transformation format" for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF must have a mapping for all code points (except surrogate code points). This includes reserved or unassigned code points and the 66 noncharacters.

In addition to being lossless, UTFs are unique: any given coded character sequence will always result in the same sequence of bytes for a given UTF. The SCSU compression method, even though it is reversible, is not a UTF because the same string can map to very many different byte sequences, depending on the particular SCSU compressor.

For more information on encoding forms see UTR #17: Unicode Character Encoding Model.

The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project web site. Many other libraries may have built-in converters, so you may not have to write your own.

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

None of the UTFs can generate every arbitrary byte sequence.
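The points about code unit counts, round tripping, and byte sequences that no UTF generates can be checked with a short experiment. The following is a minimal sketch that uses Python's built-in codecs rather than ICU; the helper names `code_units`, `round_trips`, and `is_valid_utf8` are hypothetical and chosen only for this illustration. It assumes the `-le` codec variants so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def round_trips(text: str, encoding: str) -> bool:
    """Check that encoding and then decoding gives back the same text."""
    return text.encode(encoding).decode(encoding) == text


def is_valid_utf8(data: bytes) -> bool:
    """Report whether a byte sequence could have been produced by UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # U+0041, U+00E9, U+20AC, U+1F600: one to four bytes in UTF-8.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", code_units(ch))

    # Noncharacters and the last code point still round trip.
    sample = "\ufffe\uffff" + chr(0x10FFFF)
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, round_trips(sample, enc))

    # The byte 0xFF never appears in UTF-8 output.
    print(is_valid_utf8(b"\xff"))
```

In this sketch a character outside the Basic Multilingual Plane, such as U+1F600, takes four 8-bit code units, two 16-bit code units, or one 32-bit code unit, which matches the ranges described above. Python's codecs refuse to encode a lone surrogate code point, in line with the exception the definition of a UTF makes for surrogates.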