.TH unicode 3 "stdlib 1.19.4" "Ericsson AB" "Erlang Module Definition"
.SH NAME
unicode \- Functions for converting Unicode characters
.SH DESCRIPTION
.LP
This module contains functions for converting between different character representations\&. It converts between ISO-latin-1 characters and Unicode characters, and it can also convert between different Unicode encodings (such as UTF-8, UTF-16 and UTF-32)\&.
.LP
The default Unicode encoding for binaries in Erlang is UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data\&. In lists, Unicode data is encoded as integers, each integer representing one character and holding the Unicode codepoint for that character\&.
.LP
Unicode encodings other than integers representing codepoints or UTF-8 in binaries are referred to as "external encodings"\&. The ISO-latin-1 encoding is, in both binaries and lists, referred to as latin1 encoding\&.
.LP
It is recommended to use external encodings only for communication with external entities where this is required\&. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters\&. Latin1 encoding is supported both for backward compatibility and for communication with external entities that do not support Unicode character sets\&.
.SH DATA TYPES
.nf

\fBencoding()\fR\& = latin1
.br
           | unicode
.br
           | utf8
.br
           | utf16
.br
           | {utf16, \fBendian()\fR\&}
.br
           | utf32
.br
           | {utf32, \fBendian()\fR\&}
.br
.fi
.nf

\fBendian()\fR\& = big | little
.br
.fi
.nf

\fBunicode_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters encoded in the UTF-8 coding standard\&.
.RE
.nf

\fBchardata()\fR\& = \fBcharlist()\fR\& | \fBunicode_binary()\fR\&
.br
.fi
.nf

\fBcharlist()\fR\& =
.br
    maybe_improper_list(char() | \fBunicode_binary()\fR\& | \fBcharlist()\fR\&,
.br
                        \fBunicode_binary()\fR\& | [])
.br
.fi
.nf

\fBexternal_unicode_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters coded in a user-specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)\&.
.RE
.nf

\fBexternal_chardata()\fR\& = \fBexternal_charlist()\fR\&
.br
                    | \fBexternal_unicode_binary()\fR\&
.br
.fi
.nf

\fBexternal_charlist()\fR\& =
.br
    maybe_improper_list(char() |
.br
                        \fBexternal_unicode_binary()\fR\& |
.br
                        \fBexternal_charlist()\fR\&,
.br
                        \fBexternal_unicode_binary()\fR\& | [])
.br
.fi
.nf

\fBlatin1_binary()\fR\& = binary()
.br
.fi
.RS
.LP
A \fIbinary()\fR\& with characters coded in ISO-latin-1\&.
.RE
.nf

\fBlatin1_char()\fR\& = byte()
.br
.fi
.RS
.LP
An \fIinteger()\fR\& representing a valid latin1 character (0-255)\&.
.RE
.nf

\fBlatin1_chardata()\fR\& = \fBlatin1_charlist()\fR\& | \fBlatin1_binary()\fR\&
.br
.fi
.RS
.LP
The same as \fIiodata()\fR\&\&.
.RE
.nf

\fBlatin1_charlist()\fR\& =
.br
    maybe_improper_list(\fBlatin1_char()\fR\& |
.br
                        \fBlatin1_binary()\fR\& |
.br
                        \fBlatin1_charlist()\fR\&,
.br
                        \fBlatin1_binary()\fR\& | [])
.br
.fi
.RS
.LP
The same as \fIiolist()\fR\&\&.
.RE
.SH EXPORTS
.LP
.nf

.B
bom_to_encoding(Bin) -> {Encoding, Length}
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Bin = binary()
.br
.RS 2
A \fIbinary()\fR\& such that \fIbyte_size(Bin) >= 4\fR\&\&.
.RE
Encoding = latin1
.br
         | utf8
.br
         | {utf16, \fBendian()\fR\&}
.br
         | {utf32, \fBendian()\fR\&}
.br
Length = integer() >= 0
.br
.nf
\fBendian()\fR\& = big | little
.fi
.br
.RE
.RE
.RS
.LP
Check for a UTF byte order mark (BOM) in the beginning of a binary\&. If the supplied binary \fIBin\fR\& begins with a valid byte order mark for either UTF-8, UTF-16 or UTF-32, the function returns the encoding identified along with the length of the BOM in bytes\&.
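.LP
As an illustration of how the returned \fI{Encoding, Length}\fR\& pair might be used, the following minimal sketch skips the BOM and decodes the remaining bytes with the detected encoding\&. The module name \fIbom_demo\fR\& and the function \fIdecode_with_bom/1\fR\& are hypothetical names chosen for this example; they are not part of the \fIunicode\fR\& module\&.
.LP
.nf

-module(bom_demo).
-export([decode_with_bom/1]).

%% Hypothetical helper: detect a BOM, skip it, and decode the rest.
%% When Length is 0 nothing is skipped and Encoding is latin1.
%% An error or incomplete tuple from the conversion is returned as is.
decode_with_bom(Bin) when is_binary(Bin) ->
    {Encoding, Length} = unicode:bom_to_encoding(Bin),
    <<_Bom:Length/binary, Rest/binary>> = Bin,
    unicode:characters_to_list(Rest, Encoding).
.fi
.LP
With this sketch, \fIbom_demo:decode_with_bom(<<16#EF,16#BB,16#BF,"abc">>)\fR\& returns \fI"abc"\fR\&, as the three leading bytes are the UTF-8 BOM\&.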
.LP
If no BOM is found, the function returns \fI{latin1,0}\fR\&\&.
.RE
.LP
.nf

.B
characters_to_list(Data) -> Result
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Data = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
Result = list()
.br
       | {error, list(), RestData}
.br
       | {incomplete, list(), binary()}
.br
RestData = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_list(Data, unicode)\fR\&\&.
.RE
.LP
.nf

.B
characters_to_list(Data, InEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Data = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
InEncoding = \fBencoding()\fR\&
.br
Result = list()
.br
       | {error, list(), RestData}
.br
       | {incomplete, list(), binary()}
.br
RestData = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
.RE
.RE
.RS
.LP
Converts a possibly deep list of integers and binaries into a list of integers representing Unicode characters\&. The binaries in the input may have characters encoded as latin1 (0 - 255, one character per byte), in which case the \fIInEncoding\fR\& parameter should be given as \fIlatin1\fR\&, or as one of the UTF encodings, in which case that encoding is given as the \fIInEncoding\fR\& parameter\&. Integers in the list are allowed to be greater than 255 only when \fIInEncoding\fR\& is one of the UTF encodings\&.
.LP
If \fIInEncoding\fR\& is \fIlatin1\fR\&, the \fIData\fR\& parameter corresponds to the \fIiodata()\fR\& type, but for \fIunicode\fR\&, the \fIData\fR\& parameter can contain integers greater than 255 (Unicode characters beyond the ISO-latin-1 range), which would make it invalid as \fIiodata()\fR\&\&.
.LP
The purpose of the function is mainly to be able to convert combinations of Unicode characters into a pure Unicode string in list representation for further processing\&.
For writing the data to an external entity, the reverse function \fB\fIcharacters_to_binary/3\fR\&\fR\& comes in handy\&.
.LP
The option \fIunicode\fR\& is an alias for \fIutf8\fR\&, as this is the preferred encoding for Unicode characters in binaries\&. \fIutf16\fR\& is an alias for \fI{utf16,big}\fR\& and \fIutf32\fR\& is an alias for \fI{utf32,big}\fR\&\&. The \fIbig\fR\& and \fIlittle\fR\& atoms denote big or little endian encoding\&.
.LP
If, for some reason, the data cannot be converted, either because of illegal Unicode/latin1 characters in the list, or because of invalid UTF encoding in any binaries, an error tuple is returned\&. The error tuple contains the tag \fIerror\fR\&, a list representing the characters that could be converted before the error occurred and a representation of the characters including and after the offending integer/bytes\&. The last part is mostly for debugging as it still constitutes a possibly deep and/or mixed list, not necessarily of the same depth as the original data\&. The error occurs when traversing the list and whatever is left to decode is simply returned as is\&.
.LP
However, if the input \fIData\fR\& is a pure binary, the third part of the error tuple is guaranteed to be a binary as well\&.
.LP
Errors occur for the following reasons:
.RS 2
.TP 2
*
Integers out of range - If \fIInEncoding\fR\& is \fIlatin1\fR\&, an error occurs whenever an integer greater than 255 is found in the lists\&. If \fIInEncoding\fR\& is of a Unicode type, an error occurs whenever an integer
.RS 2
.TP 2
*
greater than \fI16#10FFFF\fR\& (the maximum Unicode character),
.LP
.TP 2
*
in the range \fI16#D800\fR\& to \fI16#DFFF\fR\& (invalid range reserved for UTF-16 surrogate pairs)
.LP
.RE
is found\&.
.LP
.TP 2
*
UTF encoding incorrect - If \fIInEncoding\fR\& is one of the UTF types, the bytes in any binaries have to be valid in that encoding\&.
Errors can occur for various reasons, including "pure" decoding errors (like the upper bits of the bytes being wrong), bytes that decode to too large a number, bytes that decode to a code point in the invalid Unicode range, or an "overlong" encoding, meaning that a number should have been encoded in fewer bytes\&. The case of a truncated UTF sequence is handled specially, see the paragraph about incomplete binaries below\&. If \fIInEncoding\fR\& is \fIlatin1\fR\&, binaries are always valid as long as they contain whole bytes, as each byte falls into the valid ISO-latin-1 range\&.
.LP
.RE
.LP
A special type of error is when no actual invalid integers or bytes are found, but a trailing \fIbinary()\fR\& consists of too few bytes to decode the last character\&. This error might occur if bytes are read from a file in chunks, or if binaries are in other ways split at points that are not UTF character boundaries\&. In this case an \fIincomplete\fR\& tuple is returned instead of the \fIerror\fR\& tuple\&. It consists of the same parts as the \fIerror\fR\& tuple, but the tag is \fIincomplete\fR\& instead of \fIerror\fR\& and the last element is always guaranteed to be a binary consisting of the first part of a (so far) valid UTF character\&.
.LP
If one UTF character is split over two consecutive binaries in the \fIData\fR\&, the conversion succeeds\&. This means that a character can be decoded from a range of binaries as long as the whole range is given as input without errors occurring\&. Example:
.LP
.nf

%% get_some_more_data/0 and handle_error/2 are application-defined.
decode_data(Data) ->
    case unicode:characters_to_list(Data,unicode) of
        {incomplete,Encoded,Rest} ->
            %% Rest is the start of a character; fetch more bytes and retry
            More = get_some_more_data(),
            Encoded ++ decode_data([Rest,More]);
        {error,Encoded,Rest} ->
            handle_error(Encoded,Rest);
        List ->
            List
    end.
.fi
.LP
Bit-strings that are not whole bytes are however not allowed, so a UTF character has to be split along 8-bit boundaries to ever be decoded\&.
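.LP
The return shapes described above can be made concrete with a few literal inputs\&. The following is a minimal sketch, not a test suite; the module name \fIunicode_results_demo\fR\& and the function \fIrun/0\fR\& are hypothetical names chosen for this example\&. Each clause matches the result of a call against the tuple or list that the description above leads one to expect\&.
.LP
.nf

-module(unicode_results_demo).
-export([run/0]).

run() ->
    %% Truncated input: 16#C3 opens a two-byte UTF-8 character
    %% that is never completed, so an incomplete tuple is returned.
    {incomplete, "a", <<16#C3>>} =
        unicode:characters_to_list(<<$a, 16#C3>>, utf8),
    %% Invalid input: 16#FF is never a legal byte in UTF-8. As the
    %% input is a pure binary, the third element is a binary too.
    {error, "a", <<16#FF, $b>>} =
        unicode:characters_to_list(<<$a, 16#FF, $b>>, utf8),
    %% The same character split over two consecutive binaries
    %% in the input list is decoded successfully.
    [$a, 16#E5] =
        unicode:characters_to_list([<<$a, 16#C3>>, <<16#A5>>], utf8),
    %% An integer greater than 255 is out of range for latin1.
    {error, "a", _Rest} =
        unicode:characters_to_list([$a, 256], latin1),
    ok.
.fi
.LP
Calling \fIunicode_results_demo:run()\fR\& returns \fIok\fR\& when every match holds; a failed match raises a \fIbadmatch\fR\& error instead\&.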
.LP
If any parameters are of the wrong type, the list structure is invalid (a number as tail) or the binaries do not contain whole bytes (bit-strings), a \fIbadarg\fR\& exception is raised\&.
.RE
.LP
.nf

.B
characters_to_binary(Data) -> Result
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Data = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
Result = binary()
.br
       | {error, binary(), RestData}
.br
       | {incomplete, binary(), binary()}
.br
RestData = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_binary(Data, unicode, unicode)\fR\&\&.
.RE
.LP
.nf

.B
characters_to_binary(Data, InEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Data = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
InEncoding = \fBencoding()\fR\&
.br
Result = binary()
.br
       | {error, binary(), RestData}
.br
       | {incomplete, binary(), binary()}
.br
RestData = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
.RE
.RE
.RS
.LP
Same as \fIcharacters_to_binary(Data, InEncoding, unicode)\fR\&\&.
.RE
.LP
.nf

.B
characters_to_binary(Data, InEncoding, OutEncoding) -> Result
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Data = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
InEncoding = OutEncoding = \fBencoding()\fR\&
.br
Result = binary()
.br
       | {error, binary(), RestData}
.br
       | {incomplete, binary(), binary()}
.br
RestData = \fBlatin1_chardata()\fR\& | \fBchardata()\fR\& | \fBexternal_chardata()\fR\&
.br
.RE
.RE
.RS
.LP
Behaves as \fB\fIcharacters_to_list/2\fR\&\fR\&, but produces a binary instead of a Unicode list\&. The \fIInEncoding\fR\& defines how input is to be interpreted if binaries are present in the \fIData\fR\&, while \fIOutEncoding\fR\& defines in what format output is to be generated\&.
.LP
The option \fIunicode\fR\& is an alias for \fIutf8\fR\&, as this is the preferred encoding for Unicode characters in binaries\&.
\fIutf16\fR\& is an alias for \fI{utf16,big}\fR\& and \fIutf32\fR\& is an alias for \fI{utf32,big}\fR\&\&. The \fIbig\fR\& and \fIlittle\fR\& atoms denote big or little endian encoding\&.
.LP
Errors and exceptions occur as in \fB\fIcharacters_to_list/2\fR\&\fR\&, but the second element in the \fIerror\fR\& or \fIincomplete\fR\& tuple will be a \fIbinary()\fR\& and not a \fIlist()\fR\&\&.
.RE
.LP
.nf

.B
encoding_to_bom(InEncoding) -> Bin
.br
.fi
.br
.RS
.LP
Types:
.RS 3
Bin = binary()
.br
.RS 2
A \fIbinary()\fR\& such that \fIbyte_size(Bin) >= 0\fR\&\&.
.RE
InEncoding = \fBencoding()\fR\&
.br
.RE
.RE
.RS
.LP
Create a UTF byte order mark (BOM) as a binary from the supplied \fIInEncoding\fR\&\&. The BOM is, if supported at all, expected to be placed first in UTF encoded files or messages\&.
.LP
The function returns \fI<<>>\fR\& for the \fIlatin1\fR\& encoding as there is no BOM for ISO-latin-1\&.
.LP
Note that the BOM for UTF-8 is seldom used, and it is not really a \fIbyte order\fR\& mark\&. There are no byte order issues with UTF-8, so the BOM is only there to differentiate UTF-8 encoding from other UTF formats\&.
.RE
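.RS
.LP
As a rough illustration of how \fIencoding_to_bom/1\fR\& can be combined with \fIcharacters_to_binary/3\fR\&, the sketch below encodes character data and prepends the matching BOM\&. The module name \fIbom_write_demo\fR\& and the function \fIencode_with_bom/2\fR\& are hypothetical names chosen for this example, and the sketch assumes that the input is a list of codepoints and/or UTF-8 binaries\&.
.LP
.nf

-module(bom_write_demo).
-export([encode_with_bom/2]).

%% Hypothetical helper: encode Chars (codepoints and/or UTF-8
%% binaries) in OutEncoding and put the BOM for that encoding first.
%% For latin1 the BOM is <<>>, so the output is left unprefixed.
encode_with_bom(Chars, OutEncoding) ->
    case unicode:characters_to_binary(Chars, unicode, OutEncoding) of
        Bin when is_binary(Bin) ->
            Bom = unicode:encoding_to_bom(OutEncoding),
            {ok, <<Bom/binary, Bin/binary>>};
        {error, _Converted, _Rest} = Error ->
            %% Invalid input, or a character OutEncoding cannot hold
            Error;
        {incomplete, _Converted, _Rest} = Incomplete ->
            %% Input ended in the middle of a character
            Incomplete
    end.
.fi
.LP
With this sketch, \fIbom_write_demo:encode_with_bom("hi", utf16)\fR\& returns \fI{ok,<<16#FE,16#FF,0,$h,0,$i>>}\fR\&, as \fIutf16\fR\& is an alias for \fI{utf16,big}\fR\&\&.
.RE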