Unicode strings
You are encouraged to solve this task according to the task description, using any language you may know.
As the world gets smaller each day, internationalization becomes more and more important. For handling multiple languages, Unicode is your best friend.
It is a very capable and remarkable tool, but also quite complex compared to older single- and double-byte character encodings.
How well prepared is your programming language for Unicode?
Task
Discuss and demonstrate its Unicode awareness and capabilities.
Some suggested topics:
- How easy is it to present Unicode strings in source code?
- Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
- How well can the language communicate with the rest of the world?
- Is it good at input/output with Unicode?
- Is it convenient to manipulate Unicode strings in the language?
- How broad/deep is the language's support for Unicode?
- What encodings (e.g. UTF-8, UTF-16, etc) can be used?
- Does it support normalization?
Note
This task is a bit unusual in that it encourages general discussion rather than clever coding.
11l
11l source code is specified to be UTF-8 encoded.
All strings in 11l are UTF-16 encoded.
80386 Assembly
- How well prepared is the programming language for Unicode? Prepared, in terms of handling: assembly language can do anything the computer can do. However, it has no Unicode facilities as part of the language.
- How easy is it to present Unicode strings in source code? Easy: they are written in hexadecimal.
- Can Unicode literals be written directly? Depends on the assembler. MASM does not allow this. All data in assembly language is created from a series of bytes; literal characters are not part of the language, but are crunched down into a byte sequence by the assembler. If the assembler can read Unicode, then you are onto a winner.
- Or be part of identifiers/keywords/etc? Depends on the assembler. Intel notation does not use Unicode identifiers or mnemonics. Assembly language converts to numeric machine code, so everything is represented as mnemonics. You can use your own mnemonics, but you need to be able to assemble them. One way to do this is to use a wrapper (which you would create) that converts your Unicode mnemonic notation to the notation the assembler expects.
- How well can the language communicate with the rest of the world? With difficulty. This is a low-level language, so all communication can be done, but you have to set up data structures and produce many lines of code for even basic tasks.
- Is it good at input/output with Unicode? Yes and no. The Unicode part is easy, but for input/output we have to set up data structures and produce many lines of code, or link to code libraries.
- Is it convenient to manipulate Unicode strings in the language? No. String manipulation requires a lot of code. We can link to code libraries, but it is not as straightforward as it would be in a higher-level language.
- How broad/deep is the language's support for Unicode? We can do anything in assembly language, so support is 100%, but nothing is convenient with respect to Unicode. Strings are just a series of bytes; treating a series of bytes as a string is down to the assembler, if it provides string support as an extension. You need to be prepared to define data structures containing the values that you want.
- What encodings (e.g. UTF-8, UTF-16, etc) can be used? All encodings are supported, but again nothing is convenient with respect to encodings, although hexadecimal notation works well in assembly language. Normalization is not supported unless you write a lot of code.
8th
- In 8th all strings are UTF-8 encoded. You can simply enter any text you like in a string.
- For special characters one may use the "\u" escape, e.g. "\u05ad"
- All the string manipulation words are UTF-8 aware, so in general the user doesn’t have to be concerned about bytes vs characters
Ada
- As of Ada 2005, all source/identifiers/keywords/literals/etc. can use up to 32-bit characters, as long as the compiler is told what encoding you are using.
- Unicode input/output has been in Ada for much longer; only Unicode source/literals are recent additions to the standard.
- Manipulation is the same as for any other strings, but uses the *_Wide_Text_* packages rather than *_Text_*.
- Supports the entire set of characters from ISO/IEC 10646:2003
- Extensive support of Unicode (including normalization, collation, etc.) and text codecs are provided by Matreshka.
ALGOL 68
How well prepared is the programming language for Unicode? - ALGOL 68 is character-set agnostic and the standard explicitly permits the use of a universal character set. The standard includes routines like "make conv" to permit opening files and devices using alternate character sets and converting character sets on the fly.
How easy is it to present Unicode strings in source code? - Easy.
Can Unicode literals be written directly - No, a REPR operator must be used to encode the string in UTF8.
Can Unicode literals be part of identifiers/keywords/etc? - Yes... ALGOL 68 is character-set agnostic and the standard explicitly permits the use of a universal character set. Implementations for English, German, Polish and Cyrillic have been created. However, ALGOL 68G supports only "Worthy" character sets.
How well can the language communicate with the rest of the world? - Acceptable.
Is it good at input/output with Unicode? - No, although the "make conv" routine is in the standard it is rarely fully implemented.
Is it convenient to manipulate Unicode strings in the language? - Yes
How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? The attached set of utility routines currently covers only UTF-8. The Unicode routines like is_digit, is_letter, is_lower_case etc. are not yet implemented.
Works with: ALGOL 68 version Revision 1 - no extensions to language used.
#!/usr/local/bin/a68g --script #
# -*- coding: utf-8 -*- #
# UNICHAR/UNICODE must be printed using REPR to convert to UTF8 #
MODE UNICHAR = STRUCT(BITS #31# bits); # assuming bits width >=31 #
MODE UNICODE = FLEX[0]UNICHAR;
OP INITUNICHAR = (BITS bits)UNICHAR: (UNICHAR out; bits OF out := #ABS# bits; out);
OP INITUNICHAR = (CHAR char)UNICHAR: (UNICHAR out; bits OF out := BIN ABS char; out);
OP INITBITS = (UNICHAR unichar)BITS: #BIN# bits OF unichar;
PROC raise value error = ([]UNION(FORMAT,BITS,STRING)argv )VOID: (
putf(stand error, argv); stop
);
MODE YIELDCHAR = PROC(CHAR)VOID; MODE GENCHAR = PROC(YIELDCHAR)VOID;
MODE YIELDUNICHAR = PROC(UNICHAR)VOID; MODE GENUNICHAR = PROC(YIELDUNICHAR)VOID;
PRIO DOCONV = 1;
# Convert a stream of UNICHAR into a stream of UTFCHAR #
OP DOCONV = (GENUNICHAR gen unichar, YIELDCHAR yield)VOID:(
BITS non ascii = NOT 2r1111111;
# FOR UNICHAR unichar IN # gen unichar( # ) DO ( #
## (UNICHAR unichar)VOID: (
BITS bits := INITBITS unichar;
IF (bits AND non ascii) = 2r0 THEN # ascii #
yield(REPR ABS bits)
ELSE
FLEX[6]CHAR buf := "?"*6; # initialise work around #
INT bytes := 0;
BITS byte lead bits = 2r10000000;
FOR ofs FROM UPB buf BY -1 WHILE
bytes +:= 1;
buf[ofs]:= REPR ABS (byte lead bits OR bits AND 2r111111);
bits := bits SHR 6;
# WHILE # bits NE 2r0 DO
SKIP
OD;
BITS first byte lead bits = BIN (ABS(2r1 SHL bytes)-2) SHL (UPB buf - bytes + 1);
buf := buf[UPB buf-bytes+1:];
buf[1] := REPR ABS(BIN ABS buf[1] OR first byte lead bits);
FOR i TO UPB buf DO yield(buf[i]) OD
FI
# OD # ))
);
# Convert a STRING into a stream of UNICHAR #
OP DOCONV = (STRING string, YIELDUNICHAR yield)VOID: (
PROC gen char = (YIELDCHAR yield)VOID:
FOR i FROM LWB string TO UPB string DO yield(string[i]) OD;
gen char DOCONV yield
);
CO Prosser/Thompson UTF8 encoding scheme
Bits Last code point Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6
7 U+007F 0xxxxxxx
11 U+07FF 110xxxxx 10xxxxxx
16 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
21 U+1FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26 U+3FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31 U+7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
END CO
# Quickly calculate the length of the UTF8 encoded string #
PROC upb utf8 = (STRING utf8 string)INT:(
INT bytes to go := 0;
INT upb := 0;
FOR i FROM LWB utf8 string TO UPB utf8 string DO
CHAR byte := utf8 string[i];
IF bytes to go = 0 THEN # start new utf char #
bytes to go :=
IF ABS byte <= ABS 2r01111111 THEN 1 # 7 bits #
ELIF ABS byte <= ABS 2r11011111 THEN 2 # 11 bits #
ELIF ABS byte <= ABS 2r11101111 THEN 3 # 16 bits #
ELIF ABS byte <= ABS 2r11110111 THEN 4 # 21 bits #
ELIF ABS byte <= ABS 2r11111011 THEN 5 # 26 bits #
ELIF ABS byte <= ABS 2r11111101 THEN 6 # 31 bits #
ELSE raise value error(("Invalid UTF-8 bytes", BIN ABS byte)); ~ FI
FI;
bytes to go -:= 1; # skip over trailing bytes #
IF bytes to go = 0 THEN upb +:= 1 FI
OD;
upb
);
# Convert a stream of CHAR into a stream of UNICHAR #
OP DOCONV = (GENCHAR gen char, YIELDUNICHAR yield)VOID: (
INT bytes to go := 0;
INT lshift;
BITS mask, out;
# FOR CHAR byte IN # gen char( # ) DO ( #
## (CHAR byte)VOID: (
INT bits := ABS byte;
IF bytes to go = 0 THEN # start new unichar #
bytes to go :=
IF bits <= ABS 2r01111111 THEN 1 # 7 bits #
ELIF bits <= ABS 2r11011111 THEN 2 # 11 bits #
ELIF bits <= ABS 2r11101111 THEN 3 # 16 bits #
ELIF bits <= ABS 2r11110111 THEN 4 # 21 bits #
ELIF bits <= ABS 2r11111011 THEN 5 # 26 bits #
ELIF bits <= ABS 2r11111101 THEN 6 # 31 bits #
ELSE raise value error(("Invalid UTF-8 bytes", BIN bits)); ~ FI;
IF bytes to go = 1 THEN
lshift := 7; mask := 2r1111111
ELSE
lshift := 7 - bytes to go; mask := BIN(ABS(2r1 SHL lshift)-1)
FI;
out := mask AND BIN bits;
lshift := 6; mask := 2r111111 # subsequently pick 6 bits at a time #
ELSE
out := (out SHL lshift) OR ( mask AND BIN bits)
FI;
bytes to go -:= 1;
IF bytes to go = 0 THEN yield(INITUNICHAR out) FI
# OD # ))
);
# Convert a string of UNICHAR into a stream of UTFCHAR #
OP DOCONV = (UNICODE unicode, YIELDCHAR yield)VOID:(
PROC gen unichar = (YIELDUNICHAR yield)VOID:
FOR i FROM LWB unicode TO UPB unicode DO yield(unicode[i]) OD;
gen unichar DOCONV yield
);
# Some convenience/shorthand U operators #
# Convert a BITS into a UNICODE char #
OP U = (BITS bits)UNICHAR:
INITUNICHAR bits;
# Convert a []BITS into a UNICODE char #
OP U = ([]BITS array bits)[]UNICHAR:(
[LWB array bits:UPB array bits]UNICHAR out;
FOR i FROM LWB array bits TO UPB array bits DO bits OF out[i]:=array bits[i] OD;
out
);
# Convert a CHAR into a UNICODE char #
OP U = (CHAR char)UNICHAR:
INITUNICHAR char;
# Convert a STRING into a UNICODE string #
OP U = (STRING utf8 string)UNICODE: (
FLEX[upb utf8(utf8 string)]UNICHAR out;
INT i := 0;
# FOR UNICHAR char IN # utf8 string DOCONV (
## (UNICHAR char)VOID:
out[i+:=1] := char
# OD #);
out
);
# Convert a UNICODE string into a UTF8 STRING #
OP REPR = (UNICODE string)STRING: (
STRING out;
# FOR CHAR char IN # string DOCONV (
## (CHAR char)VOID: (
out +:= char
# OD #));
out
);
# define the most useful OPerators on UNICODE CHARacter arrays #
# Note: LWB, UPB and slicing works as per normal #
OP + = (UNICODE a,b)UNICODE: (
[UPB a + UPB b]UNICHAR out;
out[:UPB a]:= a; out[UPB a+1:]:= b;
out
);
OP + = (UNICODE a, UNICHAR b)UNICODE: a+UNICODE(b);
OP + = (UNICHAR a, UNICODE b)UNICODE: UNICODE(a)+b;
OP + = (UNICHAR a,b)UNICODE: UNICODE(a)+b;
# Suffix a character to the end of a UNICODE string #
OP +:= = (REF UNICODE a, UNICODE b)VOID: a := a + b;
OP +:= = (REF UNICODE a, UNICHAR b)VOID: a := a + b;
# Prefix a character to the beginning of a UNICODE string #
OP +=: = (UNICODE b, REF UNICODE a)VOID: a := b + a;
OP +=: = (UNICHAR b, REF UNICODE a)VOID: a := b + a;
OP * = (UNICODE a, INT n)UNICODE: (
UNICODE out := a;
FOR i FROM 2 TO n DO out +:= a OD;
out
);
OP * = (INT n, UNICODE a)UNICODE: a * n;
OP * = (UNICHAR a, INT n)UNICODE: UNICODE(a)*n;
OP * = (INT n, UNICHAR a)UNICODE: n*UNICODE(a);
OP *:= = (REF UNICODE a, INT b)VOID: a := a * b;
# Wirthy Operators #
OP LT = (UNICHAR a,b)BOOL: ABS bits OF a LT ABS bits OF b,
LE = (UNICHAR a,b)BOOL: ABS bits OF a LE ABS bits OF b,
EQ = (UNICHAR a,b)BOOL: ABS bits OF a EQ ABS bits OF b,
NE = (UNICHAR a,b)BOOL: ABS bits OF a NE ABS bits OF b,
GE = (UNICHAR a,b)BOOL: ABS bits OF a GE ABS bits OF b,
GT = (UNICHAR a,b)BOOL: ABS bits OF a GT ABS bits OF b;
# ASCII OPerators #
OP < = (UNICHAR a,b)BOOL: a LT b,
<= = (UNICHAR a,b)BOOL: a LE b,
= = (UNICHAR a,b)BOOL: a EQ b,
/= = (UNICHAR a,b)BOOL: a NE b,
>= = (UNICHAR a,b)BOOL: a GE b,
> = (UNICHAR a,b)BOOL: a GT b;
# Non ASCII OPerators
OP ≤ = (UNICHAR a,b)BOOL: a LE b,
≠ = (UNICHAR a,b)BOOL: a NE b,
≥ = (UNICHAR a,b)BOOL: a GE b;
#
# Compare two UNICODE strings for equality #
PROC unicode cmp = (UNICODE str a,str b)INT: (
IF LWB str a > LWB str b THEN exit lt ELIF LWB str a < LWB str b THEN exit gt FI;
INT min upb = UPB(UPB str a < UPB str b | str a | str b );
FOR i FROM LWB str a TO min upb DO
UNICHAR a := str a[i], UNICHAR b := str b[i];
IF a < b THEN exit lt ELIF a > b THEN exit gt FI
OD;
IF UPB str a > UPB str b THEN exit gt ELIF UPB str a < UPB str b THEN exit lt FI;
exit eq: 0 EXIT
exit lt: -1 EXIT
exit gt: 1
);
OP LT = (UNICODE a,b)BOOL: unicode cmp(a,b)< 0,
LE = (UNICODE a,b)BOOL: unicode cmp(a,b)<=0,
EQ = (UNICODE a,b)BOOL: unicode cmp(a,b) =0,
NE = (UNICODE a,b)BOOL: unicode cmp(a,b)/=0,
GE = (UNICODE a,b)BOOL: unicode cmp(a,b)>=0,
GT = (UNICODE a,b)BOOL: unicode cmp(a,b)> 0;
# ASCII OPerators #
OP < = (UNICODE a,b)BOOL: a LT b,
<= = (UNICODE a,b)BOOL: a LE b,
= = (UNICODE a,b)BOOL: a EQ b,
/= = (UNICODE a,b)BOOL: a NE b,
>= = (UNICODE a,b)BOOL: a GE b,
> = (UNICODE a,b)BOOL: a GT b;
# Non ASCII OPerators
OP ≤ = (UNICODE a,b)BOOL: a LE b,
≠ = (UNICODE a,b)BOOL: a NE b,
≥ = (UNICODE a,b)BOOL: a GE b;
#
COMMENT - Todo: for all UNICODE and UNICHAR
Add NonASCII OPerators: ×, ×:=,
Add ASCII Operators: &, &:=, &=:
Add Wirthy OPerators: PLUSTO, PLUSAB, TIMESAB for UNICODE/UNICHAR,
Add UNICODE against UNICHAR comparison OPerators,
Add char_in_string and string_in_string PROCedures,
Add standard Unicode functions:
to_upper_case, to_lower_case, unicode_block, char_count,
get_directionality, get_numeric_value, get_type, is_defined,
is_digit, is_identifier_ignorable, is_iso_control,
is_letter, is_letter_or_digit, is_lower_case, is_mirrored,
is_space_char, is_supplementary_code_point, is_title_case,
is_unicode_identifier_part, is_unicode_identifier_start,
is_upper_case, is_valid_code_point, is_whitespace
END COMMENT
test:(
UNICHAR aircraft := U16r 2708;
printf(($"aircraft: "$, $"16r"16rdddd$, UNICODE(aircraft), $g$, " => ", REPR UNICODE(aircraft), $l$));
UNICODE chinese forty two = U16r 56db + U16r 5341 + U16r 4e8c;
printf(($"chinese forty two: "$, $g$, REPR chinese forty two, ", length string = ", UPB chinese forty two, $l$));
UNICODE poker = U "A123456789♥♦♣♠JQK";
printf(($"poker: "$, $g$, REPR poker, ", length string = ", UPB poker, $l$));
UNICODE selectric := U"×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢";
printf(($"selectric: "$, $g$, REPR selectric, $l$));
printf(($"selectric*4: "$, $g$, REPR(selectric*4), $l$));
print((
"1 < 2 is ", U"1" < U"2", ", ",
"111 < 11 is ",U"111" < U"11", ", ",
"111 < 12 is ",U"111" < U"12", ", ",
"♥ < ♦ is ", U"♥" < U"♦", ", ",
"♥Q < ♥K is ",U"♥Q" < U"♥K", " & ",
"♥J < ♥K is ",U"♥J" < U"♥K", new line
))
)
Output:
aircraft: 16r2708 => ✈
chinese forty two: 四十二, length string = +3
poker: A123456789♥♦♣♠JQK, length string = +17
selectric: ×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢
selectric*4: ×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢×÷≤≥≠¬∨∧⏨→↓↑□⌊⌈⎩⎧○⊥¢
1 < 2 is T, 111 < 11 is F, 111 < 12 is T, ♥ < ♦ is T, ♥Q < ♥K is F & ♥J < ♥K is T
Arturo
text: "你好"
print ["text:" text]
print ["length:" size text]
print ["contains string '好'?:" contains? text "好"]
print ["contains character '平'?:" contains? text '平']
Output:
text: 你好
length: 2
contains string '好'?: true
contains character '平'?: false
AutoHotkey
How easy is it to present Unicode strings in source code? - Simple, as long as the script is saved as Unicode and you’re using a Unicode build.
Can Unicode literals be written directly, or be part of identifiers/keywords/etc? - Yes, see above.
How well can the language communicate with the rest of the world? Is it good at input/output with Unicode? - It can create GUIs and send Unicode characters.
Is it convenient to manipulate Unicode strings in the language? - Yes: they act like any other string, apart from low-level functions such as NumPut which deal with bytes.
How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - UTF-8 is most often used. StrPut/StrGet and FileRead/FileAppend allow Unicode in AutoHotkey_L (the current build).
AWK
How well prepared is the programming language for Unicode? - Not really prepared. AWK is a tool for manipulating ASCII input.
How easy is it to present Unicode strings in source code? - Easy. They can be represented in hexadecimal.
Can Unicode literals be written directly - No
or be part of identifiers/keywords/etc? - No
How well can the language communicate with the rest of the world? - The language is not good at communications, but can utilize external tools.
Is it good at input/output with Unicode? - No
Is it convenient to manipulate Unicode strings in the language? - No
How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? - There is no built-in support for Unicode, but all encodings can be represented through hexadecimal strings.
BASIC
BBC BASIC
- How easy is it to present Unicode strings in source code?
As of version 5.94a, the BBC BASIC for Windows source code editor supports Unicode text in string literals and comments (remarks). This includes bi-directional text and Arabic ligatures.
- Can Unicode literals be written directly?
If a suitable keyboard and/or Input Method Editor is available Unicode text may be entered directly into the editor.
- or be part of identifiers/keywords/etc?
Identifiers (variable names) and keywords cannot use Unicode characters.
- How well can the language communicate with the rest of the world? Is it good at input/output with Unicode?
Output of Unicode text to both the screen and the printer is supported, but must be enabled using a VDU 23,22 command since the default output mode is ANSI. The text printing direction can be set to right-to-left for languages such as Hebrew and Arabic. Run-time support for Arabic ligatures is not built-in, but is provided by means of the FNarabic() function. No specific support for Unicode input at run time is provided, although this is possible by means of Windows controls.
- Is it convenient to manipulate Unicode strings in the language?
The supported character encoding is UTF-8 which, being a byte stream, is compatible with most of the language’s string manipulation functions. However, the parameters in functions like LEFT$ and MID$ refer to byte counts rather than character counts.
Code example: (whether this listing displays correctly will depend on your browser)
VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*FONT Times New Roman, 20
PRINT "Arabic:"
arabic1$ = "هنا مثال يمكنك من الكتابة من اليمين"
arabic2$ = "الى اليسار باللغة العربية"
VDU 23,16,2;0;0;0;13 : REM Select right-to-left printing
PRINT FNarabic(arabic1$) ' FNarabic(arabic2$)
VDU 23,16,0;0;0;0;13 : REM Select left-to-right printing
PRINT '"Hebrew:"
hebrew$ = "זוהי הדגמה של כתיבת טקסט בעברית מימין לשמאל"
VDU 23,16,2;0;0;0;13 : REM Select right-to-left printing
PRINT hebrew$
VDU 23,16,0;0;0;0;13 : REM Select left-to-right printing
END
REM!Eject
DEF FNarabic(A$)
LOCAL A%, B%, L%, O%, P%, U%, B$
A$ += CHR$0
FOR A% = !^A$ TO !^A$+LENA$-1
IF ?A%<&80 OR ?A%>=&C0 THEN
L% = O% : O% = P% : P% = U%
U% = ((?A% AND &3F) << 6) + (A%?1 AND &3F)
IF ?A%<&80 U% = 0
CASE TRUE OF
WHEN U%=&60C OR U%=&61F: U% = 0
WHEN U%<&622:
WHEN U%<&626: U% = &01+2*(U%-&622)
WHEN U%<&628: U% = &09+4*(U%-&626)
WHEN U%<&62A: U% = &0F+4*(U%-&628)
WHEN U%<&62F: U% = &15+4*(U%-&62A)
WHEN U%<&633: U% = &29+2*(U%-&62F)
WHEN U%<&63B: U% = &31+4*(U%-&633)
WHEN U%<&641:
WHEN U%<&648: U% = &51+4*(U%-&641)
WHEN U%<&64B: U% = &6D+2*(U%-&648)
ENDCASE
IF P% IF P%<&80 THEN
B% = P%
IF O%=&5D IF P%<&5 B% += &74
IF O%=&5D IF P%=&7 B% += &72
IF O%=&5D IF P%=&D B% += &6E
IF B%>P% B$=LEFT$(B$,LENB$-3) : O% = L%
IF U% IF P%>7 IF P%<>&D IF P%<>&13 IF P%<>&29 IF P%<>&2B IF P%<>&2D IF P%<>&2F IF P%<>&6D IF P%<>&6F B% += 2
IF O% IF O%>7 IF O%<>&D IF O%<>&13 IF O%<>&29 IF O%<>&2B IF O%<>&2D IF O%<>&2F IF O%<>&6D IF O%<>&6F B% += 1
B$ = LEFT$(LEFT$(B$))+CHR$&EF+CHR$(&BA+(B%>>6))+CHR$(&80+(B%AND&3F))
ENDIF
ENDIF
B$ += CHR$?A%
NEXT
= LEFT$(B$)
Bracmat
- How easy is it to present Unicode strings in source code? Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
The few keywords Bracmat knows are all ASCII. Identifiers and values can consist of all non-zero bytes, so UTF-8 encoded strings are allowed.
- How well can the language communicate with the rest of the world? Is it good at input/output with Unicode?
Input and output of UTF-8 encoded data and source code is easy. No special measures have to be taken. On reading HTML and JSON, hexcodes and HTML entities are converted to their UTF-8 equivalents.
- Is it convenient to manipulate Unicode strings in the language?
Yes, apart from counting characters, since UTF-8 is a variable-width encoding. When converting a string to lower or upper case, UTF-8 is assumed. If a string is not valid UTF-8, ISO-8859-1 (Latin-1) is assumed.
- How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used?
Most UTF-16 and UTF-32 strings contain null bytes in non-final positions and therefore cannot be handled easily.
C
C is not the most Unicode-friendly language, to put it mildly. Generally, using Unicode in C requires dealing with locales, managing data types carefully, and checking various aspects of your compiler. Directly embedding Unicode strings in your C source might be a bad idea, too; it’s safer to use their hex values. Here’s a short example of the simplest string handling: printing one.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
/* wchar_t is the standard type for wide chars; what it is internally
* depends on the compiler.
*/
wchar_t poker[] = L"♥♦♣♠";
wchar_t four_two[] = L"\x56db\x5341\x4e8c";
int main() {
/* Set the locale to alert C's multibyte output routines */
if (!setlocale(LC_CTYPE, "")) {
fprintf(stderr, "Locale failure, check your env vars\n");
return 1;
}
#ifdef __STDC_ISO_10646__
/* C99 compilers should understand these */
printf("%lc\n", 0x2708); /* ✈ */
printf("%ls\n", poker); /* ♥♦♣♠ */
printf("%ls\n", four_two); /* 四十二 */
#else
/* oh well */
printf("airplane\n");
printf("club diamond club spade\n");
printf("for ty two\n");
#endif
return 0;
}
C#
In C#, the native string representation is determined by the Common Language Runtime. In the CLR, the string data type is a sequence of char, and the char data type represents a UTF-16 code unit. The native string representation is essentially UTF-16, except that strings may contain ill-formed UTF-16, i.e. sequences with incorrectly used high and low surrogates.
C# string literals support the \u escape sequence for 4-hex-digit code units, \U for 8-hex-digit code points, and UTF-8-encoded source files are also supported, so "Unicode strings" can be included in the source code as-is.
C# benefits from the extensive support for Unicode in the .NET Base Class Library, including
- Various UTF encodings
- String normalization
- Unicode character database subset
- Breaking strings into text elements
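A brief sketch of a couple of these facilities (string.Normalize and System.Text.Encoding are standard BCL APIs; the composed literal is assumed to be in NFC):
using System;
using System.Text;
class UnicodeDemo
{
    static void Main()
    {
        string composed = "voil\u00e0";    // precomposed 'à'
        string decomposed = "voila\u0300"; // 'a' + U+0301? no: U+0300, combining grave accent
        Console.WriteLine(composed == decomposed);             // False: different code units
        Console.WriteLine(composed == decomposed.Normalize()); // True: Normalize() defaults to NFC
        byte[] utf8 = Encoding.UTF8.GetBytes(composed);        // one of the UTF encodings
        Console.WriteLine(utf8.Length);                        // 6 bytes for 5 characters
    }
}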
Common Lisp
Most implementations use Unicode strings by default. Unicode characters can be used in variable and function names. Tested in SBCL 1.2.7 and ECL 13.5.1:
(defvar ♥♦♣♠ "♥♦♣♠")
(defun ✈ () "a plane unicode function")
D
import std.stdio;
import std.uni; // standard package for normalization, composition/decomposition, etc..
import std.utf; // standard package for decoding/encoding, etc...
void main() {
// normal identifiers are allowed
int a;
// Unicode characters are allowed in identifiers
int δ;
char c; // UTF-8 code unit (a code point may occupy 1 to 4 of these)
wchar w; // UTF-16 code unit (a code point may occupy 1 or 2 of these)
dchar d; // UTF-32 code unit (always a whole code point)
writeln("some text"); // strings are UTF-8 by default
writeln("some text"c); // optional suffix for UTF-8
writeln("こんにちは"c); // Unicode characters are just fine (stored in the string type)
writeln("Здравствуйте"w); // UTF-16 strings are also available (stored in the wstring type)
writeln("שלום"d); // and UTF-32 strings (stored in the dstring type)
// escape sequences like those defined in C are also allowed inside strings and characters
}
DuckDB
DuckDB quacks Unicode, with a strong UTF-8 accent.
For example, DuckDB’s STRING type represents Unicode strings, and the function read_text() can read any UTF-8-encoded file.
Support for other encodings is very limited: as of November 2024, read_csv() is also able to read files with UTF-16 and Latin-1 encodings.
How broad/deep is the language support for Unicode?
There is comprehensive support for Unicode, including conversions to and from DuckDB’s various data types, which include bit strings, blobs and JSON.
For example:
D select '\x00'::BLOB as blob, chr(0), blob = chr(0);
┌──────┬─────────┬─────────────────┐
│ blob │ chr(0) │ (blob = chr(0)) │
│ blob │ varchar │ boolean │
├──────┼─────────┼─────────────────┤
│ \x00 │ \0 │ true │
└──────┴─────────┴─────────────────┘
D select chr(0) as nul, to_json(nul) as j, j ->> '$';
┌─────────┬──────────┬─────────────┐
│ nul │ j │ (j ->> '$') │
│ varchar │ json │ varchar │
├─────────┼──────────┼─────────────┤
│ \0 │ "\u0000" │ \0 │
└─────────┴──────────┴─────────────┘
Regular expressions, collation and normalization
DuckDB uses the RE2 Regular Expression engine, and therefore supports Unicode character classes and named Unicode properties.
The function nfc_normalize(s) converts the string, s, to its Unicode NFC normalized form. The NFC collation performs NFC-normalized comparisons.
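For example (an illustrative query: chr(769) produces U+0301, the combining acute accent):
D select nfc_normalize('e' || chr(769)) = 'é' as same; -- true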
String literals and identifiers
There are various considerations with respect to the string literals and identifiers that can appear within a DuckDB program. In brief:
- string literals must be specified either by enclosing a suitable Unicode string in single-quotes, or by using the "double-dollar" convention, which allows matched opening and closing delimiters of the form ‘${identifier}$’ where `{identifier}` is a $-free identifier;
- double-quotes are used to delimit so-called "quoted identifiers";
- to include a single-quote within a string literal, it must be duplicated;
- to include a double-quote within a quoted identifier, it must be duplicated.
See Idiomatically_determine_all_the_characters_that_can_be_used_for_symbols for a description of DuckDB’s unquoted identifiers.
Certain control characters (ASCII 8, 9, 10, 12 and 13) can be included within a string literal if the prefix E (or e) is used:
D select e'a\nb' as escaped, length(escaped), 'a\nb' as unescaped, length(unescaped);
┌─────────┬─────────────────┬───────────┬───────────────────┐
│ escaped │ length(escaped) │ unescaped │ length(unescaped) │
│ varchar │ int64 │ varchar │ int64 │
├─────────┼─────────────────┼───────────┼───────────────────┤
│ a\nb │ 3 │ a\nb │ 4 │
└─────────┴─────────────────┴───────────┴───────────────────┘
Notice that in the output shown above, the representations of chr(10) and the two-character string ‘\n’ are the same. This is an artifact of the "duckbox" mode, as can be seen for example by using `.mode line`.
DWScript
Source code is expected to be Unicode (typically UTF-8 or UTF-16); characters above 127 (non-ASCII) are not part of the language itself, but are accepted literally as string characters or as identifier characters.
Characters in a string can also be specified explicitly by giving the Unicode code point as a # followed by a decimal number, or a #$ followed by a hexadecimal code point (if the code point is outside the BMP, it results in two UTF-16 words). Unlike some other Pascal variants (such as Delphi), explicit character codes are always and consistently interpreted as Unicode code points.
Strings are UTF-16.
EasyLang
In EasyLang, all strings are UTF-8 strings. All string functions work with UTF-8 characters.
s$ = "你好 😀"
print len s$
for c$ in strchars s$
print c$ & " " & strcode c$
.
print strpos s$ "好"
print strchar 128512
print s$
flag$ = "🇮🇹"
print flag$
print len flag$
for c$ in strchars flag$
print c$ & " " & strcode c$
.
Output:
4
你 20320
好 22909
32
😀 128512
2
😀
你好 😀
🇮🇹
2
🇮 127470
🇹 127481
Elena
ELENA supports both UTF-8 and UTF-16 strings; Unicode identifiers are also supported:
ELENA 6.x:
public Program()
{
var 四十二 := "♥♦♣♠"; // UTF8 string
var строка := "Привет"w; // UTF16 string
Console.writeLine(строка);
Console.writeLine(四十二);
}
Output:
Привет
♥♦♣♠
Elixir
Elixir has good Unicode support in strings. Its String module is compliant with version 6.3.0 of the Unicode Standard. Internally, strings are encoded in UTF-8. As source files are also typically Unicode-encoded, string literals can be written either directly or via escape sequences. However, non-ASCII Unicode identifiers (variables, functions, ...) are not allowed.
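A small illustration (assuming a recent Elixir, where String.normalize/2 is available):
s = "héllo ♥"
IO.puts(String.length(s))                          # 7 characters
IO.puts(byte_size(s))                              # 10 bytes in UTF-8
IO.puts(String.normalize("e\u0301", :nfc) == "é")  # true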
Erlang
The simplified explanation is that Erlang allows Unicode in comments/data/file names/etc, but not in function or variable names.
FreeBASIC
FreeBASIC has decent support for Unicode, although not as complete as some other languages.
- How easy is it to present Unicode strings in source code?
FreeBASIC can handle ASCII files with Unicode escape sequences (\u), and can also parse source (.bas) or header (.bi) files encoded in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These files can be freely mixed with other source or header files in the same project.
- Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
String literals can be written in the original non-Latin alphabet, you just need to use a text editor that supports some of the mentioned Unicode formats.
- How well can the language communicate with the rest of the world?
FreeBASIC can communicate with other programs and systems that use Unicode. However, manipulating Unicode strings can be more complicated because many string functions become more complex.
- Is it good at input/output with Unicode?
The Open function supports UTF-8, UTF-16LE and UTF-32LE files with the encoding specifier.
The Input# and Line Input# functions, as well as Print# and Write#, can be used normally, and any conversion between Unicode and ASCII is done automatically where necessary. The Print function also supports Unicode output.
- Is it convenient to manipulate Unicode strings in the language?
Although FreeBASIC supports wide characters in a string, it does not support dynamic strings. However, there are some libraries included with FreeBASIC to decode UTF-8 to wstring.
- How broad/deep is the language's support for Unicode?
Unicode support in FreeBASIC is quite extensive, but not as deep as in other programming languages. It can handle most basic Unicode tasks, but more advanced tasks may require additional libraries.
- What encodings (e.g. UTF-8, UTF-16, etc) can be used?
FreeBASIC supports several encodings, including UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
- Does it support normalization?
FreeBASIC does not have built-in support for Unicode normalization. However, it is possible to use external libraries to perform normalization.
For example,
' Define a Unicode string
Dim unicodeString As String
unicodeString = "こんにちは, 世界! 🌍"
' Print the Unicode string
Print unicodeString
' Wait for the user to press a key before closing the console
Sleep
FutureBasic
Like many macOS languages, Unicode support is native to FutureBasic. UTF8 and UTF16 strings, and Unicode identifiers are all supported.
CFString: This is the fundamental data type for representing text in FutureBasic (FB). It’s an immutable object, meaning its contents can’t be changed after creation. All CFStrings inherently store their data as Unicode characters. This means you don’t have to explicitly specify an encoding for basic string operations. A CFString variable is dimensioned and assigned a value like this:
CFStringRef testString = @"This is a test of a toolbox emoji — 🧰 — in a FutureBasic CFString."
CFMutableString: This is the mutable counterpart to CFString. It allows you to modify the string’s content, such as appending text, inserting characters, or deleting parts of the string. Like CFString, it works with Unicode by default.
Other FB Unicode string data types — CFAttributedString and CFMutableAttributedString — support a wide variety of in-line attributions allowing for exquisite text formatting.
FB also has a massive variety of functions designed to interact with these string data types.
The sample code below shows how Unicode emojis can be used in FB strings using either their Unicode hex value, or by simply adding their representative icon to the string.
_window = 1
begin enum output 1
_scrollView
_textView
end enum
void local fn BuildWindow
CGRect r = fn CGRectMake( 0, 0, 550, 400 )
window _window, @"Unicode Characters in FutureBasic", r
scrollview _scrollView, r
ViewSetAutoresizingMask( _scrollView, NSViewWidthSizable + NSViewHeightSizable )
textview _textView,, _scrollView
CFMutableStringRef testStr = fn MutableStringWithCapacity( 0 )
MutableStringAppendString( testStr, @"\U0001F34C" )
MutableStringAppendString( testStr, @"\U0001F444" )
MutableStringAppendString( testStr, @"\U0001F41E" )
MutableStringAppendString( testStr, @"\U0001F37A" )
MutableStringAppendString( testStr, @"\U0001F41D" )
MutableStringAppendString( testStr, @"\U0001F48A" )
MutableStringAppendString( testStr, @"☎️" )
MutableStringAppendString( testStr, @"\U0001F4A1" )
MutableStringAppendString( testStr, @"🍎" )
MutableStringAppendString( testStr, @"⚽️" )
MutableStringAppendString( testStr, @"🚁" )
MutableStringAppendString( testStr, @"⏰" )
MutableStringAppendString( testStr, @"\u2699\uFE0F" )
MutableStringAppendString( testStr, @"\U0001F69C" )
MutableStringAppendString( testStr, @"\U0001F333" )
MutableStringAppendString( testStr, @"\U0001F413" )
MutableStringAppendString( testStr, @"\U0001F6A3" )
MutableStringAppendString( testStr, @"\U0001F6C1" )
TextSetString( _textView, testStr )
CFMutableAttributedStringRef storage = (CFMutableAttributedStringRef)fn TextViewTextStorage( _textView )
MutableAttributedStringSetFontWithName( storage, @"Menlo", 110 )
WindowMakeFirstResponder( _window, _textView )
end fn
editmenu 1
fn BuildWindow
HandleEvents
Go
Go source code is specified to be UTF-8 encoded. This directly allows any Unicode code point in character and string literals. Unicode is also allowed in identifiers like variables and field names, with some restrictions. The string data type represents a read-only sequence of bytes, conventionally but not necessarily representing UTF-8-encoded text. A number of built-in features interpret strings as UTF-8. For example,
var i int
var u rune
for i, u = range "voilà" {
fmt.Println(i, u)
}
Output:
0 118
1 111
2 105
3 108
4 224
224 being the Unicode code point for the à character. Note rune is predefined to be a type that can hold a Unicode code point.
In contrast,
w := "voilà"
for i := 0; i < len(w); i++ {
fmt.Println(i, w[i])
}
Output:
0 118
1 111
2 105
3 108
4 195
5 160
bytes 4 and 5 showing the UTF-8 encoding of à. The expression w[i] in this case has the type of byte rather than rune. A Go blog post covers this in more detail: Strings, bytes, runes and characters in Go.
The heavily used standard packages bytes and strings both have functions for working with strings both as UTF-8 and as encoding-unspecified bytes. The standard packages unicode, unicode/utf8, and unicode/utf16 have additional functions.
Normalization support is available in the sub-repository package golang.org/x/text/unicode/norm. It contains a number of string manipulation functions that work with the four normalization forms NFC, NFD, NFKC, and NFKD. The normalization form type in this package implements the io.Reader and io.WriteCloser interfaces to enable on-the-fly normalization during I/O. A Go blog post covers this in more detail: Text normalization in Go.
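A short sketch of NFC normalization with this package (assuming golang.org/x/text has been fetched; the "voilà" literal is assumed to be precomposed):
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	decomposed := "voila\u0300" // 'a' followed by U+0300, a combining grave accent
	composed := norm.NFC.String(decomposed)
	fmt.Println(decomposed == "voilà") // false: different byte sequences
	fmt.Println(composed == "voilà")   // true after NFC normalization
	fmt.Println(len(decomposed), len(composed)) // 7 6 (lengths in bytes)
}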
There is no built-in or automatic handling of byte order marks (which are at best unnecessary with UTF-8).
Haskell
Unicode is built into Haskell, so it can be used in strings and function names.
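A minimal sketch (assuming GHC with UTF-8 source; Char is a Unicode code point, and identifiers may use Unicode letters):
module Main where

import Data.Char (ord)

grüße :: String -- a Unicode identifier
grüße = "♥♦♣♠ 四十二"

main :: IO ()
main = do
  putStrLn grüße
  print (length grüße) -- 8: counts code points, not bytes
  print (map ord "✈")  -- [9992], i.e. U+2708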
J
Unicode characters can be represented directly in J strings:
'♥♦♣♠'
♥♦♣♠
By default, they are represented as utf-8:
#'♥♦♣♠'
12
The above string requires 12 literal elements to represent the four characters using utf-8.
However, they can be represented as utf-16 instead:
7 u:'♥♦♣♠'
♥♦♣♠
#7 u:'♥♦♣♠'
4
The above string requires 4 literal elements to represent the four characters using utf-16. (7 u: string produces a utf-16 result.)
These forms are not treated as equivalent:
'♥♦♣♠' -: 7 u:'♥♦♣♠'
0
The utf-8 string of literals is a different string of literals from the utf-16 string.
unless the character literals themselves are equivalent:
'abcd'-:7 u:'abcd'
1
Here, we were dealing with ascii characters, so the four literals needed to represent the characters using utf-8 matched the four literals needed to represent the characters using utf-16.
When this is likely to be an issue, you should enforce a single representation. For example:
'♥♦♣♠' -:&(7&u:) 7 u:'♥♦♣♠'
1
'♥♦♣♠' -:&(8&u:) 7 u:'♥♦♣♠'
1
Here, we see that even when comparing non-ascii characters, we can coerce both arguments to be utf-8 or utf-16 or utf-32 and the resulting literal strings would match. (8 u: string produces a utf-8 result.)
Output uses characters in whatever format they happen to be in. Character input assumes 8 bit characters but places no additional interpretation on them.
See also: http://www.jsoftware.com/help/dictionary/duco.htm
Non-ASCII Unicode characters are not legal tokens or names within current versions of J.
Java
How easy is it to present Unicode strings in source code?
Very easy. It is not specified what encoding the source code must be in, as long as it can be interpreted into a stream of UTF-16 characters. Most compilers probably take UTF-8.
In any case, even using only ASCII characters, any UTF-16 character can be embedded into the source code by using a Unicode escape \uxxxx (where xxxx is the hex code of the character), which is processed before any other steps by the compiler. This means that it is possible to write an entire program out of Unicode escapes. This also means that a Unicode escape could mess up the language syntax, if it happens to be the escape of a whitespace or quote character (please don’t do that).
Can Unicode literals be written directly, or be part of identifiers/keywords/etc?
UTF-16 characters can be written directly in character and string literals and comments (there is no difference between "directly" and using Unicode escapes, since they are processed at the first step). UTF-16 characters can be part of identifiers (either directly or through a Unicode escape).
How well can the language communicate with the rest of the world? Is it good at input/output with Unicode?
Yes
Is it convenient to manipulate Unicode strings in the language?
The String class in Java is basically a sequence of char elements, representing the string encoded in UTF-16. char is a 16-bit type, and thus one char does not necessarily correspond to one Unicode character, since supplementary characters can have code points greater than U+FFFF. However, if your string only consists of characters from the Basic Multilingual Plane (which is most of the time), then one char does correspond to one Unicode character.
Starting in J2SE 5 (1.5), Java has fairly convenient methods for dealing with true Unicode characters, even supplementary ones. Many methods that deal with characters have versions for both char and int. For example, String has the codePointAt method, analogous to the charAt method.
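For example, a small sketch of char versus code point handling using a supplementary character (U+1D518, written here with its surrogate pair):
public class CodePoints {
    public static void main(String[] args) {
        String s = "a\uD835\uDD18b"; // "a𝔘b"; U+1D518 needs two chars
        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 Unicode characters
        System.out.println((int) s.charAt(1));               // 55349: just the high surrogate
        System.out.println(s.codePointAt(1));                // 120088: the full code point
    }
}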
How broad/deep is the language's support for Unicode? What encodings (e.g. UTF-8, UTF-16, etc) can be used? Normalization?
jq
jq’s data types are JSON’s data types. In particular, jq strings are UTF-8 strings. See json.org for further details.
jq identifiers, however, are restricted to a subset of ASCII.
The jq command does have several options in support of flexibility when "communicating with the world":
--raw-input | -R :: each line of input is converted to a JSON string;
--ascii-output | -a :: every non-ASCII character that would otherwise be sent to output is translated to an equivalent ASCII escape sequence;
--raw-output | -r :: output strings as raw strings, e.g. "a\nb" is output as:
a
b
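For example, using --ascii-output from a shell (the à below is U+00E0):
$ echo '"voilà"' | jq -a .
"voil\u00e0"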
Julia
Non-ASCII strings in Julia are UTF-8-encoded by default, and Unicode identifiers are also supported:
julia> 四十二 = "voilà";
julia> println(四十二)
voilà
And you can also specify Unicode characters by ordinal:
julia> println("\u2708")
✈
Kotlin
In the version of Kotlin targeting the JVM, Kotlin strings are mapped to Java strings, so everything that has already been said in the Java entry for this task applies equally to Kotlin.
I would only add that normalization of strings is supported in both languages via the java.text.Normalizer class, as sketched below.
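A minimal sketch of that normalization support (the "voilà" literal is assumed to be precomposed):
import java.text.Normalizer

fun main(args: Array<String>) {
    val decomposed = "voila\u0300" // 'a' + U+0300 combining grave accent
    val composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC)
    println(decomposed == "voilà") // false: different code unit sequences
    println(composed == "voilà")   // true after NFC normalization
}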
Here’s a simple example of using both Unicode identifiers and Unicode strings in Kotlin:
// version 1.1.2
fun main(args: Array<String>) {
val åäö = "as⃝df̅ ♥♦♣♠ 頰"
println(åäö)
}
Output:
as⃝df̅ ♥♦♣♠ 頰
langur
Source code in langur is pure UTF-8 without a BOM and without surrogate codes.
Identifiers are ASCII only. Comments and string literals may use Unicode.
Indexing on a string indexes by code point. The index may be a single number, a range, or a list of such things.
Conversion between code point numbers, graphemes, and strings can be done with the cp2s(), s2cp(), and s2gc() functions. Conversion between UTF-8 byte lists and langur strings can be done with b2s() and s2b() functions.
The len() function returns the number of code points in a string.
Normalization can be handled with the functions nfc(), nfd(), nfkc(), and nfkd().
Using a for of loop over a string gives the code point indices, and using a for in loop over a string gives the code point numbers.
Interpolation modifiers allow limiting a string by code points or by graphemes.
See langurlang.org for more details.
Lasso
All string data in Lasso is processed as double-byte Unicode characters. Any input is assumed to be UTF-8 unless otherwise specified. All output is UTF-8 unless a different encoding is specified. You can specify Unicode characters by ordinal.
Variable names cannot contain anything but ASCII.
local(unicode = '♥♦♣♠')
#unicode -> append('\u9830')
#unicode
'<br />'
#unicode -> get (2)
'<br />'
#unicode -> get (4) -> integer
Output:
♥♦♣♠頰
♦
9824
LFE
As with Erlang, LFE allows Unicode in comments/data/file names/etc, but not in function or variable names.
Here is an example of UTF-8 encoding:
> (set encoded (binary ("åäö ð" utf8)))
#B(195 165 195 164 195 182 32 195 176)
Display it in native Erlang format:
> (io:format "~tp~n" (list encoded))
<<"åäö ð"/utf8>>
An example of UTF-8 decoding:
> (unicode:characters_to_list encoded 'utf8)
"åäö ð"
Lingo
In recent versions (since v11.5) of Lingo’s only implementation, "Director", UTF-8 is the default encoding for both scripts and strings. Therefore, Unicode string literals can be specified directly in the code.