Unicode: Difference between revisions

From NARS2000
(link to ⎕AV page)
 
(2 intermediate revisions by 2 users not shown)
Line 2: Line 2:
All character arrays and names (variable, function, and operator) are stored as one 16-bit word per character.  This fixed length encoding is called UCS-2 and is a subset of a more general encoding called [http://en.wikipedia.org/wiki/UTF-16 UTF-16].  The latter is a variable length encoding using one or two 16-bit words per character.
All character arrays and names (variable, function, and operator) are stored as one 16-bit word per character.  This fixed length encoding is called UCS-2 and is a subset of a more general encoding called [http://en.wikipedia.org/wiki/UTF-16 UTF-16].  The latter is a variable length encoding using one or two 16-bit words per character.


UCS-2 represents characters from U+0000 through U+FFFF; UTF-16 represents characters from U+0000 through U+10FFFF — both encodings exclude the surrogate pair range of U+D800 through U+DFFF (2,048 characters).
UCS-2 represents characters in the range U+0000 through U+FFFF; UTF-16 represents characters in the range U+0000 through U+10FFFF.  However, because of the way UTF-16 represents characters above U+FFFF, the range of code points for both encodings excludes U+D800 through U+DFFF (2,048 characters).


Thus, UCS-2 encodes 63,488 (=65,536 - 2,048) different characters, and UTF-16 encodes 1,112,064 (=1,114,112 - 2,048) different characters.  UTF-16 is needed mostly for Far Eastern languages.
Thus, UCS-2 encodes 63,488 (=65,536 - 2,048) different characters, and UTF-16 encodes 1,112,064 (=1,114,112 - 2,048) different characters.


==Alphabet for Names==
==Alphabet for Names==
The alphabet used for names consists of an initial character followed by one or more subsequent characters.
The alphabet used for names consists of an initial character followed by one or more subsequent characters.


* A leading character is one of <apll>a</apll> through <apll>z</apll>, <apll>A</apll> through <apll>Z</apll>, delta (<apll>∆</apll>), or delta underbar (<apll>⍙</apll>).
* An initial character is one of <apll>a</apll> through <apll>z</apll>, <apll>A</apll> through <apll>Z</apll>, delta (<apll>∆</apll>), or delta underbar (<apll>⍙</apll>).
* A subsequent character is a leading character, <apll>0</apll> through <apll>9</apll>, overbar (<apll>{overbar}</apll>), or underbar (<apll>_</apll>).
* A subsequent character is a leading character, <apll>0</apll> through <apll>9</apll>, overbar (<apll>{overbar}</apll>), or underbar (<apll>_</apll>).


One other set of characters, the underbarred alphabet (<apll>{A_}</apll> through <apll>{Z_}</apll>), may be pasted into a session or function editor window. There is no way to enter these characters directly from the keyboard.  Depending on a User Option setting, when these characters are pasted into a session or function editor window, they are treated as themselves or are mapped to the lowercase alphabet.  When used in a name, they are always equivalent to the corresponding lowercase letter, although they display as themselves.  Because of this latter translation, they may be used as either a leading or subsequent character in a name.  Thus the names <apll>{A_}l{P_}h{A_}</apll> and <apll>alpha</apll> display differently, but they both refer to the same object; a value assigned to one is reflected in the other.
One other set of characters, the underbarred alphabet (<apll>{A_}</apll> through <apll>{Z_}</apll>), may be pasted into a session or function editor window. There is no way to enter these characters directly from the keyboard.  Depending on a User Option setting, when these characters are pasted into a session or function editor window, they are treated as themselves or are mapped to the lowercase alphabet.  When used in a name, they are always equivalent to the corresponding lowercase letter, although they display as themselves.  Because of this latter translation, they may be used as either a leading or subsequent character in a name.  Thus the names <apll>{A_}l{P_}h{A_}</apll> and <apll>alpha</apll> display differently, but they both refer to the same object; a value assigned to one is reflected in the other.
'''See also:''' Quad AV = niladic system function '''⎕AV''' - '''[[System_Function_AV|Atomic Vector]]''' page.

Latest revision as of 21:15, 23 January 2015

Character Array and Name Storage

All character arrays and names (variable, function, and operator) are stored as one 16-bit word per character. This fixed length encoding is called UCS-2 and is a subset of a more general encoding called UTF-16. The latter is a variable length encoding using one or two 16-bit words per character.

UCS-2 represents characters in the range U+0000 through U+FFFF; UTF-16 represents characters in the range U+0000 through U+10FFFF. However, because of the way UTF-16 represents characters above U+FFFF, the range of code points for both encodings excludes U+D800 through U+DFFF (2,048 characters).
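As a minimal sketch of why U+D800 through U+DFFF is reserved, the following Python function splits a code point into the one or two 16-bit words that UTF-16 uses. The function name `utf16_code_units` is hypothetical and is not part of NARS2000; the arithmetic follows the standard UTF-16 surrogate scheme.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit words UTF-16 uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range U+0000 through U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("U+D800 through U+DFFF is reserved for surrogates")
    if code_point <= 0xFFFF:
        # One word: the same value UCS-2 would store.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into two 10-bit halves and
    # place each half in the reserved surrogate range.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate, U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate, U+DC00..U+DFFF
    return [high, low]


print([hex(w) for w in utf16_code_units(0x2206)])   # delta: one word
print([hex(w) for w in utf16_code_units(0x1F600)])  # above U+FFFF: two words
```

A character such as delta (U+2206) fits in a single word, so it is stored identically under UCS-2 and UTF-16; only code points above U+FFFF need the two-word form.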

Thus, UCS-2 encodes 63,488 (=65,536 - 2,048) different characters, and UTF-16 encodes 1,112,064 (=1,114,112 - 2,048) different characters.
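The counts above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are chosen here and do not come from NARS2000.

```python
# Size of the excluded surrogate range U+D800 through U+DFFF.
surrogates = 0xDFFF - 0xD800 + 1

# Total code points addressable by each encoding, before the exclusion.
ucs2_total = 0xFFFF + 1       # U+0000 through U+FFFF
utf16_total = 0x10FFFF + 1    # U+0000 through U+10FFFF

ucs2_characters = ucs2_total - surrogates
utf16_characters = utf16_total - surrogates

print(surrogates, ucs2_characters, utf16_characters)
```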

Alphabet for Names

The alphabet used for names consists of an initial character followed by zero or more subsequent characters.

  • An initial character is one of a through z, A through Z, delta (∆), or delta underbar (⍙).
  • A subsequent character is an initial character, 0 through 9, overbar (¯), or underbar (_).
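The two rules above can be sketched as a small Python check. This is an illustration, not NARS2000's own tokenizer: the function name `is_valid_name` is hypothetical, the code points assumed are U+2206 for delta, U+2359 for delta underbar, and U+00AF for overbar, and the sketch assumes a one-character name is acceptable. It does not handle the underbarred alphabet or system names such as ⎕AV.

```python
import string

# Characters allowed in the first position of a name.
INITIAL = set(string.ascii_letters) | {"\u2206", "\u2359"}  # delta, delta underbar

# Characters allowed after the first position.
SUBSEQUENT = INITIAL | set(string.digits) | {"\u00AF", "_"}  # overbar, underbar


def is_valid_name(name):
    """Return True if name follows the initial/subsequent character rules."""
    if not name:
        return False
    return name[0] in INITIAL and all(ch in SUBSEQUENT for ch in name[1:])


for candidate in ["alpha", "\u2206x_1", "1abc", "_a"]:
    print(repr(candidate), is_valid_name(candidate))
```

A digit or underbar is accepted anywhere except the first position, which is why the last two candidates are rejected.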

One other set of characters, the underbarred alphabet ({A_} through {Z_}), may be pasted into a session or function editor window. There is no way to enter these characters directly from the keyboard. Depending on a User Option setting, when these characters are pasted into a session or function editor window, they are treated as themselves or are mapped to the lowercase alphabet. When used in a name, they are always equivalent to the corresponding lowercase letter, although they display as themselves. Because of this latter translation, they may be used as either an initial or subsequent character in a name. Thus the names {A_}l{P_}h{A_} and alpha display differently, but they both refer to the same object; a value assigned to one is reflected in the other.

See also: Quad AV = niladic system function ⎕AV - Atomic Vector page.