Proposal To Extend .atf Files
Latest revision as of 13:09, 21 October 2008
Rationale
In order to support APL systems whose character sets include symbols that are not in the APL2 character set, such as ⍬ and ⍤, we need a different file format and/or mechanism than the APL2 .atf file provides.
Current File Format
The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character), using the APL2 character set. The byte values may range from 0x00 to 0xFF.
A Proposal
The growing number of APL systems that support Unicode suggests a possible solution:
Extend the file format to UTF-16 (16-bit words with variable width characters, one or two words per character) possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it may contain bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
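To illustrate "one or two words per character", the following minimal Python sketch counts the 16-bit words UTF-16 needs for a few characters. The helper name `utf16_words` is hypothetical and not part of any APL system; the sample characters are chosen only for illustration.

```python
def utf16_words(ch: str) -> int:
    """Return how many 16-bit words UTF-16 uses to encode one character."""
    # Encode without a BOM so only the character itself is counted.
    return len(ch.encode("utf-16-le")) // 2


# "A", zilde (U+236C), jot diaeresis (U+2364), and one character
# outside the Basic Multilingual Plane (U+1D538).
for ch in ("A", "\u236c", "\u2364", "\U0001d538"):
    print(f"U+{ord(ch):04X} needs {utf16_words(ch)} word(s)")
```

The APL symbols mentioned in the Rationale fit in a single word; only characters above U+FFFD's plane (that is, above U+FFFF) need a surrogate pair.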
Byte Order Mark
The BOM is not strictly necessary: because the initial character of a .atf file currently has a limited range (it is already used to distinguish between EBCDIC and ASCII), a UTF-16 format .atf file could be recognized by noting that the second byte is zero in little-endian byte order and the first byte is zero in big-endian byte order. However, rather than rely upon that limited range, a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.
For reference, the BOM is U+FEFF. In UTF-16 it is represented as the byte sequence (0xFE, 0xFF) in big-endian order and as (0xFF, 0xFE) in little-endian order.
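The recognition rules in this section could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `sniff_atf_encoding` and the returned labels are hypothetical, and the zero-byte test relies on the initial character of a legacy .atf file never being 0x00, as the limited range described above implies.

```python
import codecs


def sniff_atf_encoding(head: bytes):
    """Guess (encoding, has_bom) from the first two bytes of a transfer file."""
    # An explicit BOM settles the question.
    if head.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE, 0xFF
        return ("utf-16-be", True)
    if head.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF, 0xFE
        return ("utf-16-le", True)
    # No BOM: fall back on the zero-byte heuristic (assumption: the
    # initial character is in the single-byte range and is not 0x00).
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:
            return ("utf-16-be", False)
        if head[1] == 0 and head[0] != 0:
            return ("utf-16-le", False)
    # Otherwise treat the file as the existing one-byte-per-character format.
    return ("ucs-1", False)


print(sniff_atf_encoding(codecs.BOM_UTF16_LE + "X".encode("utf-16-le")))
print(sniff_atf_encoding("X".encode("utf-16-be")))
print(sniff_atf_encoding(b"X "))
```

Checking for the BOM first means the heuristic is only consulted for files written without one.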
File Extension
Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.
UCS-2 Systems
In systems that support UCS-2 instead of UTF-16, on import, replace characters in a surrogate pair with U+FFFD (the Replacement Character).
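One possible reading of this rule is sketched below in Python: every 16-bit word in the surrogate range (0xD800 to 0xDFFF) is replaced individually, so a surrogate pair becomes two replacement characters and the word count is preserved. That interpretation, the function name `ucs2_import`, and its `little_endian` parameter are assumptions made for illustration; a system could equally collapse each pair into a single U+FFFD.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD, the Replacement Character


def ucs2_import(data: bytes, little_endian: bool = True) -> list:
    """Decode BOM-less UTF-16 bytes into UCS-2 code units.

    Any word that is half of a surrogate pair is replaced by U+FFFD.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    count = len(data) // 2
    fmt = ("<" if little_endian else ">") + f"{count}H"
    words = struct.unpack(fmt, data)
    return [REPLACEMENT if 0xD800 <= w <= 0xDFFF else w for w in words]


sample = "A\U0001d538\u236c".encode("utf-16-le")
print([f"U+{w:04X}" for w in ucs2_import(sample)])
```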
Importing Data
The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.
Exporting Data
The export mechanism — e.g., system command )OUT — needs to be augmented, perhaps with a switch on the command or a new command altogether so as to be told which format (APL2 UCS-1 vs. UTF-16) to use when writing out the workspace contents.
Transfer Form System Function
Just as )OUT needs to be told which format to write and )IN needs to know which format it is reading, the system function ⎕TF needs the same information. For the purposes of testing this idea, the NARS implementation of ⎕TF uses left arguments of ¯1 and ¯2 to generate or accept Type 1 and Type 2 transfer forms, respectively, in UTF-16 format (really UCS-2).