Proposal To Extend .atf Files
Revision by Sudleyplace
Revision as of 08:52, 8 July 2008
Rationale
In order to support APL systems with symbols in their character set that are not in the APL2 character set such as ⍬, ⍤, etc., we need a different file format and/or mechanism than is provided by the APL2 .atf file.
Current File Format
The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character). The byte values may range from 0x00 to 0xFF.
A Proposal
Given that more and more APL systems support Unicode, a possible solution suggests itself:
Extend the file format to UTF-16 (16-bit words with variable-width characters, one or two words per character), possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set, as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it may contain bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
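As a small illustration of "one or two words per character", the following Python sketch counts the 16-bit words UTF-16 needs for a given character. The helper name utf16_words is made up for this example and is not part of any APL system.

```python
def utf16_words(ch: str) -> int:
    """Return how many 16-bit words UTF-16 uses for one character."""
    # Encode without a BOM so only the character itself is counted.
    return len(ch.encode("utf-16-be")) // 2


# APL symbols such as ⍬ and ⍤ fit in a single word; characters
# beyond U+FFFF need two words (a surrogate pair).
for ch in ("A", "⍬", "⍤", "\U0001D538"):
    print(f"U+{ord(ch):04X} -> {utf16_words(ch)} word(s)")
```
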
Byte Order Mark
The BOM is not strictly necessary because a UTF-16 format .atf file could be recognized by its first two bytes: as long as the initial character lies in the ASCII range, the second byte is zero in little-endian order and the first byte is zero in big-endian order. However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.
For reference, the BOM is U+FEFF — in UTF-16, on big-endian systems, it's represented as (0xFE, 0xFF), and on little-endian systems as (0xFF, 0xFE).
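The two recognition strategies discussed in this section (an explicit BOM, or a zero byte next to the initial character) might be combined roughly as follows. This is only a sketch in Python: the function name detect_atf_encoding and the "legacy" return value are invented for illustration, and the zero-byte test assumes the initial character of the file lies below U+0100.

```python
import codecs


def detect_atf_encoding(head: bytes) -> str:
    """Guess the encoding of an .atf file from its first few bytes.

    Returns "utf-16-be", "utf-16-le", or "legacy" (the existing
    one-byte-per-character format, ASCII or EBCDIC).
    """
    # 1. An explicit BOM settles the question.
    if head.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    # 2. No BOM: fall back on the zero-byte heuristic, which only
    #    works when the initial character is below U+0100.
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:
            return "utf-16-be"
        if head[1] == 0 and head[0] != 0:
            return "utf-16-le"
    # 3. Otherwise assume the existing byte-oriented format.
    return "legacy"
```

If the group settles on a single byte order instead, steps 1 and 2 collapse into one fixed answer and only the legacy check remains.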
File Extension
Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.
UCS-2 Systems
In systems that support UCS-2 instead of UTF-16, on import, replace each character encoded as a surrogate pair with U+FFFD (the Replacement Character).
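A minimal Python sketch of that substitution is shown here. It assumes that each surrogate pair collapses to a single U+FFFD and that unpaired surrogates are treated the same way; the proposal does not spell out either point, and the function name import_for_ucs2 is hypothetical.

```python
def import_for_ucs2(data: bytes, encoding: str = "utf-16-le") -> str:
    """Decode UTF-16 bytes, keeping only characters a UCS-2 system can hold."""
    # errors="replace" turns malformed input (e.g. an unpaired
    # surrogate) into U+FFFD instead of raising an exception.
    text = data.decode(encoding, errors="replace")
    # A well-formed surrogate pair decodes to one code point above
    # U+FFFF; substitute the Replacement Character for each of those.
    return "".join(ch if ord(ch) <= 0xFFFF else "\ufffd" for ch in text)
```

Characters inside the Basic Multilingual Plane, which includes the APL symbols, pass through unchanged.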
Importing Data
The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.
Exporting Data
The export mechanism — e.g., system command )OUT — needs to be augmented, perhaps with a switch on the command or a new command altogether so as to be told which format (UCS-1 vs. UTF-16) to use when writing out the workspace contents.
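One way such a switch could look on the writing side is sketched below in Python. Everything named here is an assumption made for illustration: the function encode_for_out, the format names "ucs1" and "utf16", the choice of little-endian order, and the legacy_encode callback standing in for each system's existing APL2 character-set translation.

```python
import codecs
from typing import Callable, Optional


def encode_for_out(
    text: str,
    fmt: str = "ucs1",
    legacy_encode: Optional[Callable[[str], bytes]] = None,
    with_bom: bool = True,
) -> bytes:
    """Encode exported text in the format selected by the switch."""
    if fmt == "utf16":
        # New format: no character-set translation is needed.
        body = text.encode("utf-16-le")
        return (codecs.BOM_UTF16_LE if with_bom else b"") + body
    if fmt == "ucs1":
        # Existing format: defer to the system's own translation
        # to the APL2 character set (not modelled in this sketch).
        if legacy_encode is None:
            raise ValueError("legacy format needs an APL2 translator")
        return legacy_encode(text)
    raise ValueError(f"unknown format: {fmt!r}")
```

Defaulting to the existing format keeps current )OUT behaviour unchanged unless the new format is requested explicitly.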