Proposal To Extend .atf Files
Contributor: Sudleyplace
Revision as of 22:03, 26 June 2008
Rationale
In order to support APL systems with symbols in their character set that are not in the APL2 character set such as ⍬, ⍤, etc., we need a different file format and/or mechanism than is provided by the APL2 .atf file.
Current File Format
The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character). The byte values may range from 0x00 to 0xFF.
A Proposal
The growing number of APL systems that support Unicode suggests a possible solution:
Extend the file format to UTF-16 (16-bit words with variable-width characters, one or two words per character), possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set, as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it contains bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
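To make "one or two words per character" concrete, here is a small Python sketch. The helper name utf16_words is made up for this illustration and is not part of any APL system; it counts the 16-bit words UTF-16 needs for a single character.

```python
def utf16_words(ch: str) -> int:
    """Return how many 16-bit words UTF-16 uses for one character."""
    # Encoding without a BOM yields exactly two bytes per 16-bit word.
    return len(ch.encode("utf-16-le")) // 2


# APL symbols such as ⍬ (U+236C) and ⍤ (U+2364) lie in the Basic
# Multilingual Plane, so each takes a single word.
print(utf16_words("⍬"), utf16_words("⍤"))

# A character beyond U+FFFF is stored as a surrogate pair: two words.
print(utf16_words("\U0001D11E"))
```

The APL symbols that motivate this proposal all fit in one word; the two-word case only arises for characters beyond U+FFFF.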
Byte Order Mark
The BOM is not strictly necessary as a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems. However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.
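In UTF-16, the BOM is the code point U+FEFF, which is stored as the bytes 0xFE, 0xFF in a big-endian file and as 0xFF, 0xFE in a little-endian file. The byte-swapped value U+FFFE is a noncharacter, so a reader that sees it knows the file is in the opposite byte order.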
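The recognition rules in this section can be sketched in Python as follows. The function name detect_atf_encoding and its return values are assumptions made for this illustration. The zero-byte fallback assumes, as discussed above, that the first character of the file is below U+0100 and that an existing one-byte .atf file never begins with a zero byte or with a BOM byte sequence.

```python
import codecs


def detect_atf_encoding(head: bytes) -> str:
    """Classify a file from its first two bytes.

    Returns "utf-16-be", "utf-16-le", or "bytes" (the existing
    one-byte-per-character format).
    """
    # A BOM settles the question directly.
    if head[:2] == codecs.BOM_UTF16_BE:  # bytes 0xFE 0xFF
        return "utf-16-be"
    if head[:2] == codecs.BOM_UTF16_LE:  # bytes 0xFF 0xFE
        return "utf-16-le"
    # No BOM: fall back on the zero-byte heuristic.
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:
            return "utf-16-be"  # first byte is zero
        if head[1] == 0 and head[0] != 0:
            return "utf-16-le"  # second byte is zero
    return "bytes"
```

If the group settles on a single byte order, the fallback branch could be dropped and only one BOM (or none) would need checking.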
File Extension
Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.
UCS-2 Systems
In systems that support UCS-2 instead of UTF-16, on import, replace any character encoded as a surrogate pair with U+FFFD (the Replacement Character).
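A minimal Python sketch of this substitution follows. The function name decode_for_ucs2 is hypothetical, and the sketch assumes that each surrogate pair becomes a single U+FFFD (rather than one per 16-bit word) and that the input is well-formed UTF-16 with the BOM already removed.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD, the Replacement Character


def decode_for_ucs2(data: bytes, encoding: str = "utf-16-le") -> str:
    """Decode UTF-16 data for a system limited to UCS-2."""
    # After decoding, a surrogate pair is a single code point above U+FFFF.
    text = data.decode(encoding)
    return "".join(REPLACEMENT if ord(ch) > 0xFFFF else ch for ch in text)


sample = "A\U0001D11E⍬".encode("utf-16-le")
print(decode_for_ucs2(sample))  # the middle character is replaced
```

Characters in the Basic Multilingual Plane, including the APL symbols, pass through unchanged.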
Importing Data
The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.
Exporting Data
The export mechanism (e.g., system command )OUT) needs to be augmented, perhaps with a switch on the command or a new command altogether, so that it can be told which format (UCS-1 or UTF-16) to use when writing out the workspace contents.
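As a rough illustration of such a switch, the Python sketch below chooses the output encoding from a format argument. Everything named here is an assumption for illustration: the function encode_for_out, the format labels "ucs-1" and "utf-16", and the caller-supplied apl2_table that stands in for each system's existing translation to the APL2 character set. The choice of little-endian with a BOM is also only one of the options the proposal leaves open.

```python
import codecs
from typing import Mapping, Optional


def encode_for_out(text: str, fmt: str,
                   apl2_table: Optional[Mapping[str, int]] = None) -> bytes:
    """Encode text for export in the requested format."""
    if fmt == "utf-16":
        # Assumed layout: BOM followed by little-endian UTF-16.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if fmt == "ucs-1":
        if apl2_table is None:
            raise ValueError("the one-byte format needs a translation table")
        try:
            # apl2_table is a hypothetical character-to-byte mapping.
            return bytes(apl2_table[ch] for ch in text)
        except KeyError as exc:
            raise ValueError(
                f"{exc.args[0]!r} has no one-byte encoding") from None
    raise ValueError(f"unknown format: {fmt!r}")
```

With this shape, a symbol that is missing from the one-byte table is reported as an error in the old format but written out without translation in the new one.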