Proposal To Extend .atf Files

From NARS2000
Revision as of 21:55, 26 June 2008 by Sudleyplace

Rationale

In order to support APL systems whose character sets include symbols that are not in the APL2 character set, such as ⍬, ⍤, etc., we need a different file format and/or mechanism than the one provided by the APL2 .atf file.

Current File Format

The current file format is essentially UCS-1: fixed-width characters, one byte per character, with byte values ranging from 0x00 to 0xFF.

A Proposal

Given that more and more APL systems support Unicode, that suggests a possible solution:

Extend the file format to UTF-16 (variable-width characters made up of 16-bit words, one or two words per character), possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set, as everything would be in a common character set, i.e., Unicode encoded as UTF-16. Note that the existing .atf file format is already considered a binary file because it contains bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
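As a rough illustration of "one or two words per character", the following Python sketch counts the 16-bit words that UTF-16 needs for a few characters. The helper name `utf16_units` is hypothetical, and U+1D538 is just an arbitrary character outside the Basic Multilingual Plane chosen for the example; neither comes from the proposal itself.

```python
def utf16_units(ch):
    """Return the number of 16-bit words UTF-16 uses for one character."""
    # Encode without a BOM so only the character itself is counted.
    return len(ch.encode("utf-16-le")) // 2

for ch in ["A", "\u236C", "\u2364", "\U0001D538"]:
    print(f"U+{ord(ch):04X} -> {utf16_units(ch)} word(s)")
```

The APL symbols ⍬ (U+236C) and ⍤ (U+2364) each fit in a single word; only characters above U+FFFF need two.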

Byte Order Mark

The BOM is not strictly necessary, as a UTF-16 format .atf file could be recognized by noting that the second byte is zero in a little-endian file and the first byte is zero in a big-endian file. This works only because the initial character of a .atf file has a code point below U+0100. However, rather than rely upon the currently limited range of that initial character (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.

In UTF-16, the BOM is the single character U+FEFF. It is written as the bytes 0xFE, 0xFF in a big-endian file and as the bytes 0xFF, 0xFE in a little-endian file.
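The two recognition strategies discussed in this section (an explicit BOM, or a zero byte in the first word) might be sketched in Python as follows. The function name `sniff_encoding` and its return values are assumptions made for illustration, and the zero-byte test assumes that the first character of the file has a code point below U+0100.

```python
import codecs

def sniff_encoding(head):
    """Guess (encoding, bom_length) from the first bytes of a transfer file."""
    if head[:2] == codecs.BOM_UTF16_BE:        # bytes FE FF
        return ("utf-16-be", 2)
    if head[:2] == codecs.BOM_UTF16_LE:        # bytes FF FE
        return ("utf-16-le", 2)
    # No BOM: fall back on the zero byte of the first 16-bit word.
    if len(head) >= 2 and head[0] == 0 and head[1] != 0:
        return ("utf-16-be", 0)
    if len(head) >= 2 and head[1] == 0 and head[0] != 0:
        return ("utf-16-le", 0)
    # Anything else is treated as the existing one-byte-per-character format.
    return ("single-byte", 0)

sample = "X".encode("utf-16-le")
print(sniff_encoding(codecs.BOM_UTF16_LE + sample))  # ('utf-16-le', 2)
print(sniff_encoding(sample))                        # ('utf-16-le', 0)
print(sniff_encoding(b"XA"))                         # ('single-byte', 0)
```

Checking for the BOM first keeps the fallback heuristic out of the way whenever the writer did emit a BOM.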

File Extension

Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM), we might not even need a new file extension, although the file extension .utf has a certain appeal.

UCS-2 Systems

In systems that support UCS-2 instead of UTF-16, on import, replace characters in a surrogate pair with U+FFFD (the Replacement Character).
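A minimal Python sketch of this substitution is shown below. It works on raw 16-bit words and replaces every surrogate word (0xD800 through 0xDFFF) with U+FFFD, so a surrogate pair becomes two replacement characters. That is one reading of the rule above; emitting a single U+FFFD per pair would be an equally plausible alternative. The function name `to_ucs2_units` and its parameters are hypothetical.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def to_ucs2_units(data, little_endian=True):
    """Split UTF-16 bytes (without BOM) into 16-bit words, replacing surrogates."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    count = len(data) // 2
    order = "<" if little_endian else ">"
    units = struct.unpack(f"{order}{count}H", data)
    # Surrogates cannot be represented as UCS-2 characters, so substitute them.
    return [REPLACEMENT if 0xD800 <= u <= 0xDFFF else u for u in units]

raw = "\u236C\U0001D538".encode("utf-16-le")
print([f"{u:04X}" for u in to_ucs2_units(raw)])  # ['236C', 'FFFD', 'FFFD']
```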

Importing Data

The import mechanism, such as the system command )IN, needs to be enhanced to recognize files in the new format and import them appropriately.

Exporting Data

The export mechanism, such as the system command )OUT, needs to be augmented, perhaps with a switch on the command or with a new command altogether, so that it can be told which format (UCS-1 or UTF-16) to use when writing out the workspace contents.
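The format switch could be sketched in Python along these lines. Everything here is an assumption for illustration: the function name `encode_for_out`, the format labels, the choice of little-endian with a leading BOM, and the caller-supplied `legacy_table` that stands in for whatever translation to the APL2 character set an implementation already has.

```python
import codecs

def encode_for_out(text, fmt="utf-16", legacy_table=None):
    """Encode workspace text in the requested transfer format."""
    if fmt == "utf-16":
        # Little-endian with a BOM is only one of the options left open above.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if fmt == "ucs-1":
        if legacy_table is None:
            raise ValueError("a character-to-byte table is required for ucs-1")
        try:
            # One byte per character, looked up in the supplied table.
            return bytes(legacy_table[ch] for ch in text)
        except KeyError as exc:
            raise ValueError(
                f"character {exc.args[0]!r} has no single-byte code"
            ) from None
    raise ValueError(f"unknown format: {fmt!r}")

print(encode_for_out("\u236C").hex())  # fffe6c23
```

Raising an error for a character that has no single-byte code makes the limitation of the old format explicit at export time rather than silently losing the symbol.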