Proposal To Extend .atf Files

From NARS2000

Revision as of 22:03, 26 June 2008

Rationale

In order to support APL systems with symbols in their character set that are not in the APL2 character set, such as ⍬, ⍤, etc., we need a different file format and/or mechanism than is provided by the APL2 .atf file.

Current File Format

The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character). The byte values may range from 0x00 to 0xFF.

A Proposal

Given that more and more APL systems support Unicode, a possible solution suggests itself:

Extend the file format to UTF-16 (words with variable width characters, one or two words per character), possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set, as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it contains bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
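
As a small illustration of "one or two words per character", the following Python sketch counts the 16-bit words UTF-16 needs for a few characters. The function name utf16_word_count is hypothetical and used only for this example; it is not part of any APL system.

```python
def utf16_word_count(ch: str) -> int:
    """Return how many 16-bit words UTF-16 needs to encode one character."""
    # "utf-16-le" writes no BOM, so the byte length is exactly 2 bytes per word.
    return len(ch.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # "A" and the APL symbols are in the Basic Multilingual Plane (one word);
    # U+1D11E lies outside it and needs a surrogate pair (two words).
    for ch in ["A", "\u236c", "\u2364", "\U0001D11E"]:
        print(f"U+{ord(ch):04X}: {utf16_word_count(ch)} word(s)")
```

The APL symbols ⍬ (U+236C) and ⍤ (U+2364) each fit in a single word, so for typical APL workspaces the variable-width case would arise only for characters outside the Basic Multilingual Plane.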

Byte Order Mark

The BOM is not strictly necessary as a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems. However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.

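In UTF-16, the BOM is the character U+FEFF. It is stored as the bytes 0xFE, 0xFF in a big-endian file and as the bytes 0xFF, 0xFE in a little-endian file. (Reading the mark in the wrong byte order yields U+FFFE, which is not a valid character, and that is how a byte-order mismatch is detected.)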
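
The two recognition strategies described in this section (a leading two-byte mark, or a zero byte in the first word) might be combined roughly as in the Python sketch below. The function name detect_atf_encoding and the returned labels are hypothetical. The sketch also assumes, as the text does, that the first character of a .atf file lies in a limited range, so that a byte-oriented file never starts with a zero byte or with the bytes of a BOM.

```python
def detect_atf_encoding(data: bytes) -> str:
    """Guess the encoding of .atf file contents from the leading bytes.

    Returns "utf-16-be", "utf-16-le", or "bytes" (the existing
    one-byte-per-character format). The labels are illustrative only.
    """
    head = data[:2]
    # First choice: an explicit byte order mark.
    if head == b"\xfe\xff":
        return "utf-16-be"
    if head == b"\xff\xfe":
        return "utf-16-le"
    # No BOM: a first character below U+0100 leaves one byte of the
    # first 16-bit word equal to zero.
    if len(head) == 2:
        if head[0] == 0 and head[1] != 0:
            return "utf-16-be"
        if head[1] == 0 and head[0] != 0:
            return "utf-16-le"
    # Otherwise assume the existing byte-oriented format.
    return "bytes"
```

If the group instead settles on a single byte order without a BOM, only the zero-byte check (or no check at all, given a new file extension) would be needed.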

File Extension

Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.

UCS-2 Systems

In systems that support UCS-2 instead of UTF-16, on import, replace characters in a surrogate pair with U+FFFD (the Replacement Character).
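
A minimal Python sketch of this replacement follows, under one possible reading of the rule: each 16-bit word in the surrogate range (0xD800 to 0xDFFF) is replaced individually, so a surrogate pair becomes two U+FFFD words and the word count is unchanged. Replacing each pair with a single U+FFFD is an equally plausible reading. The function name and the choice of little-endian input are assumptions made for the example.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD, the Replacement Character


def utf16le_to_ucs2_words(data: bytes) -> list:
    """Unpack little-endian 16-bit words, replacing surrogates with U+FFFD."""
    count = len(data) // 2  # ignore a trailing odd byte, if any
    words = struct.unpack(f"<{count}H", data[: count * 2])
    # Words in 0xD800..0xDFFF are halves of a surrogate pair, which a
    # UCS-2 system cannot represent.
    return [REPLACEMENT if 0xD800 <= w <= 0xDFFF else w for w in words]
```

For example, the UTF-16LE encoding of "A", U+1D11E, and ⍬ would come out as the words 0x0041, 0xFFFD, 0xFFFD, 0x236C.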

Importing Data

The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.

Exporting Data

The export mechanism, e.g., system command )OUT, needs to be augmented, perhaps with a switch on the command or with a new command altogether, so that it can be told which format (UCS-1 vs. UTF-16) to use when writing out the workspace contents.