Proposal To Extend .atf Files

From NARS2000

Revision as of 09:07, 8 July 2008

Rationale

In order to support APL systems whose character sets include symbols that are not in the APL2 character set, we need a different file format and/or mechanism than is provided by the APL2 .atf file.

Current File Format

The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character), using the APL2 character set. The byte values may range from 0x00 to 0xFF.

A Proposal

Given that more and more APL systems support Unicode, that suggests a possible solution:

Extend the file format to UTF-16 (words with variable width characters, one or two words per character) possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it may contain bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.
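To make "one or two words per character" concrete, here is a minimal Python sketch that counts the 16-bit words UTF-16 needs for a given character. The helper name utf16_units is hypothetical and is not part of any APL system; Python is used only for illustration.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit words UTF-16 uses to encode one character."""
    # "utf-16-le" writes no BOM, so the byte count is exactly 2 per word.
    return len(ch.encode("utf-16-le")) // 2


ascii_a = "A"            # U+0041, also present in the APL2 character set
iota = "\u2373"          # U+2373 APL FUNCTIONAL SYMBOL IOTA, inside the BMP
beyond = "\U0001D54F"    # U+1D54F, outside the BMP, needs a surrogate pair

for ch in (ascii_a, iota, beyond):
    print(f"U+{ord(ch):04X} -> {utf16_units(ch)} word(s)")
```

Characters in the Basic Multilingual Plane, which includes the APL symbols in the U+2300 block, take one word each; only characters above U+FFFF take two.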

Byte Order Mark

The BOM is not strictly necessary because a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems. However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.

For reference, the BOM is U+FEFF — in UTF-16, on big-endian systems, it's represented as (0xFE, 0xFF), and on little-endian systems as (0xFF, 0xFE).
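The detection rules described above might be sketched as follows. This is only an illustration in Python, not the code of any existing )IN implementation: the function name sniff_atf_encoding and the "legacy" label are hypothetical, and the zero-byte fallback assumes, as the text does, that the first character of the file has a code point below U+0100.

```python
import codecs


def sniff_atf_encoding(data: bytes) -> tuple[str, int]:
    """Guess the encoding of a transfer file from its first two bytes.

    Returns (encoding, offset), where offset is the number of leading
    BOM bytes to skip before decoding.
    """
    # Explicit BOM: U+FEFF stored as FE FF (big-endian) or FF FE (little-endian).
    if data.startswith(codecs.BOM_UTF16_BE):
        return ("utf-16-be", 2)
    if data.startswith(codecs.BOM_UTF16_LE):
        return ("utf-16-le", 2)
    # No BOM: a leading character below U+0100 leaves one zero byte
    # in the first 16-bit word, which reveals the byte order.
    if len(data) >= 2:
        if data[0] == 0 and data[1] != 0:
            return ("utf-16-be", 0)
        if data[1] == 0 and data[0] != 0:
            return ("utf-16-le", 0)
    # Otherwise assume the existing one-byte-per-character format.
    return ("legacy", 0)


encoding, offset = sniff_atf_encoding(b"\xff\xfe" + "A".encode("utf-16-le"))
print(encoding, offset)
```

If the group settles on a single byte order without a BOM, only the fallback branch for that byte order would be needed.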

File Extension

Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.

UCS-2 Systems

In systems that support UCS-2 instead of UTF-16, on import, replace characters in a surrogate pair with U+FFFD (the Replacement Character).
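As a rough illustration of that import rule, the Python sketch below walks the 16-bit words of a little-endian UTF-16 buffer and substitutes U+FFFD for surrogates. The function name utf16le_to_ucs2 is hypothetical. The sketch also makes two assumptions the proposal does not spell out: a complete surrogate pair collapses to a single U+FFFD, and an unpaired surrogate is replaced as well.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def utf16le_to_ucs2(data: bytes) -> str:
    """Decode UTF-16LE bytes for a UCS-2-only system.

    Assumption: each surrogate pair becomes one U+FFFD, and a lone
    surrogate is also replaced by U+FFFD.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(REPLACEMENT)   # whole pair -> one replacement
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(REPLACEMENT)   # unpaired surrogate
            i += 1
        else:
            out.append(u)             # ordinary BMP character
            i += 1
    return "".join(chr(u) for u in out)


sample = "A\U0001D54FB".encode("utf-16-le")
print(utf16le_to_ucs2(sample) == "A\ufffdB")
```

Replacing each surrogate individually, so that a pair yields two U+FFFD characters, would be an equally plausible reading of the rule; which one to use is a choice the implementers would need to agree on.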

Importing Data

The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.

Exporting Data

The export mechanism, e.g., system command )OUT, needs to be augmented, perhaps with a switch on the command or with a new command altogether, so that it can be told which format (APL2 UCS-1 vs. UTF-16) to use when writing out the workspace contents.

Transfer Form System Function

Just as )OUT needs to be told which format to use as output, the system function ⎕TF needs the same information. For the purposes of testing this idea, the NARS implementation of ⎕TF uses left arguments of ¯1 and ¯2 to generate Type 1 and 2 forms in UTF-16 format (really UCS-2).