Proposal To Extend .atf Files: Difference between revisions

Revision as of 09:52, 8 July 2008

Rationale

In order to support APL systems with symbols in their character set that are not in the APL2 character set such as ⍬, ⍤, etc., we need a different file format and/or mechanism than is provided by the APL2 .atf file.

Current File Format

The current file format is essentially UCS-1 (bytes with fixed width characters, one byte per character). The byte values may range from 0x00 to 0xFF.

A Proposal

Given that more and more APL systems support Unicode, that suggests a possible solution:

Extend the file format to UTF-16 (words with variable width characters, one or two words per character) possibly with a leading BOM (Byte Order Mark). This would also eliminate the need to translate to/from the APL2 character set as everything would be in a common character set, i.e., UTF-16. Note that the existing .atf file format is already considered a binary file because it may contain bytes beyond 0x7F, so switching to UTF-16 wouldn't change that.

Byte Order Mark

The BOM is not strictly necessary because a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems. However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred. Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.

For reference, the BOM is U+FEFF — in UTF-16, on big-endian systems, it's represented as (0xFE, 0xFF), and on little-endian systems as (0xFF, 0xFE).

File Extension

Depending upon how robust the existing code for )IN is in each of our systems (that is, how well it handles an unexpected leading BOM and/or UTF-16), we might not even need a new file extension, although the file extension .utf has a certain appeal.

UCS-2 Systems

In systems that support UCS-2 instead of UTF-16, on import, replace characters in a surrogate pair with U+FFFD (the Replacement Character).

Importing Data

The import mechanism — e.g., system command )IN — needs to be enhanced to recognize files in the new format and import them appropriately.

Exporting Data

The export mechanism — e.g., system command )OUT — needs to be augmented, perhaps with a switch on the command or a new command altogether so as to be told which format (UCS-1 vs. UTF-16) to use when writing out the workspace contents.

@@ Line 14: / Line 14: @@
 == Byte Order Mark ==
-The BOM is not strictly necessary as a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems.  However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred.  Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.
+The BOM is not strictly necessary because a UTF-16 format .atf file could be recognized by noting that the second byte is zero for little-endian systems and the first byte is zero for big-endian systems.  However, rather than rely upon the currently limited range of the initial character in a .atf file (which is already used to distinguish between EBCDIC and ASCII), a BOM may be preferred.  Alternatively, we can settle on either big-endian (UTF-16BE) or little-endian (UTF-16LE) as the file format and not need a BOM.
-The BOM is U+FEFF &mdash; in UTF-16, on big-endian systems, it's represented as (0xFE, 0xFF), and on little-endian systems as (0xFF, 0xFE).
+For reference, the BOM is U+FEFF &mdash; in UTF-16, on big-endian systems, it's represented as (0xFE, 0xFF), and on little-endian systems as (0xFF, 0xFE).
 == File Extension ==

Proposal To Extend .atf Files: Difference between revisions

Revision as of 09:52, 8 July 2008

Contents

Rationale

Current File Format

A Proposal

Byte Order Mark

File Extension

UCS-2 Systems

Importing Data

Exporting Data

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools