Re: [vhdl-200x] VHDL support for Unicode

From: Evan Lavelle <eml-vhdl-200x@cyconix.com>
Date: Thu Aug 11 2011 - 03:14:32 PDT

On 11/08/2011 09:00, Martin.J Thompson wrote:

> (I'm not sure how different
> OSs are evidence of endian problems - did Windows put a non-standard
> header on UTF16 files at some point?)

There shouldn't ever be an endianness problem when reading or writing
files, and I'm not aware that MS ever messed with UTF-16 or UTF-32. The
only issue is that MS chose to put a 3-byte Byte Order Mark (EF BB BF)
at the start of UTF-8 files generated by some of their tools. It's
redundant, because UTF-8 is a byte stream with no byte-order dependence.
It's not a problem; any smart tool just checks for the 3 bytes and
ignores them.
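
As a rough illustration of the "check for the 3 bytes and ignore them"
step, here is a minimal Python sketch. The function name strip_utf8_bom
and the sample text are my own inventions for the example, not taken from
any particular tool.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical sample input: the same source text with and without a BOM.
without_bom = "entity top is".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(strip_utf8_bom(with_bom) == without_bom)
```

Python's built-in "utf-8-sig" codec does the same thing when decoding, so
a tool written in Python would not even need the helper.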

On the issue of input and output file representation specifically (my
earlier comments were only about the potential difficulty of converting
compiler internals to Unicode, which may not have been obvious), there
seem to be 2 ways of doing this. In Java, for example, the language spec
specifies UTF-16 internally, but doesn't require external files to have
any particular coding. The compiler itself is expected to sort this out;
javac has an '-encoding' option, and you tell it whether your file is
Big5, UTF-16, or whatever (otherwise the platform default encoding is
used). This seems to be pretty common.
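
To make the first approach concrete, here is a small Python sketch of a
front end that decodes source bytes using a caller-supplied encoding and
falls back to the system default when none is given. The function
decode_source and the sample strings are hypothetical; this only
illustrates the idea and is not how javac is implemented.

```python
import locale
from typing import Optional

def decode_source(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode source bytes with the given encoding, or the system default."""
    if encoding is None:
        # No explicit encoding given: fall back to the platform's preference.
        encoding = locale.getpreferredencoding(False)
    return data.decode(encoding)

# Hypothetical inputs: two CJK characters in Big5, and ASCII text in UTF-16.
big5_bytes = "\u4e2d\u6587".encode("big5")
utf16_bytes = "entity top is".encode("utf-16")

print(decode_source(big5_bytes, "big5") == "\u4e2d\u6587")
print(decode_source(utf16_bytes, "utf-16") == "entity top is")
```

The cost of this approach is visible even here: the tool has to carry a
decoder for every encoding a user might name.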

The other approach is just to require external files to be UTF-8. This
is far easier to implement, and also has the great advantage that all
existing 7-bit ASCII files are *already* valid UTF-8. I don't know how
common this is.
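
The ASCII compatibility claim is easy to check. A minimal Python sketch
(the helper is_valid_utf8 and the sample strings are made up for the
example):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A 7-bit ASCII source line is byte-for-byte identical in UTF-8.
ascii_source = "signal clk : std_logic;"
ascii_bytes = ascii_source.encode("ascii")
print(ascii_source.encode("utf-8") == ascii_bytes)
print(is_valid_utf8(ascii_bytes))

# An 8-bit code page file is a different story: a lone Latin-1 0xE9
# byte is not a valid UTF-8 sequence.
latin1_bytes = b"caf\xe9"
print(is_valid_utf8(latin1_bytes))
```

So only files that already use 8-bit code pages would need converting;
pure ASCII sources need no change at all.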

I don't really understand why some languages require internal UTF-16. I
suspect this is a hangover from the obsolete fixed-length UCS-2, when it
looked like all Unicode would fit into 64K code points, but UTF-16
itself is not fixed-length: anything outside the first 64K code points
takes two 16-bit code units. For the vast majority of programs, which
are mostly ASCII, half of every code unit is also unused, and UTF-16 is
difficult to retrofit into existing code. I also don't understand why
any compiler should be asked to provide support for arbitrary obsolete
code pages and coding systems. All this might have made sense 15 years
ago, but I don't think it makes sense now.
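
Both points about UTF-16 can be seen in a few lines of Python. This is
only a sketch; the helper utf16_code_units and the sample text are
invented for the example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# Not fixed-length: a character outside the first 64K code points
# (here U+1D11E, MUSICAL SYMBOL G CLEF) needs a surrogate pair.
print(utf16_code_units("A"))
print(utf16_code_units("\U0001D11E"))

# Half unused for ASCII: every second byte of the UTF-16 form is zero,
# and the text takes twice the space of its UTF-8 form.
ascii_text = "signal clk : std_logic;"
utf8_size = len(ascii_text.encode("utf-8"))
utf16_bytes = ascii_text.encode("utf-16-le")
high_bytes = utf16_bytes[1::2]
print(len(utf16_bytes) == 2 * utf8_size)
print(set(high_bytes) == {0})
```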

-Evan

Received on Thu Aug 11 03:15:10 2011