Re: [vhdl-200x] VHDL support for Unicode

From: Evan Lavelle <eml-vhdl-200x@cyconix.com>
Date: Wed Aug 10 2011 - 02:29:24 PDT

I retro-fitted Unicode (UTF-8) to an existing language last year. My own
view is that any/every modern language should support UTF-8 natively,
and that it's easy if you build it in from the start. Retro-fitting can
be a real problem, though. Some issues:

- I was surprised by the comment about the difficulty of supporting
strings; this was the only thing that was trivial

- If the compiler has a (f)lex lexer, then you've got a problem, since
lex matches bytes rather than Unicode characters (and some versions
handle only 7-bit ASCII). However, you can work around this without too
much trouble

- not an issue for VHDL, but my biggest problem was getting UTF-8 into a
C-like preprocessor. This was most of the work

- I had other minor problems with symbol tables, and so on, but these
were easy to fix. They would be more difficult if your compiler is
written in C and relies heavily on NUL-terminated strings. You can
probably fix all these problems by requiring input to be "modified
UTF-8" (which encodes U+0000 as the two bytes C0 80, so the data never
contains a zero byte) rather than standard UTF-8. I think this is quite
common - Java, for example, uses it.

- there's no problem with legacy back-end tools: you just output a
translated version of your identifier. You can't put arbitrary UTF-8
chars into an extended identifier, so this wouldn't be particularly
clean, but it's better than nothing [as an aside, extended identifiers
would of course be redundant with UTF-8 support]

- you have to think about what 'line terminators' and 'spaces' actually
are - Unicode has an extended set of these

- you'd need to support input files that start with Microsoft's 3-byte
"UTF-8 BOM" (EF BB BF)

- one plus is that VHDL has no significant printf-style support, so you
don't have to worry about the tricky issue of display widths, aligning
text, and so on
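
As a rough illustration of three of the points above (the BOM, the
extended set of line terminators, and avoiding zero bytes for
NUL-terminated C strings), here is a minimal Python sketch. The function
names are invented for this example and are not taken from any real
compiler. The last function covers only the U+0000 part of "modified
UTF-8"; Java's variant also encodes supplementary characters
differently, which is not shown here.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Line terminators recognised here: CR LF, LF, VT, FF, CR,
# NEL (U+0085), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029).
# CR LF is listed first so that it counts as a single terminator.
_LINE_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def strip_bom(raw):
    """Drop a leading 3-byte UTF-8 BOM (EF BB BF) if one is present."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):]
    return raw


def split_lines(text):
    """Split decoded source text on any of the terminators listed above."""
    return _LINE_BREAK.split(text)


def encode_without_zero_bytes(text):
    """Encode as UTF-8, but write U+0000 as C0 80 (hypothetical helper).

    In standard UTF-8 the byte 0x00 can only come from U+0000, so a
    plain byte replacement after encoding is sufficient.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


# Example: a BOM-prefixed buffer with a CR LF and a U+2028 terminator.
raw = b"\xef\xbb\xbfentity \xc3\xa9 is\r\nend;\xe2\x80\xa8x"
lines = split_lines(strip_bom(raw).decode("utf-8"))
```

A front end would run something like strip_bom once per input file,
before decoding, and use the zero-byte-free encoding only at the point
where text is handed to code that expects C strings.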

I think there were other issues, but I can't think what right now. It
was a lot of work, for a group of users who were probably mostly happy
to write in Latin characters anyway. Having said that, it's difficult to
justify not having Unicode support.

-Evan
