Unicode and MacOS, and Code converters

Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 9/26/00.

Generalities

I have been a Macintosh user since 1991, and I love using it. On this page I would like to explain the current state of Unicode support on the Mac platform from the point of view of an end-user. I have no special knowledge of computer science, and all I can say here is based on my own personal experience. There may be errors or wrong interpretations. I would be very grateful if you could send me comments so that I can improve this page.
As to what Unicode is, I will not venture to explain it here; please visit the official site of the Unicode Consortium for full information. I would only say that, although it may not be the ideal solution to all problems, it is for the moment at least one of the best solutions for multilingual computing. As I am a researcher in the field of Buddhist studies, I need to write and read multilingual texts, which may contain Japanese, as many Chinese characters as possible, transliterated Japanese, Chinese and Sanskrit, English, French, etc. For me, and for my colleagues, Unicode is badly needed.
I should say that there are three main "flavors" of Unicode encoding scheme. One is "standard Unicode", often designated as "UCS-2" or "UTF-16"; the second is "UTF-8", which is the one most often used for multilingual HTML or XML; the third is "UTF-7", which is used less often but may be useful for email, because it uses only 7-bit ASCII characters. Of course, all these encodings can be converted to and from each other.
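To make the three flavors concrete, here is a small sketch in Python (not a Mac OS tool; Python's built-in codecs are only a stand-in, and the function name `encode_all` is my own). It encodes one short multilingual string in each flavor and checks that every one of them converts back to the same text.

```python
def encode_all(text):
    """Encode the same text in the three Unicode "flavors" (illustration only)."""
    return {
        "utf-16": text.encode("utf-16-be"),  # big-endian, no byte order mark
        "utf-8": text.encode("utf-8"),
        "utf-7": text.encode("utf-7"),
    }

sample = "S\u016btra \u7d4c caf\u00e9"  # Latin with a macron, one kanji, an accent
encoded = encode_all(sample)

# UTF-7 output contains only 7-bit ASCII bytes, which is why it suits email.
utf7_is_ascii = all(byte < 0x80 for byte in encoded["utf-7"])

# Each flavor can be converted back to the same text, hence to any other flavor.
decoded = {
    "utf-16": encoded["utf-16"].decode("utf-16-be"),
    "utf-8": encoded["utf-8"].decode("utf-8"),
    "utf-7": encoded["utf-7"].decode("utf-7"),
}
```

All the characters in this sample take two bytes each in UTF-16, while their length in UTF-8 and UTF-7 varies from character to character.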
Windows, from Windows 95 onward (or at the latest, from Windows 98 onward), has full support for Unicode 2.0; with Word 2000 and its foreign language support, we can use a full Unicode font named Arial Unicode. Mac OS lags far behind in this regard. From Mac OS 8.5 onward, the OS itself has ATSUI (Apple Type Services for Unicode Imaging) support, which enables applications to use Unicode; Mac OS 9.0 and later includes all the language kits with their input methods, and supports ATSUI and MLTE (Multilingual Text Engine). ATSUI, for its part, seems to inherit the legacy of another Apple technology called QuickDraw GX: when used with fonts supporting Apple Advanced Typography (AAT), "it enables rendering of Unicode-encoded text with advanced typographic features," according to what Apple says about ATSUI.
You will find a good overview of Mac OS support of Unicode at the page Mac OS 8 and 9 Developer Documentation: Text and Other International Services.
But so far, there is almost no application able to handle Unicode directly -- the only two exceptions that I know of are a little demo program named "MLTE demo 1.0a2", and a very recently released program named "SUE 1.0a2", or Simple Unicode Editor, written by the author of the code converter "Cyclone" that I will mention below. SUE seems a little more stable than MLTE demo, and can import and export text files in different encodings. Finally, there is a multilingual HTML editor named "Unisite"; I am not sure whether it handles Unicode directly, but in any case it seems to write UTF-8 HTML files (I have not yet tried this program).
According to the latest news I have heard (as of July 25, 2000), the next release of Mac OS 9 will include a new ATSUI-aware text editor named "WorldText". And of course, the next generation of Mac OS, OS X, will certainly be fully Unicode savvy (I hope...!).
Fonts:
Many of the fonts which ship with the system (from OS 8.5 onward), such as Times, etc., are in fact Unicode fonts. The input methods (such as Kotoeri) are also Unicode based. But as the applications are not Unicode savvy, the fonts behave as legacy code fonts. As far as I know, there is not yet any full Unicode font made for the Mac. However, Mac OS, from OS 8.5 onward, is supposed to be able to use Windows Unicode fonts directly. It is said (according to [what I understand from...] Apple's Technote No 1159) that if users drop Windows Unicode font files on the System Folder's icon in the Finder, the System will recognize them and direct them to the Fonts folder, adding some resources to them so that they can be used as Mac fonts. However, as far as I could test myself, this mechanism does not seem to work so far, for reasons that I cannot determine (perhaps the Windows fonts that I tried were somewhat corrupted, or protected...).
I was, however, able to convert some Windows Unicode fonts to Mac format, using a font conversion utility called "TrueKeys". One is Titus Bitstream Unicode, which contains many diacritical characters and can be used for example for typing Sanskrit transliterated text. It can be downloaded from Mr. Charles Muller's site (this is an exe file for Windows, so that I had to send it to my Windows machine to expand it [I heard that it's possible to expand exe files on the Mac with the latest version of StuffIt Deluxe, but I don't have it, and I am not sure...]); the other is Bitstream CyberCJK, which can be downloaded from Netscape's ftp site. This is (as far as I know) a full CJK Unicode font, and can display all the Chinese, Korean and Japanese characters included in Unicode 2.0. With these two fonts, and the fonts which come with the different language kits bundled with OS 9, I think I have almost all the Unicode characters in my System 9. -- But I have to mention also that TrueKeys was unable to convert the Windows Arial Unicode font to the Mac format.
Two methods of handling multilingual text: Unicode and WorldScripts:
There are two methods of handling multilingual text in Mac OS. One is the traditional "language kits" or WorldScript method, and the other is the direct handling of Unicode. As I said above, Mac OS 8.5 and later support both methods, but since there is (almost) no application able to handle Unicode directly, it is the language kits method which is used in practice.

MLTE Demo and SUE

Now, I should describe how these pioneer programs, MLTE Demo and SUE, work in practice. I have to confess that I have not used them very extensively; I have only tested them briefly. When I launch these programs, the Fonts menu is very different from that of other applications. For example, I have Titus Bitstream Unicode at the top of the Fonts menu in MLTE Demo and SUE, while it is Osaka which is at the top of the Fonts menu in other, non-Unicode savvy applications, etc. And in these two programs, the keyboards named "Extended Roman (U)" and "Unicode Hex Input" in the Keyboards menu are enabled, while in other programs, these items are grayed out.
I can open Unicode text files in these programs (but not Little Endian Unicode text files [on "Little Endian Unicode text file", see below]; this will simply crash the system! -- in SUE, you can open Little Endian files using the "Import" menu item in the File menu ...). The text is displayed in the System default font (Osaka, in my case), so that if the text is multilingual, there will be many "garbled" characters -- this problem will remain as long as there is no full Unicode font...! The problem is that even if I change the fonts to display the characters correctly, if I save the file as a Unicode text file, all the font information is lost when I re-open the same file, and I have to change the fonts again... In this regard, SUE is much better, because it has a file format named "Textension Document" in which the font information seems to be preserved.
MLTE Demo has a menu named "Features", which seems to contain items related to typographical effects, such as "Common Ligatures", "Rare Ligatures", etc. But I could not figure out how to use this menu.

SUE can import from, and export to, text files in different encodings: it seems to use Text Encoding Converter (see below) to do these conversions. However, if the text is multilingual (multi-script), these conversions are not so useful, because they seem to convert only from one script to another.
SUE has a Search menu, which is an indispensable feature for a text editor. So, it can find and replace Unicode text.
Now, for the inputting of text, we can use different keyboards in the Keyboard menu. In addition to the usual script keyboards, as I said above, we can use the "Extended Roman (U)" keyboard and the "Unicode Hex Input" keyboard. The "Extended Roman (U)" keyboard is used for inputting common diacritical characters such as "a" with macron, etc. It mainly uses the "dead key" system. For example, to input "a" with macron, we have to press "Option + a + a", etc. You can find a jpeg image of the layout of that keyboard at http://homepage.mac.com/goldsmit/.Pictures/ExtendedRoman.jpg. As to the "Unicode Hex Input" keyboard, it is like the standard US one, but if you hold the Option key down you can type 4 hex digits to produce any Unicode code point.
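What the "Unicode Hex Input" keyboard does with the four digits can be sketched as follows. This is only an illustration in Python, not how the keyboard is implemented, and the function name `from_hex_input` is my own.

```python
def from_hex_input(digits):
    """Return the character whose Unicode number is given as 4 hex digits."""
    if len(digits) != 4:
        raise ValueError("expected exactly 4 hex digits, e.g. '0101'")
    # int(..., 16) raises ValueError if the string is not valid hexadecimal.
    return chr(int(digits, 16))

a_macron = from_hex_input("0101")  # LATIN SMALL LETTER A WITH MACRON
capital_a = from_hex_input("0041")
```

So typing 0101 with the Option key held down should give the same "a" with macron as the dead-key sequence on the "Extended Roman (U)" keyboard.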

But valuable as they are, I have to say that these two programs are really "demo" or experimental programs. I would not recommend using them for "real life" work...

WorldScript aware applications

There are several well-known WorldScript aware applications. Among editors, those which use the WASTE text engine, such as Style and Tex-Edit Plus, are fully multilingual; moreover, Style can read and save documents as Unicode text. Among word-processors, Pascal Write is also fully multilingual; Nisus Writer, which has a reputation for being multilingual, cannot correctly handle Indian languages (i.e. those languages which use ligatures), and it has some incompatibility with the latest version of the Chinese language kits (Traditional Chinese and Simplified Chinese), although for Japanese, Arabic, Hebrew, etc., it is one of the most powerful and versatile programs. Other than those programs, HyperCard is fully multilingual, and so is Mariner-J.

Text Encoding Converter

When Mac applications have to deal with Unicode text, the usual way is to convert that text to "legacy codes" and use different language kits and fonts to display them.
Web browsers, such as Internet Explorer and Netscape Communicator, seem to use Text Encoding Converter (TEC) to convert Unicode text to "legacy codes". However, it seems that the best web browser for the display of Unicode text is iCab. Cyberdog, Netscape Messenger and Outlook Express can also deal with Unicode text via TEC.
In addition, the latest version of Style and the editor named "Jedit3" use the same TEC to convert text from Unicode to "legacy codes", or vice versa. Style can also convert text to Unicode (UTF-16) or to UTF-8 using AppleScript scripting. Here is an example of a Style script which converts all the style runs of a document into UTF-8 text, in a new document:
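tell application "Style"
	-- copy the styled contents of the front document
	set theText to contents of document 1 as styled text
	-- put that text into a new document
	make new document
	set selection to theText
	tell document 1
		-- replace each style run with its UTF-8 form
		set every style run to UTF8 text of current item
	end tell
end tell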
TEC is a text encoding conversion API built into the system. It can deal with many encodings: at least 64 encoding names are listed below, although some of them are different names for the same encoding. It is fast and powerful, but it has some drawbacks:

When the needed language kit is not installed in the system, or when some Unicode characters cannot be converted into any of the encodings of the installed language kits, TEC outputs "?" -- so that some of the original information can be lost.
Users cannot customize the conversion tables. This is a major problem when the correspondence between different encodings is not one-to-one, as is the case between legacy CJK characters and Unicode CJK characters.
There are two kinds of standard Unicode (UCS-2 or UTF-16) text: the one produced by the Mac and some other OSes (Unix?) is named "Big Endian", and the other, produced by Windows, is named "Little Endian". They are distinguished by the first two bytes of the file, named the "byte order mark" (BOM). The BOM is the character 0xFEFF: a Big Endian file begins with the bytes FE FF, while a Little Endian file begins with the bytes FF FE (so that it reads as 0xFFFE when taken as Big Endian). TEC can deal only with Big Endian UTF-16 text. To convert Little Endian text with TEC, we have to reverse the order of the bytes in advance with some utility (for example, a little MacPerl droplet that I wrote can be used [please see below for this utility]). -- Note that other Unicode "flavors", like UTF-8 (usually used for HTML) and UTF-7, have no "byte order" problem.
TEC is a system API, having no user interface, so that end-users are generally unable to use it directly.
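The first drawback, the "?" substitution, can be reproduced outside TEC. The following is only a sketch in Python, whose built-in codecs stand in for TEC's conversion tables (they are not the same tables); the function names `to_legacy` and `unconvertible` are my own.

```python
def to_legacy(text, encoding):
    """Convert Unicode text to a legacy encoding, writing "?" for anything missing."""
    return text.encode(encoding, errors="replace")

def unconvertible(text, encoding):
    """List the characters of text that the legacy encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

# "a" with macron (used in Sanskrit transliteration) does not exist in Shift-JIS.
lossy = to_legacy("\u0101", "shift_jis")
lost = unconvertible("a\u0101b", "shift_jis")
```

Once the "?" has been written, there is no way to tell from the output which character it replaced; listing the unconvertible characters before converting is one way to see in advance what would be lost.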

Tools using TEC

This last weakness is addressed by several useful utilities:
OS 9 and later ships with a little program named "Chinese Text Converter", which can be used in fact as a universal encoding converter.
Another freeware encoding converter using TEC is Cyclone. The user interface is very simple and intuitive. It is AppleScript aware and can be scripted by AppleScript. The ReadMe file has some interesting information about Unicode and other encodings.
There is yet another freeware encoding converter program named Uctrans 1.0d1; it converts encodings between Unicode and Japanese encoding (Shift-JIS) and Big5 (Traditional Chinese) and GB (Simplified Chinese), and is created by Mr. Motohiko Kitahara, author of another good converting program. The documentation is in Japanese only.
Finally, a very powerful (and free) OSAX (AppleScript Scripting Addition) named TEC OSAX can be downloaded from the web site of its author, Mr. Hideaki Iimori. As it is an OSAX, it has no user interface of its own, but it provides a framework with which users can create simple utilities with an interface (as AppleScript scripts, droplets, or UserLand Frontier droplets, for example). It also comes with some sample droplets which can be used immediately. TEC OSAX has the outstanding feature of converting between Unicode and the Mac's Styled Text.
Styled Text is text with style attributes such as font, size, font face (plain, bold, etc.), colors, etc. -- and each Mac font has a "script attribute", i.e. language information. Thus, it is possible to determine the language (or "script") of a chunk of text from the font used in it. With this feature, it is possible to convert multilingual Unicode text to multilingual Styled Text, and vice versa. Note that none of the other encoding converters using TEC that I mentioned above can convert multilingual text.
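The idea of converting Styled Text run by run can be sketched as follows. This is a Python illustration under my own assumptions, not TEC OSAX's actual method: each style run is represented as a hypothetical (script name, bytes) pair, the table `SCRIPT_TO_CODEC` is invented for the example, and Python's "shift_jis" codec is only an approximation of the Mac Japanese encoding.

```python
# Hypothetical table: the script of a run's font decides which legacy
# encoding its bytes are in.
SCRIPT_TO_CODEC = {
    "roman": "mac_roman",
    "cyrillic": "mac_cyrillic",
    "japanese": "shift_jis",  # approximation of the Mac Japanese encoding
}

def styled_runs_to_utf8(runs):
    """Decode each (script, bytes) run with its own codec, then join as UTF-8."""
    parts = []
    for script, raw in runs:
        codec = SCRIPT_TO_CODEC[script]  # KeyError for a script not in the table
        parts.append(raw.decode(codec))
    return "".join(parts).encode("utf-8")

# Build three runs the way a multi-script document would store them.
runs = [
    ("roman", "caf\u00e9 ".encode("mac_roman")),
    ("cyrillic", "\u043c\u0438\u0440 ".encode("mac_cyrillic")),
    ("japanese", "\u65e5\u672c".encode("shift_jis")),
]
utf8_text = styled_runs_to_utf8(runs)
```

A converter that applies a single encoding to the whole text cannot do this, because the same byte value means different characters in the different runs.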
Very unfortunately, the latest version of TEC OSAX, version 1.3b2, does not seem to be fully compatible with the latest version of TEC itself (version 1.5, which comes with OS 9 and later); some of the most common encodings (e.g. "macintosh" or "X-MAC-CYRILLIC", etc.) are not available to TEC OSAX. The latest version of TEC with which TEC OSAX 1.3b2 is fully compatible is TEC version 1.4.

To remedy this situation, I wrote a little AppleScript demo script which converts multilingual/multi-script StyledText to UTF-8 text. It uses Cyclone, and two OSAX, Text X and System Misc, which are included in the package of an excellent editor named "QuoEdit" (version 0.641 and later); you can find it at <http://hyperarchive.lcs.mit.edu/HyperArchive/Abstracts/text/HyperArchive.html>, looking for "QuoEdit". You may also need Jon's Commands if you use an OS earlier than 8.5 (download it for example from Download.com). -- This script gets the selected text from a document in Style, and writes the converted UTF-8 text into a new document in Style. In fact, since Style can do this conversion on its own, with its native AppleScript command (as I described above), this script is not useful in itself. I simply wanted to demonstrate that with the help of some OSAX and Cyclone's scripting capability, this kind of conversion is possible -- although it is very slow, because it uses the clipboard to do these conversions. -- But I should add that, as far as I know, it would be impossible to do the reverse conversion, i.e. from Unicode to multi-script StyledText, using Cyclone...

I wrote another little demo script, which will convert a mixed StyledText text with Mac Roman font and Mac Cyrillic font, to a mixed text with Windows "Latin-1" encoding and Windows Cyrillic encoding; it uses the same OSAX and Cyclone. I don't know if such a conversion is useful at all. This is also a kind of "demo" script -- but you will be able to easily change the pairs of the encodings to be converted for your own needs.

Please try these demo scripts in the Script Editor (they are not runnable on their own).

You can download the package of my two AppleScript scripts here (16K to download).
Cyclone is now at version 1.3; it now supports the command "convert text aText, fromCode, toCode". This simplifies the script that converts multilingual Styled Text to UTF-8 text. I have added this new script, named "convStyledTextScript for 1.3".

Here you can also download a package of three little AppleScript scripts "Nisus Unicode Tools" (20K to download), which can be used with Nisus Writer (I wrote these scripts). Two of them will convert the selected multilingual text in a document in Nisus Writer to a multilingual UTF-8 or UTF-7 text and will paste it in a new document in Nisus Writer. The two others will convert the selected UTF-7 or UTF-8 text in Nisus Writer to multilingual StyledText and will paste it in a new document in Nisus Writer. They require TEC OSAX (and Jon's Commands for OS earlier than 8.5). You can put them in the Apple Menu Folder, in the System Folder, so that you will have access to them from within Nisus Writer. -- Note that these scripts cannot handle text bigger than 32 KB.

When, for example, you want to make a multilingual HTML document in Nisus Writer, write all your HTML with different fonts for each language; when you have finished all your editing and are sure that your text displays as you expect in the web browser (except for the "garbled" foreign characters...), then, select your text and choose this script in the AppleMenu. Now, you will have your text converted to UTF-8. You will have to add the "charset" info in the meta tag, like this:

<html>
<head>
<title>Your_Multilingual_Web_Page</title>
<meta http-equiv="content-type" content="text/html; charset=unicode-2-0-utf-8">
</head>
<body alink="#008000" bgcolor="#FFFFFF" link="#0000FF" vlink="#800080">
.....

As to the other two scripts, converting text to/from UTF-7 text, they may be used for example with the new Nisus Email, the new product from Nisus Software, to send and receive multilingual text as email.
I should also mention another OSAX called "Unicode OSAX", although it is not a text converting tool. It converts between characters and their (decimal) numbers in the Unicode character set. For example, you can have:
Unicode number "c"
	-->	99
or:
Unicode character 99
	-->	"c"
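For readers without this OSAX, the same two lookups can be tried in Python. This is only an equivalent illustration, not the OSAX itself; the function names simply mirror the two commands above.

```python
def unicode_number(character):
    """Return the decimal Unicode number of a single character."""
    return ord(character)

def unicode_character(number):
    """Return the character that has the given decimal Unicode number."""
    return chr(number)

number_of_c = unicode_number("c")
character_99 = unicode_character(99)
```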

Other converters (using MacPerl)

As I wrote above, the major problem with conversions using TEC is that when the needed characters are lacking in Mac OS or the installed language kits, they are rendered as "?" -- losing some of the original information in the Unicode text. This problem can be solved by MacPerl (or MacJPerl) scripts. One of the most comprehensive Perl scripts for conversions between Unicode and legacy CJKV ["V" stands for Vietnamese] codes is the one written by Ken Lunde: it is "cjkvconv.pl". Very unfortunately, this script won't run easily on MacPerl/MacJPerl, because of its particular interface.
But there are two converters written by Nowral-san which work with MacPerl/MacJPerl. They are:
Uni2Multi
and
Uni2SJIS
that you will find in his Unicode page (please visit first his ReadMe page where you will find a detailed explanation as to how to use his tools).
Uni2Multi converts multilingual Unicode text file to multilingual Mac Styled Text file.
Uni2SJIS converts Unicode Chinese or Japanese (and Roman) text file to Shift-JIS text file.
Nowral-san's tools have the great feature of being able to convert Unicode characters which are not in the JIS (or other) character set but are in the Morohashi Kanji Dictionary to the corresponding Mojikyo character numbers. If you have Mojikyo fonts for the Mac, you can use another of Nowral-san's scripts, Mojikyo2Font (that you will find in his Mojikyo page), to convert Mojikyo numbers to the corresponding characters.

Nowral-san wrote two other scripts to convert UCS-2 (UTF-16) text to UTF-8 text, and vice versa. These are:
UTF82Uni
and
Uni2UTF8
that you will find in his Unicode page.
All the Unicode tools written by Nowral-san support both Big Endian and Little Endian files.
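The kind of conversion these two scripts perform can be sketched as follows. This is a Python illustration under my own assumptions, not Nowral-san's code: the function names are invented, and a file without a byte order mark is assumed to be Big Endian.

```python
import codecs

def utf16_to_utf8(data):
    """Convert UTF-16 bytes of either byte order to UTF-8, using the BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        text = data[2:].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        text = data[2:].decode("utf-16-le")
    else:
        text = data.decode("utf-16-be")         # assumption: no BOM means Big Endian
    return text.encode("utf-8")

def utf8_to_utf16(data, little_endian=False):
    """Convert UTF-8 bytes to UTF-16 with a BOM, Big Endian unless asked otherwise."""
    text = data.decode("utf-8")
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

big = utf8_to_utf16("\u7d4c".encode("utf-8"))
little = utf8_to_utf16("\u7d4c".encode("utf-8"), little_endian=True)
```

Both `big` and `little` convert back to the same UTF-8 bytes, because the byte order mark tells the reader which order to use.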
Finally, I myself wrote a little MacPerl script which reverses the bytes of Little Endian Unicode text files (to make them Big Endian files). If you need it, please download reverseByte (6K to download), here.
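The byte reversal itself is a very simple operation. Here is a sketch of it in Python (not the MacPerl script; the function name `swap_byte_order` is my own): every pair of bytes is exchanged, which also turns the Little Endian byte order mark FF FE into the Big Endian one, FE FF.

```python
def swap_byte_order(data):
    """Exchange the two bytes of every 16-bit unit in UTF-16 data."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")
    swapped = bytearray(data)
    # Bytes at even positions take the values from odd positions, and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

little_endian = b"\xff\xfe\x41\x00"  # BOM followed by "A", Little Endian
big_endian = swap_byte_order(little_endian)
```

Applying the function twice gives back the original bytes, so the same routine also converts Big Endian files to Little Endian.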

Mr. Yusuke Kinoshita told me about a very informative web site on Unicode, Unicode Resources by Alan Wood. It is not specifically about Unicode on the Mac OS, but one can learn much from this site.

Here is the list of encodings available to Text Encoding Converter:

macintosh
X-MAC-JAPANESE
X-MAC-CHINESETRAD
X-MAC-KOREAN
X-MAC-ARABIC
X-MAC-HEBREW
X-MAC-GREEK
X-MAC-CYRILLIC
X-MAC-DEVANAGARI
X-MAC-GURMUKHI
X-MAC-GUJARATI
X-MAC-THAI
X-MAC-CHINESESIMP
X-MAC-CENTRALEURROMAN
X-MAC-SYMBOL
X-MAC-DINGBATS
X-MAC-TURKISH
X-MAC-CROATIAN
X-MAC-ICELANDIC
X-MAC-ROMANIAN
X-MAC-FARSI
X-MAC-UKRAINIAN
X-MAC-VT100
UNICODE-1-1
UNICODE-1-1-UTF-7
UNICODE-1-1-UTF-8
UNICODE-2-0
UNICODE-2-0-UTF-7
UTF-8
ISO-8859-1
ISO-8859-2
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
cp437
cp864
windows-1252
windows-1250
windows-1251
windows-1253
windows-1254
windows-1255
windows-1256
US-ASCII
JIS_C6226-1983
csISO58GB231280
X-GBK
csKSC56011987
ISO-2022-JP
ISO-2022-CN
ISO-2022-KR
EUC-JP
GB2312
X-EUC-TW
EUC-KR
Shift_JIS
KOI8-R
Big5
X-MAC-LATIN1
HZ-GB-2312
X-NEXTSTEP
cp037



This page was last built with Frontier on a Macintosh on Tue, Sep 26, 2000 at 22:38:44. Thanks for checking it out! Nobumi Iyanaga