Language Tag and Unicode conversion

Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 10/17/04.

This page is in Unicode

Introduction
Presentation of the problem to be solved, and general information:
The problem I try to solve here is how to convert to Unicode a multilingual text that contains chunks written with special diacritical fonts, such as Appeal, Hobogirin, etc. I have already written a page entitled East Asian Diacritical Fonts and Unicode, in which I described how text written with such special fonts can be converted to Unicode, but that method cannot be used when the text is multilingual, with mixed languages.
The main information that can be used to solve this problem is the font information. In the Mac OS, fonts are distinguished by their ID numbers and by their names, and font ID numbers are allocated according to the language (or "script"). Here is the table of the scripts supported by Mac OS and the corresponding font ID ranges (cf. http://developer.apple.com/techpubs/mac/Text/Text-534.html#HEADING534-0) [This URL no longer works: as of October 2004, the same information can be found at http://developer.apple.com/technotes/te/te_02.html]:

Script name            Script code   Font ID range (hexadecimal)   Font ID range (decimal)
Roman                  0             $0000-$3fff                   0-16383
Japanese               1             $4000-$41ff                   16384-16895
Chinese(Trad.)         2             $4200-$43ff                   16896-17407
Korean                 3             $4400-$45ff                   17408-17919
Arabic                 4             $4600-$47ff                   17920-18431
Hebrew*                5             $4800-$49ff                   18432-18943
Greek                  6             $4a00-$4bff                   18944-19455
Russian                7             $4c00-$4dff                   19456-19967
Right-Left symbols     8             $4e00-$4fff                   19968-20479
Devanagari             9             $5000-$51ff                   20480-20991
Gurmukhi               10            $5200-$53ff                   20992-21503
Gujarati               11            $5400-$55ff                   21504-22015
Oriya                  12            $5600-$57ff                   22016-22527
Bengali                13            $5800-$59ff                   22528-23039
Tamil                  14            $5a00-$5bff                   23040-23551
Telugu                 15            $5c00-$5dff                   23552-24063
Kannada                16            $5e00-$5fff                   24064-24575
Malayalam              17            $6000-$61ff                   24576-25087
Sinhalese              18            $6200-$63ff                   25088-25599
Burmese                19            $6400-$65ff                   25600-26111
Cambodian              20            $6600-$67ff                   26112-26623
Thai                   21            $6800-$69ff                   26624-27135
Laotian                22            $6a00-$6bff                   27136-27647
Georgian               23            $6c00-$6dff                   27648-28159
Armenian               24            $6e00-$6fff                   28160-28671
Chinese(Simpld.)       25            $7000-$71ff                   28672-29183
Tibetan                26            $7200-$73ff                   29184-29695
Mongolian              27            $7400-$75ff                   29696-30207
Ethiopian              28            $7600-$77ff                   30208-30719
Non-CyrillicSlav.      29            $7800-$79ff                   30720-31231
Vietnamese             30            $7a00-$7bff                   31232-31743
Sindhi                 31            $7c00-$7dff                   31744-32255
Uninterpreted Symbol   32            $7e00-$7fff                   32256-32767

* I am sorry, there was an error in the range given for Hebrew: I had written "$4800-$4ff9"/"18432-20473", which was wrong; the correct value is "$4800-$49ff"/"18432-18943". Thank you, Kino-san, for pointing this out!
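
The regular layout of this table can be captured in a few lines. The following is a minimal sketch in Python (not the MacPerl or AppleScript used elsewhere on this page), assuming that Roman fonts have IDs below $4000 and that each following script code owns one block of 512 IDs starting at $4000. The function name script_code_from_font_id is only illustrative.

```python
ROMAN_LIMIT = 0x4000     # font IDs below this value belong to Roman (script code 0)
IDS_PER_SCRIPT = 0x200   # assumption: every other script owns a block of 512 IDs
MAX_FONT_ID = 0x7FFF     # last ID of the last block (Uninterpreted Symbol)


def script_code_from_font_id(font_id):
    """Return the script code for a classic Mac OS font ID."""
    if not 0 <= font_id <= MAX_FONT_ID:
        raise ValueError("font ID out of range: %r" % (font_id,))
    if font_id < ROMAN_LIMIT:
        return 0
    # Block number counted from $4000, shifted by one because Roman is code 0.
    return (font_id - ROMAN_LIMIT) // IDS_PER_SCRIPT + 1


print(script_code_from_font_id(16384))  # first Japanese font ID
print(script_code_from_font_id(18943))  # last Hebrew font ID
```

A lookup of this kind does the same job as the "get script from font ID" command used later on this page, but from the table alone.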

Another source of information that will be needed is the set of correspondence tables between the encodings used in Mac OS and Unicode. These can be obtained from the Unicode Consortium: you will have to download all the files found at http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/. More practically, you can download and use the tables contained in the package named "Uni2Multi f", written by Nowral-san: http://member.nifty.ne.jp/Nowral/31_Unicode/31_Unicode.html
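
As an illustration of how such a table can be used, here is a minimal Python sketch. It assumes the general layout of the Apple mapping files: comment lines starting with "#", and data lines giving a Mac code and a Unicode value in 0x hexadecimal notation, sometimes joined with "+" for sequences. The three-line sample and the function names are hypothetical, and the sketch only decodes single-byte encodings.

```python
from typing import Dict

# Hypothetical excerpt in the style of ROMAN.TXT (tab-separated columns).
SAMPLE_TABLE = "\n".join([
    "# Mac code <tab> Unicode <tab> # comment",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS",
    "0x8E\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE",
])


def parse_mapping_table(text: str) -> Dict[int, str]:
    """Build a {Mac code: Unicode string} dictionary from table text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        fields = line.split()
        if len(fields) < 2:
            continue
        mac_code = int(fields[0], 16)
        # Keep only the 0x... parts; anything else (assumed to be a hint
        # such as a direction marker) is ignored in this sketch.
        parts = [p for p in fields[1].split("+") if p.lower().startswith("0x")]
        table[mac_code] = "".join(chr(int(p, 16)) for p in parts)
    return table


def decode_single_byte(data: bytes, table: Dict[int, str]) -> str:
    """Decode single-byte text; a byte missing from the table raises KeyError."""
    return "".join(table[byte] for byte in data)


table = parse_mapping_table(SAMPLE_TABLE)
print(decode_single_byte(b"\x41\x80\x8e", table))
```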
Once we have this information, we can look for a solution. There may be more than one. One of them, probably the best from the point of view of simplicity of use, is to use the font information included in the styled text. But this would require further information about the structure of styled text, and we would have to deal with binary data. This is certainly possible (and I will probably write another web page explaining this method), but styled text data seems rather heavy, requiring much memory and a long time (at least when we try to process it with a MacPerl script). On the other hand, there are applications that don't use styled text (Frontier 4.2.3, for example), and plain text has the advantage of being universally accessible and easy to process in any scripting or programming language. This is why the "language tag" method may be a valuable solution.

The "language tag" method:
The "language tag" is the simple idea of enclosing each chunk of text written in a single language between a pair of tags, like this:
<language="japanese">あいうえお</language><language="roman">c’est un bel été</language><language="ITimesSkRom">Śiva Maheśvara</language>
These tags can be easily parsed by scripts (for example MacPerl scripts), and the text enclosed between a pair of these tags can be easily converted to Unicode (or any other encoding), according to the value of the language tag.
As you can see in the above example, you can use font names as "language names". This is important, because it enables you to switch the conversion table and algorithm according to font names, such as "Appeal" or "Hobogirin", etc.
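
To show how little parsing these tags need, here is a minimal sketch in Python (the actual scripts on this page are written in MacPerl). It splits a tagged text into (language, chunk) pairs and keeps untagged text with the language None. The function name is hypothetical, and the sample is already Unicode for readability, whereas the real input would be bytes in mixed legacy encodings.

```python
import re

# One tagged chunk: <language="NAME">TEXT</language>
TAG_RE = re.compile(r'<language="([^"]*)">(.*?)</language>', re.DOTALL)


def split_tagged_text(text):
    """Return a list of (language, chunk) pairs; language is None if untagged."""
    chunks = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:                  # untagged text before this tag
            chunks.append((None, text[pos:match.start()]))
        chunks.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):                          # untagged text after the last tag
        chunks.append((None, text[pos:]))
    return chunks


sample = ('<language="japanese">あいうえお</language>'
          '<language="roman">c’est un bel été</language>'
          '<language="ITimesSkRom">Śiva Maheśvara</language>')
for language, chunk in split_tagged_text(sample):
    print(language, chunk)
```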

New (02.12.28) : Nisus macro "languageTag_macro" for Nisus Writer
A rather simple Nisus macro can be used to insert language tags in a multilingual text in Nisus Writer. I wrote a sample macro, which supports three languages:

MacRoman
Japanese
Traditional Chinese
and six special fonts:

ITimesSkRom
Appeal
Hobogirin
NormanSk
Normyn
Times_Norman

It would be easy to add other scripts and special fonts to this macro.
Note that this macro will run in Nisus Writer 6.5 running in Classic mode, in OS X. -- And you can use MacPerl scripts on OS X as well.

AppleScript script "Insert language tag" for Style
Scriptable, multi-style text editors, such as Style or Tex-Edit Plus, can be scripted in such a way that language tags are inserted automatically: the text of a document can be parsed as "style runs", and for each style run we can get the style attributes, including the font information. I wrote an AppleScript script named "Insert language tag" for the editor Style. To use it, you will have to put this script in the "Style Scripts" folder, inside the Style folder.
When you put this script in the "Style Scripts" folder, a menu item named "Insert language tag" will appear in the Script menu. To try it, open a multilingual document, and select this menu item. A new window will open, containing the tagged text.

To run this script, you will need one AppleScript Scripting Addition named "System Misc" (http://www5c.biglobe.ne.jp/~quo/libix.htm#s_misc). This scripting addition contains some important commands related to the font manager and the script manager:
get font ID from name: Get font ID for the specified font name

get font ID from name string -- the font name
Result: small integer -- the font ID (or 0 if not installed)

get script from font ID: Get script code corresponding to the specified font ID

get script from font ID small integer -- the font ID
Result: small integer -- the script code

With these commands, we can get the script code from the font used in a chunk of text; the correspondence between the script code and the script name is given by the table of scripts quoted above. To use this scripting addition, you must download it, expand it, and place it in the Scripting Additions folder, inside the System Folder (inside the Extensions folder, if your system is earlier than OS 8.5 [?]).

As this is a scripted solution, we can customize it at will. We can define a property named "specialFonts", and a property named "specialFontsAttrbutes". For example, and for my own use, I have the following two lines:
property specialFonts : {"ITimesSkRom", "Appeal", "Hobogirin"}
property specialFontsAttrbutes : {"", "", "non conversion"}

In this example, chunks of text written with the fonts named "ITimesSkRom" and "Appeal" will be tagged in the following way:
<language="ITimesSkRom">aiueo</language><language="Appeal">kakikukeko</language>
and chunks of text written with the font named "Hobogirin" will be tagged:
<language="non conversion">aiueo</language>

Of course, the MacPerl script which does the conversion itself must contain special conversion table(s) and algorithm(s) that will be activated according to these special tags.
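
The tagging rule described above can be sketched as follows, in Python rather than AppleScript. The two lists mirror the two properties; the representation of a style run as a (font name, script code, text) triple and the table SCRIPT_NAMES are assumptions made for this illustration, not the structures used by the actual script.

```python
SPECIAL_FONTS = ["ITimesSkRom", "Appeal", "Hobogirin"]
SPECIAL_FONTS_ATTRIBUTES = ["", "", "non conversion"]

# Assumed tag names for ordinary scripts; only these two appear in the
# examples on this page, so any other script code raises KeyError here.
SCRIPT_NAMES = {0: "roman", 1: "japanese"}


def tag_name(font_name, script_code):
    """Choose the tag value for one style run."""
    if font_name in SPECIAL_FONTS:
        attribute = SPECIAL_FONTS_ATTRIBUTES[SPECIAL_FONTS.index(font_name)]
        # An empty attribute means: use the font name itself as the tag value.
        return attribute or font_name
    return SCRIPT_NAMES[script_code]


def tag_runs(runs):
    """runs: iterable of (font_name, script_code, text) triples."""
    return "".join('<language="%s">%s</language>' % (tag_name(font, code), text)
                   for font, code, text in runs)


print(tag_runs([("ITimesSkRom", 0, "aiueo"),
                ("Appeal", 0, "kakikukeko"),
                ("Hobogirin", 0, "aiueo")]))
```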

And of course, you can insert these tags manually, without the help of the AppleScript script. In many cases this may even be preferable, because the scripted solution is often too sensitive to the font information and tends to insert too many tags (which may slow down the conversion script itself). But if you tag your text manually, you must be very careful to use the exact format: the conversion script will be thrown off completely by any unmatched opening or closing tag, unmatched double quote, etc.
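
Since a single malformed tag is enough to spoil the conversion, it may be worth checking manually tagged text first. The following is a minimal sketch of such a check in Python; it is not part of the package offered on this page, and the function name and the wording of the messages are hypothetical.

```python
import re

OPEN_RE = re.compile(r'<language="[^"<>]*">')
# Well-formed opening tag, closing tag, or anything that merely starts like one.
TOKEN_RE = re.compile(r'<language="[^"<>]*">|</language>|</?language')


def check_language_tags(text):
    """Return a list of problems found; an empty list means the tags match."""
    problems = []
    inside = False
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if OPEN_RE.fullmatch(token):
            if inside:
                problems.append("opening tag inside a chunk at offset %d" % match.start())
            inside = True
        elif token == "</language>":
            if not inside:
                problems.append("closing tag without opening tag at offset %d" % match.start())
            inside = False
        else:
            problems.append("malformed tag at offset %d" % match.start())
    if inside:
        problems.append("last opening tag is never closed")
    return problems


print(check_language_tags('<language="roman">abc</language>'))   # well-formed
print(check_language_tags('<language="roman>abc</language>'))    # missing quote
```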

Note that, unfortunately, this script, as well as the other AppleScript script for Style included in this page (see below), will not run on OS X, because there is no OS X version (not yet?) of the OSAX "System Misc". I am sorry for that. I investigated other possibilities, but found nothing: the text property "writing code" in the Style Text Suite seems broken, at least in OS X; the same property in Tex-Edit Plus 4.3 for OS X always returns "-1"; and there is no such property in the Jedit4 AppleScript dictionary. So, for now, you will have to use the Nisus macro described above.

MacPerl Conversion Script "languagetag2Unicode.pl"
The conversion script is written in MacPerl. It is fast and powerful enough for practical use. The text to be converted is passed through @ARGV. The script contains conversion tables for the fonts ITimesSkRom, Appeal, Symbol, and Zapf Dingbats. You can add other conversion tables using the Perl scripts that you will find in the page East Asian Diacritical Fonts and Unicode (generated_perl_scripts.sit.hqx [but you should convert the character notation to hexadecimal notation...]). The script reads, as needed, the conversion tables contained in the folder "Macintosh HD:Desktop Folder:Perl scripts:Uni2Multi f:" (see line 85 of the script):
my $cwd = "Macintosh HD:Desktop Folder:Perl scripts:Uni2Multi f:";
You must change this line so that it matches the path of the folder named "Uni2Multi f" (or any other folder containing Apple's conversion tables) on your hard disk.

The encodings supported for now are MacRoman, Japanese (Shift-JIS), Traditional Chinese (Big5), Korean (EUC-KR) and Simplified Chinese (GB), but you can add any encoding if you wish (for the Indic languages, right-to-left languages and composite characters, new algorithms must be added, which is not trivial...). The output will be in UTF-8 format (you can easily change the script so that the output text is in UTF-16 format). Chunks of text tagged with the "non conversion" attribute will not be converted at all (so they must already be valid UTF-8 text). All the language tags will be removed from the output text.
The input text may begin with untagged text, and may contain untagged text elsewhere as well. The only requirement is that all the tags match, each opening tag having its closing tag.
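
The overall logic of such a conversion can be sketched as follows. This is a Python illustration, not the MacPerl script itself: it uses Python's built-in codecs in place of Apple's mapping tables (Apple's encodings differ from the standard ones for some characters). The tag values other than "roman", "japanese" and "non conversion" are hypothetical. The sketch also assumes that untagged text is plain ASCII and can be copied unchanged, and it does not handle the special fonts, which need their own tables.

```python
import re

# Tag value -> Python codec. "roman" and "japanese" appear in the examples on
# this page; the other three tag values are invented for this sketch.
CODECS = {
    "roman": "mac_roman",
    "japanese": "shift_jis",
    "chinese_trad": "big5",
    "korean": "euc_kr",
    "chinese_simp": "gb2312",
}

TAG_RE = re.compile(rb'<language="([^"]*)">(.*?)</language>', re.DOTALL)


def convert_to_utf8(data):
    """Convert language-tagged bytes to UTF-8 bytes, removing the tags."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(data):
        out.append(data[pos:match.start()])      # untagged text, copied as is
        name = match.group(1).decode("ascii")
        chunk = match.group(2)
        if name == "non conversion":
            out.append(chunk)                    # assumed to be UTF-8 already
        else:
            # An unknown tag value raises KeyError in this sketch.
            out.append(chunk.decode(CODECS[name]).encode("utf-8"))
        pos = match.end()
    out.append(data[pos:])
    return b"".join(out)


tagged = (b'Intro: <language="japanese">' + "あいうえお".encode("shift_jis")
          + b'</language> <language="roman">' + "c’est un bel été".encode("mac_roman")
          + b'</language>')
print(convert_to_utf8(tagged).decode("utf-8"))
```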

This script itself can be placed anywhere you like. You can call it from within an AppleScript script, or a Frontier script, etc.

New (02.08.06) :
I added another MacPerl script, a droplet named "languageTag2UTF8.pl", which will convert a "language-tagged" text file to UTF-8. The result will be in a file, in the same folder, with the extension ".out".
As with the other MacPerl script, you should change the following line (line 112)

my $cwd = "Macintosh HD:Desktop Folder:Perl scripts:Uni2Multi f:";

so that the folder path matches the path of the folder containing the conversion tables.

AppleScript script which inserts language tags and converts text at once
Finally, I added an AppleScript script for Style which does the whole process at once ("Insert language tag and2UTF-8"): it inserts language tags in a new document, and converts the tagged text into UTF-8 in another new document.
In fact, after inserting the language tags, it simply calls the MacPerl script "languagetag2Unicode.pl", which does the conversion. The first time you run this script, it will ask you to locate the folder in which you have placed that MacPerl script. You can place it anywhere you want, but the folder "Style Scripts" inside the Style folder may be a good place. MacPerl should be running when you run this command.

Please note that this script will not run on Mac OS X (see above).

If you use OS 9.1 or later, you can verify whether the conversion was successful. To do that, save the result as a TEXT file, convert it from UTF-8 to UTF-16 using Cyclone or Chinese Text Converter, then open the new file with WorldText and choose "Font Substitution" in the "Layout" menu. Of course, to display some special diacritical characters, you will need a special font, such as TITUS Cyberbit Basic.
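
For readers without Cyclone or Chinese Text Converter, the UTF-8 to UTF-16 step can also be done with a few lines of script. The following is a minimal Python sketch. It assumes that big-endian UTF-16 with a byte order mark is acceptable to the program that opens the file, which I have not verified for WorldText, and the function name is hypothetical.

```python
BOM_BE = b"\xfe\xff"   # byte order mark announcing big-endian UTF-16


def utf8_to_utf16(data):
    """Re-encode UTF-8 bytes as big-endian UTF-16 bytes with a leading BOM."""
    text = data.decode("utf-8")        # raises UnicodeDecodeError on bad input
    return BOM_BE + text.encode("utf-16-be")


converted = utf8_to_utf16("Śiva".encode("utf-8"))
print(converted.hex())
```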

Download

Download the package containing the AppleScript script named "Insert language tag" for Style, the AppleScript script named "Insert language tag and2UTF-8" for Style, the MacPerl script named "languagetag2Unicode.pl", the MacPerl droplet named "languageTag2UTF8.pl", and the Nisus macro named "languageTag_macro" from here (48 KB). Please note that this package does not contain a ReadMe, so you should save this page as documentation for the software contained in it.

You can download the font Appeal here, and the font ITimesSkRom here.
And here is the link to the page where you can download the font "Hobogirin": http://www.meijigakuin.ac.jp/~pmjs/resources/fonts/mac/

Thank you, and have fun!

Go to Research tools Home Page
Go to NI Home Page

Mail to Nobumi Iyanaga

This page was last built with Frontier on a Macintosh on Sun, Oct 17, 2004 at 10:20:17 AM. Thanks for checking it out! Nobumi Iyanaga