Part of Nobumi Iyanaga's website. n-iyanag@nifty.com. 4/1/07.

Conversion of Word files with diacritical fonts to Unicode

Version 0.7.2 (2007-02-06)

Introduction

This is another page about the problem of old diacritical fonts, used especially for the transliteration of Asian languages, which should now be converted to Unicode fonts. As I discussed on another page (see East Asian Diacritical Fonts and Unicode), in the old Classic Mac OS days we used specially encoded fonts for the transliteration of Asian languages: for example Appeal, Hobogirin, the Norman fonts, etc. But when the age of Unicode arrived with OS X, these old fonts became more or less obsolete. Old diacritical fonts were not good for data exchange -- for example, if we give files in which we used Appeal to someone who doesn't have that font, that person will be unable to read them. They were not good for searching either: for example, it would be impossible to search for a Sanskrit term in a text file if we use one of these fonts. And it seems that in some cases old fonts no longer work in OS X applications: such is the case for BharatiTimes, and the TrueType version of Appeal stopped working on my machine under OS 10.4x (I seem to have managed to fix this problem by using the PostScript version of the same font...). This may happen more often on Intel Macs, which cannot run Classic OS.

Now we can use good Unicode fonts for this purpose: there are for example Gandhari Unicode, Gentium, Thryomanes or Times Ext Roman; and we can use convenient keyboard layouts for typing (see my AsianExtended keylayout for example). But what can we do about the old files that we created with the old fonts? I think this is a very serious problem for a great many researchers -- even if some of them are still not very conscious of it.

For my own files, all of which I created with Classic Nisus Writer, I found an automated solution long ago (this was the main theme of my page East Asian Diacritical Fonts and Unicode, already referred to). But I always thought that there are many more people who worked with MS Word, and that they have the same problem. As I hardly use Word at all, I don't know Word macros... But I figured out that by converting the files to the RTF format and working on the RTF text, it would be possible to convert these old fonts to modern Unicode fonts.

I wrote a Perl script to do that, and tested it on some simple sample files. My script seems to work well. So I "embedded" this Perl script in an AppleScript droplet, which automates the whole process.

It is this set of scripts that I present here.

Package contents and how to use:

The package that you will download from this page (see below) contains:

Put the folder anywhere you want, but do NOT change the folder structure: especially, the folder encodings, with its data files MUST be in the same folder as the droplet conv_word_diacritical_file.app.

As the names of the data files in the folder named encodings indicate, I have managed to create conversion tables for fourteen fonts with non-standard encodings: Appeal, BharatiTimes, Hobogirin, ITimesSkRom, LotusPalatino, Macron, Minion-Indologist, MyTimes, Norman, NormanSk, Normyn, SanskritTimes, South Asia Roman, and TimesCSXPlus. In fact, I could test only two simple files in Norman and one simple file in Appeal. For some reason unknown to me, my copy of MS Word says that it cannot use many fonts (in fact, all the fonts whose names come after Hobogirin alphabetically), and its Font menu displays only fonts with names from "A" to "G"..., although it can open without any problem files containing fonts that are not listed in the Font menu. And BharatiTimes doesn't appear in the Font menu of any OS X application. I could use the two Norman font files for testing thanks to a friend who sent them to me... This is to say that I am not sure whether all the other conversion tables work as expected, but I think the principle is the same for all of them. If you experience any unexpected results, please send your files to me, along with the font, so that I can test them.

The Perl script named "diacritical_font_convert.pl" is the "embedded" script in my AppleScript droplet; I included it in the package only for those who would be curious to understand how the script works.

Version history
New in version 0.7.1:
I added the conversion tables for LotusPalatino, Macron, Normyn, SanskritTimes and South Asia Roman.

Note that Norman, NormanSk, Normyn, etc. have very similar conversion tables, but they differ at some code points. And EVEN IF two fonts have exactly the same conversion table, there must be one conversion table for each font. This is because the fonts are distinguished by their name, not by their conversion table.
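The point about one table per font name can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionary layout, the function name table_for and the single table entry are invented here, and the real data files in the encodings folder may be organized quite differently.

```python
# Hypothetical in-memory form of the per-font tables.
# The single entry (code point 140 -> U+0101) is made up for illustration
# and is not copied from the real Norman table.
NORMAN = {140: "\u0101"}

TABLES = {
    "Norman": NORMAN,
    "NormanSk": dict(NORMAN),  # identical content here, but still its own entry
}

def table_for(font_name):
    """Return the table registered under exactly this font name."""
    try:
        return TABLES[font_name]
    except KeyError:
        raise ValueError("no conversion table for font %r" % font_name)
```

With such a layout, a font whose name is not registered is simply not found, even if its mapping would be the same as that of a registered font.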

Latest Note (for v. 0.7.1):
The first release had a VERY SILLY bug, and didn't work at all. I hope that with this one, it will work!

I also changed the Perl code so that it works on OS 10.3x machines.

New in version 0.7.2:
There may be cases in which you use, for example, the font Appeal in your Word files, but words in Italic or Bold are actually in the fonts AppealItalic or AppealBold. The first release, as well as version 0.7.1, was unable to handle cases of this kind. From version 0.7.2 onward, the droplet is able to convert all the different fonts of the same "family" to a Unicode font (the conversion will be to a single font, not to the different fonts of a "family" of Unicode fonts).
When you have a set of fonts of the same "family", they will be treated as one font. So, you would choose "No" in the fourth dialog which will ask you: Does your file contain more than one Roman font?... (see below).
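As a rough illustration of what treating a "family" as one font can mean, here is a small Python sketch. The rule it uses (a base name followed by one of a few style suffixes) is an assumption made for this example, and the function name same_family is invented; it is not necessarily how the droplet decides.

```python
# Style suffixes assumed for this sketch (taken from the Appeal example).
STYLE_SUFFIXES = ("", "Italic", "Bold", "BoldItalic")

def same_family(font_name, base_name):
    """True if font_name is base_name itself or base_name plus a style suffix."""
    return any(font_name == base_name + suffix for suffix in STYLE_SUFFIXES)

print(same_family("AppealItalic", "Appeal"))
print(same_family("Times", "Appeal"))
```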

Basic use:

Before beginning the conversion, you should complete the following steps:
  1. Make sure that your file is a .doc file, with the extension .doc; if it has no extension, add .doc to its name;
  2. You must know the name of the "problematic font", for example Appeal or Norman. You must also know whether the file contains more than one Roman font. In most cases, I think people use only ONE Roman font (there will be no problem if the file contains fonts for other scripts, for example Japanese or Chinese fonts; and there will be no problem if it contains different fonts of the same family, for example Appeal and AppealItalic, etc. -- on this point, see below). However, there may be cases in which your file contains, for example, Norman and Times. It is even possible that it contains more than one "special diacritical font", for example Norman and Appeal. My droplet cannot convert more than one "special diacritical font" at once; but if you drop your file on it more than once, it should be able to convert any number of "special diacritical fonts."

  3. Also, make sure that MS Word is running when you run the droplet conv_word_diacritical_file.app.
  4. And above all, you should install at least one good Unicode Roman font, and a good keyboard layout to use with it...! (see above, my Introduction).

Now that you have done these preparatory steps (and your Word is running...), you will drag your .doc file onto the icon of conv_word_diacritical_file.app; there will be four dialogs:

  1. First, a folder choosing dialog will ask you to select a folder in which you want to put your converted file(s). -- You can create a new folder in this dialog if you press the button New Folder which is at the lower left corner of the dialog.
  2. The second dialog will ask you to enter the name of the font to be converted. Enter there for example Norman, or Appeal -- if the font name you enter here does not exist in the file, the droplet will quit with an error message, so be sure to enter the exact name of the font!
  3. The third dialog will ask you to enter the name of a Unicode font to which you want to convert your old diacritical font. At this moment, you will simply ignore this dialog, and press the Return key, without entering anything (I will explain this option below).
  4. The fourth and last dialog will ask you: Does your file contain more than one Roman font?.... As I said above, in most cases you use only one Roman font in your file; in that case, press the default button No; otherwise, press the button Yes. -- I think this option is useful, because if there is only one Roman font, the process can be very simple and fast, while if there is more than one Roman font, it must be much more complicated and slower.

When you have entered these options, a new list selection window will ask you to confirm the configuration; if you press the Cancel button at this step, you will have to start again from the beginning; if everything is correct, press the OK button.

The droplet will begin to work:

(A little) better (?) use:

Now I must explain the third dialog, which asks you to enter the name of the Unicode font to which you want to convert your old diacritical font. You can enter in that dialog, for example, Gentium, if your file contains some text written in the font Gentium, or at least some trace of that font.

This dialog lets the script know what Unicode font you want, but if the file has no trace of that font, the script cannot proceed. The font Times New Roman is the default Roman font in every Word file, so that we know that there is always this option, and this is why we entered nothing in the third dialog in the example above.
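To make the idea of a "trace" more concrete: an RTF file lists its fonts in a font table near the top, in a group that begins with \fonttbl, and each entry gives a font number and a font name. The Python sketch below reads such a table from a deliberately simplified sample. The sample text, the function name font_table and the regular expression are assumptions made for illustration; font tables written by Word contain more information per entry, which this sketch does not try to handle.

```python
import re

# Simplified, hand-written RTF sample (not produced by Word).
SAMPLE = (r"{\rtf1\mac{\fonttbl{\f0\froman Times New Roman;}"
          r"{\f1\fnil Appeal;}{\f2\fnil Gentium;}}\f1 text}")

# One simple font table entry: {\fN\control\words Font Name;}
ENTRY = re.compile(r"\{\\f(\d+)(?:\\[a-z]+-?\d*)* ?([^;{}\\]+);\}")

def font_table(rtf):
    """Map each font name found in simple font table entries to its number."""
    return {name: int(number) for number, name in ENTRY.findall(rtf)}

fonts = font_table(SAMPLE)
has_trace = "Gentium" in fonts  # is the target font declared in the file?
print(fonts, has_trace)
```

The lookup is by exact name, so a name with spaces, such as Times New Roman, has to be given with its spaces.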

It seems that you can leave a "trace" of a font, for example Gentium, in your file if you do the following steps:

  1. Open your original file, xxxx.doc, in Word, choose the font Gentium, and type anything -- a single "a" should be enough.
  2. Save your file at this moment, then delete the text typed in the previous step, then save the file again.

After you have done these two steps, you can drop your file onto conv_word_diacritical_file.app; and then, you can enter Gentium in the third dialog. This will change all your "special diacritical font" to Gentium, so that you will not have to change the font after the conversion.

If this does not work, you will have to leave your typed text in Gentium without deleting it. You would delete it after the conversion has been done.

Note that the font name that you enter in the third dialog must be exactly that of the font, so, you must enter Gandhari Unicode with a space in the name; you must enter Times Ext Roman, with the two spaces in the name.

I am not sure if this is better, but this way you have control, before the conversion is done, over the font to which the file will be converted.


Some technical details to understand how this conversion system works:

I must admit that the interface for this conversion droplet is not very intuitive. So it is perhaps good to explain some technical details in order to get you started.

Non-Unicode Roman fonts (such as (old) Times, (old) Geneva, Appeal, Norman, etc.) contain 256 code points (from "0" to "255"). These code points can be divided into three categories or ranges:

  1. from 0 to 31: control characters, reserved by the system (example: 10: linefeed; 13: carriage return, etc.)
  2. from 32 to 127: "lower ASCII", i.e. the most common "alpha-numeric" characters, such as ABCD..., abcd..., 0123... The correspondence between code points and characters is in principle fixed (for example, "A" is at code point 65 in all kinds of fonts) -- although there may be exceptions... (for example, BharatiTimes uses code point 36, normally "$", for the character "a with macron");
  3. from 128 to 255: "higher ASCII" characters, which are normally accented characters or signs/marks of different kinds. There are different standards for this range, such as MacRoman, Windows 1252 or ISO-8859-1, etc.; our "special diacritical fonts", like Appeal or Norman, etc., normally use this range to place characters such as "a with macron" or "s with acute accent" at various code points. Example: in MacRoman (e.g. (old) Times, (old) Geneva), code point 140 is reserved for "a with ring above", but in Appeal it is used for "a with macron".
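The Appeal example in the third range can be tried out with a few lines of Python. The MacRoman part uses Python's built-in mac_roman codec. The one-entry APPEAL_TABLE and the function decode_with_table are assumptions made for this sketch; the table contains only the single code point mentioned above and is not the table shipped in the encodings folder.

```python
# Byte 140 means different things depending on the encoding of the font.
raw = bytes([140])

# In a MacRoman font (old Times, old Geneva) it is "a with ring above".
as_macroman = raw.decode("mac_roman")

# Hypothetical one-entry table for Appeal, based on the example above:
# code point 140 -> U+0101 (LATIN SMALL LETTER A WITH MACRON).
APPEAL_TABLE = {140: "\u0101"}

def decode_with_table(data, table):
    """Decode bytes with a font table; bytes not in the table are read as MacRoman."""
    return "".join(table.get(b, bytes([b]).decode("mac_roman")) for b in data)

as_appeal = decode_with_table(raw, APPEAL_TABLE)
print(as_macroman, as_appeal)
```

The same byte therefore comes out as two different Unicode characters, which is why the converter has to know which font a given stretch of text was typed in.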

So, strictly speaking, we should distinguish between font and encoding: for example, Appeal, AppealItalic, AppealBold (or AppealBoldItalic) all use the same code mapping (encoding), but they are different fonts. -- But for our conversion purpose, these different fonts of the same encoding are treated as only one font.

In contrast, if your file contains characters of the "higher ASCII" range both from Times and from Appeal, for example an "a with ring above" (140 in Times) and an "a with macron" (140 in Appeal), we must distinguish the portions of your file by the font used. We will have to split the body text according to the fonts used, and we will convert ONLY the portions in which you have used Appeal (or AppealItalic, or AppealBold, etc.). In such cases, the conversion will be much more complicated, and will be slower...
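The splitting by font can be sketched in Python on a simplified piece of RTF. In RTF, a font switch is written \fN (N being the font's number in the font table), a character of the "higher" range is written \'hh (two hexadecimal digits), and a Unicode character is written \uN followed by a substitute character. The sketch below is only an illustration under several assumptions: the function name convert_runs and the one-entry table are invented, and it ignores RTF groups (braces) and everything else that a file saved by Word contains.

```python
import re

# Hypothetical one-entry table for the font being converted:
# byte 0x8C (140) -> U+0101 (a with macron).
TABLE = {0x8C: 0x0101}

# Either a font switch (\fN) or a hex-escaped byte (\'hh).
TOKEN = re.compile(r"\\f(\d+) ?|\\'([0-9a-fA-F]{2})")

def convert_runs(body, target_font, table):
    """Rewrite \\'hh escapes as \\uN? only in runs set in target_font."""
    current = None  # number of the font currently in effect
    out = []
    pos = 0
    for m in TOKEN.finditer(body):
        out.append(body[pos:m.start()])  # plain text is copied unchanged
        pos = m.end()
        if m.group(1) is not None:       # font switch: remember it, keep it
            current = int(m.group(1))
            out.append(m.group(0))
        else:                            # hex-escaped byte
            byte = int(m.group(2), 16)
            if current == target_font and byte in table:
                out.append("\\u%d?" % table[byte])
            else:
                out.append(m.group(0))
    out.append(body[pos:])
    return "".join(out)

print(convert_runs(r"\f0 \'8c and \f1 \'8c", 1, TABLE))
```

Here font 0 stands for Times and font 1 for Appeal: the first \'8c is left alone, and only the second one, which follows the switch to font 1, is replaced.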

-- And if you have put a character in Gentium or any other Unicode font, to leave a trace of it in your file (see the previous section), this will not be counted as a different font!


Download

Please download the package from this link (101K to download).

I would appreciate any feedback, comments, bug reports or requests.

Thank you!



This page was last built with Frontier on a Macintosh on Sun, Apr 1, 2007 at 10:39:44 AM. Thanks for checking it out! Nobumi Iyanaga