Batch Convert Files to UTF-8

Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 1/11/07.

This is yet another utility for converting text files in legacy encodings, such as MacRoman, Shift_JIS or Big5, to UTF-8. Its special feature is that it parses a folder hierarchy recursively, so that you can convert many files at once, even if they sit in a deep sub-folder of the hierarchy.
This is an AppleScript droplet application: it contains a Perl script, and that script calls iconv, a Unix utility specialized in converting between encodings, to perform the actual conversion. The Perl script also contains a routine that does a "further conversion" (see below), and a routine that converts the end-of-line characters.
I think iconv comes with the System from OS 10.3 onward -- so if your system is earlier than Panther, this utility will not run. It will certainly run on OS 10.4.x (Tiger). The file batch_convert2utf8.pl, contained in the package, lists all the encodings supported by iconv.
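
The core conversion step can be sketched roughly as follows. This is only an illustration in Python, not the Perl script from the package: it uses Python's built-in codecs as a stand-in for iconv, and the function name to_utf8 and the codec names ("mac_roman", "latin_1", "shift_jis") are assumptions made for the example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    # The source encoding must really match the file: a wrong choice either
    # raises UnicodeDecodeError or silently yields wrong characters.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# The same character "é" is a different byte in MacRoman and in ISO-8859-1,
# but both end up as the same two bytes in UTF-8.
print(to_utf8(b"\x8e", "mac_roman") == to_utf8(b"\xe9", "latin_1"))
```

The example shows why every source file must be in the encoding you select: the byte 0x8E means "é" only when it is read as MacRoman.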

How to use...

Drag and drop a folder full of text files of one of the following encodings:

MacRoman
ISO-8859-1
Shift_JIS
Big5
Big5-ETen
-- In fact, many other encodings are supported (see below), but for this utility, I thought these were enough. If you need other encodings, don't hesitate to write to me; I will update the utility for you. Note that Shift_JIS is the encoding used by SAT files; Big5-ETen is the Windows version of the Traditional Chinese encoding, and is used by CBETA files.
As I said, the source folder may contain sub-folders.

The droplet application will launch and ask you to choose a folder in which a new folder will be created: this new folder will be the destination folder, which will hold your converted files. To make the converted files easily accessible, it is a good idea to choose your Desktop in this dialog.

Another dialog will follow, asking you to enter the name of the new folder. It will offer a default name, which is the name of the source folder plus "_utf8". In most cases, you can simply accept this name (by pressing the Return key). The folder hierarchy of the source folder will be reproduced inside the new folder.
If a folder of that name already exists and contains files with the same names as the source files, the conversion process will stop and ask you to enter another folder name (it will not overwrite the existing files).
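
The behaviour described above (reproduce the hierarchy, never overwrite) could be sketched like this in Python. It is a minimal sketch under my own assumptions, not the droplet's actual code: the function names plan_outputs and find_conflicts are hypothetical.

```python
import os


def plan_outputs(source_root, dest_root):
    """Pair every file under source_root with the same relative path under dest_root."""
    pairs = []
    for folder, _subfolders, files in os.walk(source_root):
        relative = os.path.relpath(folder, source_root)
        for name in sorted(files):
            src = os.path.join(folder, name)
            dst = os.path.normpath(os.path.join(dest_root, relative, name))
            pairs.append((src, dst))
    return pairs


def find_conflicts(pairs):
    """Return the destination paths that already exist and would be overwritten."""
    return [dst for _src, dst in pairs if os.path.exists(dst)]
```

Checking for conflicts before writing anything means that a run either starts with a clean destination or does not start at all, which is the safe choice when the source holds hundreds of files.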

A list dialog will ask you to choose the encoding of the source files. -- Be sure that ALL the source files are in the encoding you have chosen.

A dialog will ask you to enter the grep pattern for the file extension(s) to be used as a file filter. The default value is "\\.txt": with this pattern, only files with the extension ".txt" will be converted (note that this is NOT a glob pattern -- so it should not be "*.txt", etc.). If you enter, for example, "\\.txt|\\.html", files with the extension ".txt" or ".html" will be converted. You can also enter "all" (in lowercase) in this dialog: this makes the utility convert every (non-binary) file found in the source folder and in its sub-folders.
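
To illustrate how such a pattern differs from a glob, here is a small Python sketch. The function name make_filter is hypothetical. Anchoring the pattern at the end of the file name is my assumption about the intended behaviour, and so is skipping the check for binary files under "all". The patterns are written as Python raw strings, with a single backslash.

```python
import re


def make_filter(pattern):
    """Return a function telling whether a file name passes the extension filter."""
    if pattern == "all":
        # Assumption: "all" accepts every file (no binary check in this sketch).
        return lambda name: True
    # Assumption: the pattern must match at the end of the file name.
    regex = re.compile(r"(?:" + pattern + r")$")
    return lambda name: regex.search(name) is not None


keep = make_filter(r"\.txt|\.html")
print([n for n in ["a.txt", "index.html", "a.txt.bak", "photo.jpg"] if keep(n)])
```

In a regular expression the dot must be escaped to mean a literal ".", and "|" separates alternatives; a glob such as "*.txt" would not be a valid pattern here.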

Another dialog will ask you whether you want to do a "further conversion". Two "further conversions" are available for now: the "html-entity" conversion and the "html" conversion. You will be asked to choose one of the two options, or simply to press "No".
The "html-entity" conversion will convert, for example, "&eacute;" to "é"; it will convert "&#19968;" to "一", or "&#x15A;" to "Ś". -- Forms such as "&eacute;" and "&agrave;" are called "Latin-1 entities" (see the link for a complete list); forms such as "&#19968;" and "&#2000;" are Unicode decimal representations, and forms such as "&#x15A;" and "&#x15F;" are Unicode hexadecimal representations. These are standard representations that are (or should be) supported on any platform. There may be other, more private representations: for example, CBETA files use "&rdotblw;" to represent "ṛ" (i.e., in the Unicode standard, "&#x1E5B;"), or "&M003255;" to represent "叶" (i.e. "&#x53F6;" in the Unicode standard) [the latter represents the Mojikyo character number]. At this time, these private representations are not supported.
The "html-entity" conversion is especially useful if your source files are html files. It is also useful because you can represent any Unicode character with an entity and have the conversion script turn it into Unicode. Thus, you can "embed" html entities in a text file in Shift_JIS, for example, to represent Asian transliteration characters (or any other characters), and get them right in the converted UTF-8 file. For more details on the html entity conversion, please have a look at my other page, East Asian Diacritical Fonts and Unicode.
The other "further conversion" is the "html" conversion. It will convert files to UTF-8 and to html (the file extension will be changed to ".html"). I explain the usefulness of the "html" conversion below.
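
The "html-entity" conversion could be sketched as follows in Python. This is an illustration under my own assumptions, not the routine from the Perl script: the function name decode_entities is hypothetical, and the table of named entities is the one shipped with Python (html.entities.name2codepoint), which may differ from the list the utility uses. Unknown names, such as the private CBETA forms, are left untouched.

```python
import re
from html.entities import name2codepoint

# Matches named (&eacute;), decimal (&#19968;) and hexadecimal (&#x15A;) forms.
ENTITY = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace standard entities with the characters they stand for."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                return chr(int(body[1:]))
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or impossible code point: keep the text as it was.
            return match.group(0)
    return ENTITY.sub(replace, text)


print(decode_entities("caf&eacute; &#19968; &#x15A; &rdotblw;"))
```

Note that this sketch also decodes "&amp;" and "&lt;"; a tool meant for html files might prefer to leave those markup-significant entities alone.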

A final dialog will ask you whether you want to convert the end-of-line characters to Unix end-of-line characters. As is well known, Windows, Classic Mac OS and Unix (Mac OS X) use different end-of-line characters: Windows uses cr + lf, Classic Mac OS uses cr, and Unix uses lf. If you press the "Yes" button in this dialog, your converted files will have Unix end-of-line characters (wherever the original files come from). -- If you intend to use Unix grep to search in the converted files, it is a good idea to convert the end-of-line characters. See my other page, unix_grep on OS X.
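
The end-of-line conversion itself is simple; a minimal Python sketch (the function name to_unix_eol is hypothetical, and this is not the code of the Perl script) would be:

```python
def to_unix_eol(text):
    """Convert Windows (cr + lf) and Classic Mac OS (cr) line endings to Unix (lf)."""
    # Replace cr + lf first; otherwise each Windows line ending would
    # turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(to_unix_eol("windows\r\nclassic mac\runix\n")))
```

The order of the two replacements matters, which is why the cr + lf pair is handled before the lone cr.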

Once you have made these settings, the actual conversion will begin. If there are many big files to convert, this process may take some time...

Why convert files to UTF-8...:

The whole of the following section is now (2007-01-15) obsolete, since I have written a new utility which can be used as a user interface for GNU grep: please look at my page unix_grep on OS X. However, I leave this section as it is, for historical reasons.
As you may have guessed, I wrote this utility having in mind especially Mac OS X (and particularly OS 10.4.x and later) users who would need to use CBETA or SAT Buddhist etext files. These users know by experience that CBETA data come with some utilities which work only on Windows machines. On the other hand, CBETA files are in Big5-ETen encoding, which is not supported on Classic Mac OS (Traditional Chinese encoding supported in Classic Mac OS is Big5; the character set Big5-ETen contains some supplementary characters not included in Big5...).
If you use Classic Mac OS (and if you have the Japanese and/or Chinese language kits installed), you can use CBETA and/or SAT files without TOO much trouble. You can use MgrepApp to search in these files (when you launch MgrepApp, it says that the beta version has expired; but you can dismiss this warning by pressing the OK button and closing the little window). However, if you mainly use OS X, and if you prefer not to launch Classic OS, then even though you can open the data files without problem with TextEdit, Jedit X or other editors, there are almost no good utilities that can search in these files. With Mac OS 10.4.x (Tiger), we have a new situation: there is Spotlight, the "super powerful" indexing tool with which -- it is claimed -- it is possible to find any file on your hard disk. Unfortunately, this is not quite true for these files, because Spotlight seems to index only html, rtf, xml files... (as to plain text files [*.txt files], it indexes *perhaps* UTF-16 files, but not UTF-8 files!). This situation may be all the more frustrating for researchers who use OS 10.4.x and who are still unable to search in these Buddhist etext files. So, this may be an important reason why you would want to convert these files into Unicode.
There is another reason as well: with OS X, we have a Unix OS, with many powerful Unix tools -- one of which is GNU grep. With Tiger, we have grep version 2.5.1, which is very fast and very powerful -- but it only supports UTF-8 files...! If you have your Buddhist etext files in UTF-8 encoding, you can use grep to search in these files. For example, you can do something like the following in Terminal:
% egrep -R --include='*.txt' -Hn "摩訶迦羅天" ~/Documents/cbeta/app1_utf8/*
and get the result:
/Users/me/cbeta/app1_utf8/T18/T18n0852a.txt:1302:T18n0852ap0123c11(00)║　摩訶迦羅天　　多聞虛心合
/Users/me/cbeta/app1_utf8/T18/T18n0852b.txt:1167:T18n0852bp0140b26(00)║　摩訶迦羅天　　多聞虛心合
...
/Users/me/cbeta/app1_utf8/T21/T21n1287.txt:128:T21n1287_p0356c13(05)║理趣釋云七母女天者是摩訶迦羅天眷屬。可居東北方。
in less than one minute (on my iBook with a 1.42 GHz PowerPC G4) [depending on your Terminal settings, etc., you may not get this result; but I think you will get something similar if you redirect the result of your grep search to a text file, for example with something like:
% egrep -R --include='*.txt' -Hn "摩訶迦羅天" ~/Documents/cbeta/app1_utf8/* > ~/Desktop/grep_res.txt
].
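For readers who would rather script such a search than type it in Terminal, the same idea (walk the folders, keep only ".txt" files, report file name and line number as grep -Hn does) can be sketched in Python. This is only an illustration; the function name search_tree is hypothetical, and it does a plain substring search rather than a regular expression search.

```python
import os


def search_tree(root, needle, suffix=".txt"):
    """Yield (path, line_number, line) for every line containing needle."""
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(suffix):
                continue
            path = os.path.join(folder, name)
            # The files are assumed to be UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")
```

A call such as `for path, number, line in search_tree(folder, "摩訶迦羅天"): print(f"{path}:{number}:{line}")` prints results in the same path:line:text layout as the grep output above.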
Now, of course, it is not easy to use Terminal for this kind of search, but it is still very interesting to know that this kind of thing is possible. This may mean that it would not be VERY difficult to write a utility (or utilities) that would use Unix grep to search in UTF-8 files...
As for Spotlight search, you can use the html conversion if you want (but this conversion takes more time).
I hope that all this explains why I wrote this utility. I would NOT recommend trying to convert all the CBETA files right away. First, try with some small files copied into a new folder. If this works, try again with other files..., and once you are really sure that it works, try to convert, for example, only the app1 folder (this is the best format for searching)...
It may take hours (?) to convert all the files (of the app1 folder) [but I could convert 155 files of T01 and T02 in less than one minute...]; and Spotlight may work for more hours to index these files. But once the indices are built, you will be able to use Spotlight to find the files that contain the searched word (be warned: Spotlight is NOT perfect; its search is never COMPLETE!)... And hopefully, it will be possible to write the grep utility (or utilities) to which I alluded above. [...Which I have done: see unix_grep on OS X.]

Download

The package contains three items:

batch_convert2utf8.app -- the AppleScript droplet: this is the actual application onto which you will drag and drop your folder.
batch_convert2utf8.pl -- this is the Perl script which is contained in the AppleScript droplet, and which does the actual conversion. I put this script in the package only to make it easy to examine, for users who are interested in this kind of thing...
ReadMe.rtf -- this file has the same content as this web page.

Please download the package batch_convert2utf8.zip from this link (44 KB).
You will find the latest version of this utility at
http://www.bekkoame.ne.jp/~n-iyanag/researchTools/batch_convert2utf8.html
I would welcome any feedback, bug report or suggestion. 2005.10.12

This page was last built with Frontier on a Macintosh on Thu, Jan 11, 2007 at 1:01:57 PM. Thanks for checking it out! Nobumi Iyanaga