In our research work in the fields of Asian humanities, we often have to deal with electronic texts of different kinds. Since many classical texts are now available as electronic files, there is a growing need for tips and tools for processing texts. I would like to gather here some tips and tools that can be useful for people working in these fields.
I should first describe my personal working environment. I am a user of a Macintosh (PowerMac 7600/132 with a 180 MHz upgrade card), with a Japanese system (7.6.1 currently) and the Chinese Language Kit (Traditional Chinese, or Big5, and Simplified Chinese, or GB). I deal mainly with Japanese texts (in Shift-JIS code), but I also sometimes write English or French texts, and much more rarely I must deal with Chinese texts (in Big5 or GB codes).
My main word-processor is Nisus Writer (now at version 5.1.2), which is "WorldScript savvy" and has a powerful macro language. I use it with a scripting tool called UserLand Frontier. I also sometimes use an editor called Jedit 2.0, and I have MacJPerl (a Japanese localized version of MacPerl) installed. I have many Buddhist e-text files on my hard disk, and some CD-ROMs of Buddhist e-texts as well. I think I have some knowledge of the Nisus macro language and Frontier scripting; unfortunately, I am an eternal beginner in Perl scripting -- although Perl is the most powerful tool when one must deal with text processing.
While there seem to be many good research tools in the MS-DOS, Windows or UNIX environments, we Macintosh users suffer from the lack of such tools. For example, there is no standard grep utility in the "Macintosh world"; there are very few editors with good macro capabilities, etc. But at the same time, the Mac OS has some good features, such as WorldScript or AppleEvents, with which we can do things that are probably difficult in MS-DOS or Windows.
- I would first like to show how I use Nisus Writer, Frontier and MacPerl (or MacJPerl) in combination. Nisus Writer may be replaced with some other scriptable word-processor or editor (for example Jedit 2), and one could perhaps use AppleScript instead of Frontier; what is essential is the combination itself (go to MacPerl with Frontier and Nisus Writer).
- Here are some MacPerl droplets that can be used for specific purposes: for example, to split large files into parts that you specify, to concatenate split files, or to make vocabulary indexes of texts. I wrote some of them; others were written by better Perl scripters. You will find a detailed ReadMe for each of them. (Sorry! These pages are still to be done...)
Here, however, is a page presenting two droplets which allow you to create a Vocabulary index with Perl.
- I rewrote these scripts to make them work on OS X with Unicode text files. Please have a look at Vocabulary index with Perl for OS X. (Uploaded 2010-01-15).
- I wrote a script to convert Simplified Chinese characters to Traditional Chinese. It has no user interface, but can be used from Terminal (or used as a helper script for a Jedit X script, or a macro for Nisus Writer Pro...). Please have a look at Simplified to Traditional Chinese. (Uploaded 2010-01-02).
- I wrote a suite to be used with a CD-ROM of Buddhist texts named TendaiCD-2 (Tendai CD 2 Suite). The package also includes a very neat Perl module, written by Gokuaku-san (based on a script written by Hayashi Hiroshi-san, with some code written by ajif-san) and ported to MacJPerl by myself. This package requires Tex-Edit Plus, MacJPerl, Tanaka's OSAX 2.0, Choose Files & Folders OSAX and MgrepApp. The main engine consists of AppleScript scripts. [This suite is a "superset" of the old TendaiCD-1 suite, which I will no longer publish.]
- Seven years after TendaiCD2, a new Tendai e-text collection, TendaiCD3, was published in May 2007. I wrote a new macro/script set for this CD (and for TendaiCD2 as well); I hope that with this macro/script set, Mac users will be able to use these CDs -- an invaluable resource for the study of early Japanese Tendai. Please have a look at TendaiCD3_suite (2007.06.04).
- Some old diacritical fonts, such as Appeal, Hobogirin or ITimesSkRom, have not worked correctly in OS X applications since I upgraded from OS 10.4.7 to 10.4.8 (or perhaps from 10.4.8 to 10.4.9 -- I don't remember exactly). I investigated these problems, and propose some solutions. I also distribute newly generated versions of Hobogirin and ITimesSkRom. Please have a look at the page Font Problems in OS 10.4.x and later (2007.06.24).
- Here is another little goody: a table of correspondence between JIS characters and their pronunciation in pinyin, and between Unicode Han characters and their pinyin pronunciation (the page is in Japanese). You can download the tables along with two MacJPerl scripts and an AppleScript for the editor Style, with which you can easily get the pinyin pronunciation of JIS characters that you type in Style; there are also two Perl scripts for TextExtras and Nisus Writer Express. (Updated 04-09-19)
- You will find a page on the current situation of support of Unicode on the Mac platform (as of July 2000); there are links to important information, programs and converters. There are also some scripts (in MacJPerl, AppleScript) that can be downloaded.
- I wrote a number of conversion tables between some of the most widely used diacritical fonts for the transliteration of East Asian languages and Unicode (characters like vowels with macrons, s with acute, etc.), and some converting scripts. The page East Asian Diacritical Fonts and Unicode presents these tables and tools. (Uploaded Dec. 19 2000)
This page was updated on July 31 2001. It now contains some more conversion tables, and some fonts (Appeal and ITimesSkRom).
- I uploaded a page entitled Language Tag and Unicode conversion. The problem I try to solve here is: how to convert to Unicode a multilingual text containing chunks of text written with some special diacritical font(s), such as Appeal, Hobogirin, etc. (Uploaded on July 22 2002)
- Another page entitled From Unicode to Styled Text conversion describes a tip with which you can easily convert Unicode text files into legacy encoding files, with some special diacritical font(s) for transliteration of East-Asian text. (Uploaded on December 29 2002)
- I made a Kotoeri dictionary for OS 9.2 and OS 10.2 from the Kotoeri Zen Dictionary that was included in ZenBase CD1. I added a Kotoeri dictionary for Kuten Code conversion for OS 9.2 and 10.2. I wrote a page entitled Installing Kotoeri IRIZ Dictionaries. (Uploaded 30.4.2003)
- I wrote a page named Migrating from Classic Nisus Writer to OS X NW-Express, to present some tips and macros/scripts that will (hopefully) help people to migrate from Classic Nisus Writer files to the new Nisus Writer Express files. My main concern is (as always...) the problem of non-standard diacritical fonts. Another problem for which I tried to present a solution is the lack of a footnote feature in NW-Ex. Perhaps Nisus Writer Express is not (yet??) good enough for us to want to migrate to it, but I thought it might be interesting to try already at this point. (Uploaded 13.5.2003)
- Uploaded Trash_Info, a tiny utility for OS X: an AppleScript script which displays the number of items in the Trash and the total size of its contents in bytes, KB, MB or GB, and asks whether you want to empty the Trash, open it, or cancel, with the warning "Are you sure you want to empty the trash?" (Uploaded 21.5.2003)
- I uploaded a new page, multiformat_nisus, presenting a macro/script set that converts Classic Nisus files to Unicode text files, Unicode tagged files, and html, TeX and Cocoa-rtf files. (Uploaded 15.12.2003)
- Uploaded the page Indexation_pl, a set of scripts that generate multi-field indices from a tagged text file. By "multi-field indices", I mean indices composed of several headings, like "Person Names", "Places", "Bibliographic References", etc. (Uploaded 17.8.2003)
- Uploaded a page Convert mbox files to UTF-8 files, presenting a set of utilities to convert mbox files created by Mail.app to UTF-8 files. This conversion may be useful to search in mails (Mail.app's search function has a bug so that it sometimes fails to find words in Japanese). (Uploaded 2004.09.11).
- Uploaded a page scriptApp_buildHelper, presenting an AppleScript droplet which helps to build AppleScript droplet/script application bundles that call "external scripts" written in Unix scripting languages (e.g. Perl, Python, etc.). (Uploaded 2004.12.14). -- Sorry, I am in the process of updating this page and its software. Please wait a while! (2005.04.29)
- Uploaded a page ASUnicodeDialogs: an AppleScript Studio application -- a helper application for AppleScript scripts, with which you can use the "display dialog" accepting Unicode input, and the "choose from list" which can display items in Unicode text. -- This page was updated, with some new information and examples [version 0.7.1].
- Uploaded a page Conversion of Word 5.x files containing Chinese text, presenting a script that converts Word 5.0/5.1 files containing Chinese text. When these files are opened by Word X or Word 2004, the Chinese text in them is garbled. With the script that can be downloaded from this page, you will be able to convert these files into a format compatible with Word X or Word 2004. (Uploaded 2006.01.19)
- Uploaded a page Classic Nisus Writer to Nisus Writer Express/Pro: an AppleScript droplet with which you can convert a bunch of Classic Nisus Writer files to Nisus Writer Express files; it supports the conversion of the special encoding font ITimesSkRom (I can add the support of other non-standard encoding fonts on request...). New in Version 0.7.2: I added the support of Appeal, Hobogirin and Norman. I added also another AppleScript droplet which will convert Appeal, Hobogirin and ITimesSkRom to Gandhari Unicode, in Nisus Writer Express files. (Uploaded 2007.06.25)
- Uploaded a page Classic Nisus Writer to Nisus Writer Pro: an AppleScript droplet with which you can convert a bunch of Classic Nisus Writer files to Nisus Writer Pro files; it supports the conversion of the special encoding font ITimesSkRom (I can add the support of other non-standard encoding fonts on request...). New in Version 0.7.3: I added the support of Appeal, Hobogirin and Norman. I also added another AppleScript droplet which will convert Appeal, Hobogirin and ITimesSkRom to Gandhari Unicode, in Nisus Writer Express files. (Uploaded 2009.07.12)
- Uploaded a page Taisho Catalog and related scripts, where you will find a file containing the Catalog of all the works of the Taisho Canon, vol. 1-85. You will also find some macros for Jedit X that can be useful when working with the Catalog file. (Uploaded 2007.07.28)
- Uploaded a page Lamotte's note on Buddhist hells, to show how links to the original texts referred to in a study can be useful for learning. (Uploaded 2008.12.13).
- Uploaded a page rtf to UTF-8 text and pTeX: a number of scripts with which you can convert rtf files (either in plain Cocoa rtf format, created by TextEdit or Jedit X, or Nisus Writer Express rtf format) to tagged UTF-8 text files, and then to pTeX files (a specialized version of LaTeX2e for Japanese language). (Uploaded 2006.06.29.)
- Uploaded a page unicode_grep on OS X: an AppleScript droplet which can be used as an interface for grep on OS X. Used in combination with Jedit X, this droplet can simulate to a certain extent the behavior of the excellent search utility for Classic Mac OS, MgrepApp. It can search the text contents of UTF-8 files very quickly. -- This page replaces the old page entitled "unix_grep on OS X", which is now obsolete. (uploaded 2011.07.10)
- Uploaded a page Make Symlink, presenting a simple AppleScript droplet which lets you easily create symbolic links to files on OS X. You can use it in combination with my search utility unix_grep.app that you will find on the page unix_grep on OS X. (uploaded 2007.01.12)
- Uploaded a page Batch Convert Files to UTF-8, presenting an AppleScript droplet which can convert files in some legacy encodings, such as MacRoman, Shift_JIS, Big5, Big5-Eten, etc., to UTF-8 text files. With this utility, you can convert CBETA or SAT files to UTF-8; the converted files can be searched with my search utility unix_grep.app that you will find on the page unix_grep on OS X. (uploaded 2007.01.12)
- Uploaded a page Conversion of Word files with diacritical fonts to Unicode, presenting an AppleScript droplet which will convert old diacritical fonts (such as Appeal, Hobogirin, etc.) to a Unicode font. (uploaded 2007.02.06)
- Uploaded a new page Two Clipboard Utilities for OS X, presenting two little clipboard utilities. Because of a shortcoming in OS X, it is often impossible or difficult to copy Japanese or Chinese text in OS X Unicode-savvy applications and paste it into a Classic application. One of my two scripts tries to avoid this problem. (Uploaded 03.11.2003)
- I uploaded a page entitled AsianExtended keylayout. You will find there a new keylayout that can be used with OS 10.2 (Jaguar) to easily type diacritical characters for transliterating text in Sanskrit, Chinese or Japanese (it can probably also be used to type text in Korean or Tibetan, but I am not sure). (Uploaded on December 25 2002).
- Uploaded on October 14 2001 a new page entitled Hobogirin Style Reference to CBETA Text. You can find there a utility which enables you to quickly look up any text from the CBETA Taisho e-text collection.
- Uploaded on October 14 2001 a new page entitled Muller's DDB Lookup System for Mac, which enables Mac users to quickly look up an entry in Charles Muller's Digital Dictionary of Buddhism.
- Uploaded on October 21 2001 a new page entitled "MacJPerl script for n-gram analyze...". It contains a simple description of what "n-gram analysis" is, and a simple MacJPerl script with which you can do n-gram analysis on the Mac (this page is in Japanese only).
- Uploaded on July 2 2011 a new page entitled "open_classic_nw_files_with_nwp". You will find there an AppleScript droplet which changes the default application of Classic Nisus Writer files to Nisus Writer Pro.
- Here is another Frontier suite which works with MacPerl: it can retrieve data from large text databases. I implemented there some very basic functionalities of a relational database. (Sorry! These pages are still to be done...)
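For readers unfamiliar with the "n-gram analysis" mentioned in the list above, here is a minimal sketch of the idea. It is written in Python purely for illustration and is not the MacJPerl script distributed on that page; the function name ngram_counts and the sample string are assumptions made up for this example. It counts every overlapping sequence of n characters in a text, ignoring whitespace, which is a common way to find recurring phrases in unsegmented Chinese or Japanese text.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count overlapping character n-grams in text.

    Whitespace (spaces, tabs, line breaks) is dropped first, so an
    n-gram may span what was a line break in the original file.
    ngram_counts is a hypothetical helper, not part of any library.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(
        "".join(chars[i:i + n]) for i in range(len(chars) - n + 1)
    )


if __name__ == "__main__":
    # Illustrative sample only; replace with the contents of a text file.
    sample = "如是我聞一時佛在舍衛國"
    for gram, count in ngram_counts(sample, 2).most_common(5):
        print(gram, count)
```

On a real corpus, one would read the file with the proper encoding (for example UTF-8) and probably strip punctuation as well; this sketch leaves punctuation in place.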
Links
Go to NI Home Page