Part of Nobumi Iyanaga's website. 1/15/10.


Vocabulary index with Perl for OS X

Here is a folder of droplets (371K to download) with which you can make an index of the vocabulary in a given text. You will find a detailed ReadMe in Japanese in the package. Here is an English version of the ReadMe (please keep this page if you don't read Japanese, because it is not included in the package itself). -- An older version of this tool was published here ten years ago (see Vocabulary index with Perl); it used MacPerl or MacJPerl and could work with legacy encodings. This new version works on OS X, with Unicode texts.

  1. First, make a text file of your document with some Unicode text editor (for example Jedit X). It must be a plain text file in UTF-8, with Unix line endings (i.e. LF). The document may be in any language (English, French, Chinese, Japanese, etc.). If the document is in Japanese or Chinese, etc., each word must be separated by a "delimiter character". This may be any character, but I would recommend using a space. If a word is divided by a carriage return (i.e. if it runs over two lines), a "continuation sign" must be put at the end of the line, to indicate that the word continues on the next line. This sign may also be any character [although it cannot be the "+" sign or other GREP-sensitive characters], but I would recommend the "=" sign. To indicate to the script the location of words, you may put down the volume, the page, the recto or verso division of the page (a to d), and the line. Finally, the last word of the document must be followed by a carriage return.

    • <V 1> ................indicates the volume. This may be omitted.
    • <P 1a> ...............indicates the page and its a, b, c, d division. The page indication cannot be omitted if you need it.
    • <L 1> ................indicates the line. If numbering begins at line 1, this may be omitted; if it begins at some other number, it must be indicated.

    Here is a sample document:

    <V 1>
    <P 1a>
    <L 1>
    Je suis étudiant à l'Université Kan=
    sai. J'ai vingt-deux ans.
    <P 2>
    <L 2>
    J'étudie la langue chinoise. J'habite à Osaka.
    J'habite avec mes parents et mon petit frère.

    <V 2>
    <P 1>
    Mon frère a dix-sept ans. Il doit passer un concours d'entrée à l'Université l'année pro=
    chaine.
    Mon père a 54 ans et ma mère a 52 ans.
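The location markers and the continuation sign described above can be modelled in a few lines. Here is a minimal sketch in Python (the tool itself is written in Perl, and `iter_located_lines` is a hypothetical name, not part of the package); it assumes the delimiter is a single literal character and that a word split by the continuation sign belongs to the line where it ends.

```python
import re

def iter_located_lines(text, delim=" ", cont="="):
    """Yield (line_text, (volume, page, line)) for each text line,
    resolving <V n>, <P na>, <L n> markers and the continuation sign."""
    vol, page, lineno = "1", "1", 1
    carry = ""  # word fragment carried over from a line ending with the continuation sign
    for raw in text.split("\n"):
        m = re.fullmatch(r"<V\s+(\w+)>", raw)
        if m:
            vol = m.group(1)
            continue
        m = re.fullmatch(r"<P\s+(\w+)>", raw)
        if m:
            page, lineno = m.group(1), 1
            continue
        m = re.fullmatch(r"<L\s+(\d+)>", raw)
        if m:
            lineno = int(m.group(1))
            continue
        if not raw.strip():
            continue
        cur = carry + raw
        carry = ""
        if cont and cur.endswith(cont):
            # keep the unfinished last word for the next line
            cur, _, carry = cur[: -len(cont)].rpartition(delim)
        if cur:
            yield cur, (vol, page, lineno)
        lineno += 1
```

Feeding it the beginning of the sample document above yields the first line without the fragment "Kan", and the rejoined word "Kansai." on the second line.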

  2. You must put the preference file "voc_freq_pref.txt" in the same folder as the droplet(s). Then you must edit it to indicate:
    1. the delimiter character;
    2. the continuation sign;
    3. whether the output should be "case sensitive" (0) or not (1).

    Each of these must be entered between < and > after "delimiter[tab]", "continuation sign[tab]" and "ignorecase[tab]". Here is the default setting:

    delimiter   < >
    continuation sign   <=>
    ignorecase   <0>

    This is the default setting which will be used if the file "voc_freq_pref.txt" is not in the same folder as the droplets.
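The way such a preference file can be read is sketched below in Python (the droplets themselves use Perl; `read_prefs` and `DEFAULTS` are illustrative names). Missing entries, or an absent file, fall back to the defaults above.

```python
import os
import re

DEFAULTS = {"delimiter": " ", "continuation sign": "=", "ignorecase": "0"}

def read_prefs(path="voc_freq_pref.txt"):
    """Read 'key<TAB><value>' lines; fall back to DEFAULTS if the file is absent."""
    prefs = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                # key and <value> are separated by one or more tabs
                m = re.match(r"([^\t]+)\t+<(.*)>", line.rstrip("\n"))
                if m and m.group(1) in DEFAULTS:
                    prefs[m.group(1)] = m.group(2)
    return prefs
```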

    Here is the typical setting for a document of Roman text:

    delimiter   < |\.|\?|!>
    continuation sign   <>
    ignorecase   <1>

    As this example shows, you may use several delimiter characters. They must be separated by the "|" sign. The period (".") and the question mark ("?") must be "escaped" with the "\" sign because of the rules of regular expressions.
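The same "|"-joined pattern works as an alternation in any regular-expression engine. For instance, with Python's `re` module (the actual scripts are Perl, where the pattern is used directly; `split_words` is an illustrative name), the Roman-text setting above splits a line into words like this:

```python
import re

def split_words(line, delimiter=r" |\.|\?|!"):
    """Split a line on any of the '|'-joined delimiters, dropping empty pieces."""
    return [w for w in re.split(delimiter, line) if w]
```

For example, `split_words("Je suis étudiant. J'ai vingt-deux ans!")` yields the six words with the punctuation removed.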

  3. Finally, drag and drop your file(s) onto the droplet. The difference between the two droplets is that one processes a single file at a time, while the other processes a folder containing several files at once. Consequently, the former produces one output file for each input file, while the latter produces only one output file for all the files. The output file will have the name of the input file plus "_idx.txt", and will be in the same folder as the input file. If a file of the same name already exists in that folder, the output file will get serial numbers like "_idx_1.txt", "_idx_2.txt", etc.

    Here is an example of output:

    idx result of the file /Users/ni/Desktop/test_things/

    52:1   (2-1-3)
    54:1   (2-1-3)
    Il:1   (2-1-1)
    J'ai:1   (1-1-2)
    J'habite:2   (1-2-2, 1-2-3)
    J'étudie:1   (1-2-2)
    Total words: 40

    The format is:

    idx result of the file file_name

    word:frequency   (volume-page-line, volume-page-line, ...)
    Total words: number_of_words

    The file_name will be indicated only if you use the folder droplet, whose single output file combines the results of several input files; it will be omitted if you use the single-file droplet.
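Putting the counting and the locations together, the body of the output can be sketched as follows (Python again; `build_index` is an illustrative name, and the tab layout only approximates the example above):

```python
from collections import defaultdict

def build_index(tokens, ignorecase=False):
    """tokens: iterable of (word, 'volume-page-line') pairs.
    Returns one 'word:frequency\t(locations)' line per distinct word,
    sorted alphabetically, plus the final total."""
    occ = defaultdict(list)
    total = 0
    for word, loc in tokens:
        if ignorecase:
            word = word.lower()
        occ[word].append(loc)
        total += 1
    lines = [f"{w}:{len(locs)}\t({', '.join(locs)})" for w, locs in sorted(occ.items())]
    lines.append(f"Total words: {total}")
    return "\n".join(lines)
```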

These scripts were developed at the request of Uchida Keiichi (keiuchid[at]) and Nobumi Iyanaga ( by a person who wished to remain anonymous. Some parts have been modified by N. Iyanaga.

I rewrote them to adapt them to Unix Perl, with AppleScript droplets as a user interface, but the core engine is the same as in the scripts written ten years ago. Each of the two droplets runs its own Perl script internally.

I added another droplet that sorts the result files of a Chinese index by the pinyin values of the Chinese words. It contains pinyin data for 25476 characters of Unicode 4.1 (when a character has more than one pronunciation, only the first one is used). All lines beginning with a character between "\x{0020}" (a space) and "\x{318E}" ("HANGUL LETTER ARAEAE") are ignored; lines beginning with other characters (most of them Chinese [or Korean or Japanese] characters) are split in two at the separator ":", the first part is looked up in the pinyin data table, and the lines are sorted in the alphabetical order of the pinyin values of the words. If the lines do not contain the separator character ":", the script will not work properly, so it should be used only on the result files of the indexing scripts given here.

The result file will be in the same folder as the input file, and will be named "[original_name]_sorted.txt".
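The sorting logic just described can be sketched like this (Python; the toy `table` used in the example stands in for the real 25476-entry pinyin data, which is not reproduced here, and `sort_by_pinyin` is an illustrative name):

```python
def sort_by_pinyin(lines, pinyin):
    """Sort index lines by the pinyin of the word before the ':' separator.
    Lines starting with a character between U+0020 and U+318E are dropped,
    as in the droplet's description. pinyin: dict mapping a character to
    its (first) pinyin reading."""
    def key(line):
        head = line.split(":", 1)[0]
        # characters missing from the table sort by themselves
        return [pinyin.get(ch, ch) for ch in head]
    kept = [l for l in lines if l and not ("\u0020" <= l[0] <= "\u318e")]
    return sorted(kept, key=key)
```

For example, with a toy table `{"我": "wo3", "你": "ni3", "他": "ta1"}`, the lines for 你 (ni3), 他 (ta1), and 我 (wo3) come out in that alphabetical pinyin order, and a trailing "Total words:" line is dropped because it begins with a Latin letter.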

Please write to N. Iyanaga if you have any comments, feedback, or bug reports.
Thank you very much in advance!

Fri, Jan 15, 2010



This page was last built with Frontier on a Macintosh on Fri, Jan 15, 2010 at 10:47:39 AM. Thanks for checking it out! Nobumi Iyanaga