Vocabulary index with Perl

Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 2/3/99.

Here are two droplets (25K to download) for use with MacPerl, with which you can make a vocabulary index of a given text. You will find a detailed ReadMe in Japanese in the package. Here is an English version of the ReadMe (please keep this page if you don't read Japanese, because the English version is not in the package itself).

First, make a plain-text file of your document with a text editor. The document may be in any language (English, French, Chinese, Japanese, etc.). If the document is in Japanese, I recommend using MacJPerl. If the document is in a language such as Japanese or Chinese, each word must be separated from the next by a "delimiter character". This may be any character, but I recommend a space. If a word is broken by a carriage return (i.e. if it runs over two lines), a "continuation sign" must be placed at the end of the first line to indicate that the word continues on the next line. This sign may also be any character [except the "+" sign], but I recommend the "=" sign. To tell the script where words are located, you may mark the volume, the page, the recto or verso division of the page (a to d) and the line. Finally, the last word of the document should be followed by a carriage return.

<V 1> ................indicates the volume. This tag may be omitted.
<P 1a> ...............indicates the page and its a, b, c or d division. This tag cannot be omitted if you want page references in the index.
<L 1> ................indicates the line. This tag is not necessary if the line numbering starts at 1. If it starts at some other number, that number must be indicated.

Here is a sample document:

<V 1>
<P 1a>
<L 1>
Je suis étudiant à l'Université Kan=
sai. J'ai vingt-deux ans.
<P 2>
<L 2>
J'étudie la langue chinoise. J'habite à Osaka.
J'habite avec mes parents et mon petit frère.

<V 2>
<P 1>
Mon frère a dix-sept ans. Il doit passer un concours d'entrée à l'Université l'année pro=
chaine.
Mon père a 54 ans et ma mère a 52 ans.
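
The tags above can be read as simple state updates: each tag sets the current volume, page or line, and every line of text then advances the line counter. The following is only a minimal Python sketch of that reading, not the droplets' own Perl code. The function name locate_lines is hypothetical, and three points are assumptions rather than documented behaviour: a new <P> tag restarts the line count at 1, blank lines are not counted, and the page value is kept exactly as written (e.g. "1a").

```python
import re

# Matches a whole-line tag such as "<V 1>", "<P 1a>" or "<L 2>".
TAG = re.compile(r"^<([VPL])\s+([^>]+)>$")

def locate_lines(text):
    """Yield (volume, page, line_number, text_line) for each line of text."""
    # None marks "tag not given"; what the droplets print in that case
    # is not documented, so this is only a placeholder.
    volume, page, line_no = None, None, 1
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are not counted
        m = TAG.match(line)
        if m:
            kind, value = m.group(1), m.group(2).strip()
            if kind == "V":
                volume = value
            elif kind == "P":
                page, line_no = value, 1  # assumption: a new page restarts at line 1
            else:
                line_no = int(value)  # <L n> numbers the next line of text
            continue
        yield volume, page, line_no, line
        line_no += 1

sample = """<V 1>
<P 1a>
<L 1>
Je suis étudiant à l'Université Kan=
sai. J'ai vingt-deux ans.
<P 2>
<L 2>
J'étudie la langue chinoise. J'habite à Osaka.
J'habite avec mes parents et mon petit frère.
"""

located = list(locate_lines(sample))
for volume, page, line_no, line in located:
    print(f"{volume}-{page}-{line_no}: {line}")
```

Under these assumptions the line containing "avec" is located at volume 1, page 2, line 3. The sketch does not try to rejoin words split by the continuation sign.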

You must put the preference file "voc_freq.pref" in the same folder as the droplet(s). Then you must edit it to indicate:

delimiter character
continuation sign
whether the output must be "case sensitive" (0) or not (1).

Each of these must be entered between < and > after "delimiter[tab]", "continuation sign[tab]" and "ignorecase[tab]". Here is the default setting:

delimiter   < >
continuation sign   <=>
ignorecase   <0>

This setting will be used if the file "voc_freq.pref" is not in the same folder as the droplets.
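
To make the layout of the preference file concrete, here is a minimal Python sketch (not the droplets' own Perl code) that reads lines of the form "key[tab]<value>" and falls back to the default setting for anything missing. The names DEFAULTS, PREF_LINE and parse_prefs are hypothetical, and the sketch assumes that each setting sits on its own line.

```python
import re

# Default setting as given in this ReadMe.
DEFAULTS = {"delimiter": " ", "continuation sign": "=", "ignorecase": "0"}

# A key, a tab, then the value between "<" and the last ">" on the line.
PREF_LINE = re.compile(r"^(delimiter|continuation sign|ignorecase)\t<(.*)>\s*$")

def parse_prefs(text):
    """Return the settings found in text, with defaults for missing keys."""
    prefs = dict(DEFAULTS)
    for line in text.splitlines():
        m = PREF_LINE.match(line)
        if m:
            prefs[m.group(1)] = m.group(2)
    return prefs

# An empty string stands in for a missing voc_freq.pref file.
print(parse_prefs(""))
print(parse_prefs("ignorecase\t<1>\n"))
```

Values are kept as strings, so an empty continuation sign ("<>") comes back as an empty string.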

Here is the typical setting for a document of Roman text:

delimiter   < |\.|\?|!>
continuation sign   <>
ignorecase   <1>

As this example shows, you may use several delimiter characters. They must be separated from one another by the "|" sign. The period (".") and the question mark ("?") must be "escaped" with the "\" sign, because they have a special meaning in regular expressions.
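
The effect of such a delimiter setting can be tried out with a few lines of Python. This is only a sketch: it uses Python's re module in place of Perl's regular expressions (this particular pattern behaves the same way in both), the function name split_words is hypothetical, and folding words to lower case when ignorecase is 1 is an assumption about how the droplets ignore case.

```python
import re
from collections import Counter

def split_words(line, delimiter=r" |\.|\?|!", ignorecase=True):
    """Split a line on the delimiter pattern and drop empty pieces."""
    words = [w for w in re.split(delimiter, line) if w]
    # Assumption: ignoring case is modelled by lower-casing every word.
    return [w.lower() for w in words] if ignorecase else words

line = "J'habite avec mes parents et mon petit frère. Mon frère a dix-sept ans."

counts = Counter(split_words(line))
print(counts["frère"], counts["mon"])

# With case sensitivity kept, "Mon" and "mon" are counted separately.
strict = Counter(split_words(line, ignorecase=False))
print(strict["Mon"], strict["mon"])
```

Because neither the apostrophe nor the hyphen is listed as a delimiter here, "J'habite" and "dix-sept" each stay a single word.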

Finally, drag and drop your file(s) onto the droplet. The difference between the two droplets, index.pl and idx.pl, is that the former processes one file at a time, while the latter processes several files, or a folder containing several files, together. Consequently, index.pl produces one output file for each input file, while idx.pl produces a single output file for all the input files. By default, the output file has the name of the input file plus ".idx" and is placed in the same folder as the input file. If the name of the input file has more than 26 characters, its last part will be truncated.

Here is an example of output:

idx result of 1 files:

Macintosh HD:Desktop Folder:Developmt:idx.pl:French test

52:1   (French test:2-1-3)
54:1   (French test:2-1-3)
a:3   (French test:2-1-1, French test:2-1-3, French test:2-1-3)
ans:4   (French test:1-1-2, French test:2-1-1, French test:2-1-3, French test:2-1-3)
avec:1   (French test:1-2-3)
chinoise:1   (French test:1-2-2)
..............
total words: 37

The format is:
The_word:number_of_occurrences[tab]([file_name:]volume-page-line)

and

total words: number_of_words

The file_name will be indicated only if you use idx.pl; it will be omitted if you use index.pl.
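
If you want to process the index further, a line in this format can be taken apart again. The following Python sketch is one possible way, not part of the package: the names ENTRY, LOC and parse_entry are hypothetical, and it assumes that locations are separated by a comma and a space (as in the example output) and that file names contain no such sequence.

```python
import re

# word:count[tab](locations)
ENTRY = re.compile(r"^(?P<word>.+):(?P<count>\d+)\t\((?P<locs>.*)\)$")
# [file_name:]volume-page-line
LOC = re.compile(
    r"^(?:(?P<file>.+):)?(?P<vol>[^-:]*)-(?P<page>[^-:]*)-(?P<line>[^-:]*)$"
)

def parse_entry(line):
    """Return (word, count, [(file, volume, page, line), ...]) or None."""
    m = ENTRY.match(line)
    if m is None:
        return None
    locations = []
    for item in m.group("locs").split(", "):
        loc = LOC.match(item)
        if loc:
            locations.append(
                (loc.group("file"), loc.group("vol"),
                 loc.group("page"), loc.group("line"))
            )
    return m.group("word"), int(m.group("count")), locations

entry = parse_entry(
    "ans:4\t(French test:1-1-2, French test:2-1-1, "
    "French test:2-1-3, French test:2-1-3)"
)
print(entry)

# Without a file name, as index.pl would write it; the file slot is None.
print(parse_entry("avec:1\t(1-2-3)"))
```

Volume, page and line are kept as strings, since a page may carry an a to d division.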

These scripts were developed at the request of Uchida Keiichi (keiuchid@pp.iij4u.or.jp) and Nobumi Iyanaga (n-iyanag@ppp.bekkoame.ne.jp) by a person who wishes to remain anonymous. Some parts have been modified by N. Iyanaga.

Please write to N. Iyanaga if you have any comments, feedback or bug reports.
Thank you very much in advance!

Wed, Feb 3, 1999

This page was last built with Frontier on a Macintosh on Wed, Feb 3, 1999 at 18:40:55. Thanks for checking it out! Nobumi Iyanaga