Indexation

Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 8/18/03.

Indexation_pl

-- Tool to generate indices from tagged source --

© Nobumi Iyanaga 2003.8
version 0.7

Introduction
I wrote a set of scripts that generate multi-field indices from a tagged text file. By "multi-field indices", I mean indices composed of several headings, like "Person Names", "Places", "Bibliographic References", etc. These scripts may be useful for other researchers, so I wanted to share them with others.

Contents of the Package:

Indexation_pl:
    bin:
        indexation.pl
    examples:
        Jien_original
        Jien_tagged
        Jien_tagged-utf8-idx.txt
        Jien_tagged-utf8.txt
        Jien_tagged.idx
        Jien_tagged_edited.idx
    free_tagging macro      Classic Nisus Writer macro file
    indexation.app          AppleScript droplet
    indexation.dp           MacPerl droplet
    insert_free_tag.scpt    AppleScript macro for Nisus Writer Express
    ReadMe

How to Install
The main scripts are the MacPerl droplet "indexation.dp" and the Perl script "indexation.pl". They do almost the same thing.
If you use OS X, put the folder "bin", with the Perl script "indexation.pl" inside it, in your HOME directory. If a folder named "bin" already exists there, put only the Perl script "indexation.pl" in it.
The MacPerl droplet needs MacPerl itself. You can download it from <http://www.perl.com/CPAN/ports/index.html#mac>.
You can place the MacPerl droplet "indexation.dp" and the AppleScript droplet "indexation.app" wherever you like; choose a location that is easy to reach for drag-and-drop.
If you use Classic Nisus Writer, please put the Nisus macro file "free_tagging macro" in your "Macros" folder, inside the "Nisus Writer Tools" folder, inside the "Nisus® Writer" folder.
If you use Nisus Writer Express, please put the AppleScript macro named "insert_free_tag.scpt" in the /Users/[your_account]/Library/Application Support/Nisus Writer/Macros/ folder.

You will need my font, "ITimesSkRom", to read this ReadMe and the example files. Please download it from:
<http://www.bekkoame.ne.jp/~n-iyanag/researchTools/diacritical_fonts_download/ITimesSkRom.sit.hqx>.

How to use this system
You will find in the "examples" folder a file containing a short extract from an article I published: “Ḍākinī et l’Empereur — Mystique bouddhique de la royauté dans le Japon médiéval —” (published in the Italian journal VS (Versus), no. 83/84 — Quaderni di studi semiotici, maggio-dicembre 1999 —, special issue with the title “Reconfiguring Cultural Semiotics: The Construction of Japanese”, edited by Fabio Rambelli and Patricia Violi, p. 41-111; see Dakini Table of Contents). The original text is quoted in the file "Jien_original"; and this text is tagged for index in the file "Jien_tagged". Comparing the two files, you will find how I tagged up this text.

I have:
1. Added page number tags. [These page numbers are arbitrary; they bear no relation to the page numbers in the original journal.]
2. Inserted different tags for indexing [hereafter, "indexing tags"]. I used the following tags:
<author>, <date>, <keyword>, <name>, <ref_book>, <sect>, <title>
I also added some tags for other purposes:
<paragr>, <fn>
These tags have no effect on the index, with one exception: tags written in UPPER CASE. Look at the tagged text:
<title>Gukan-shō 愚管抄</title>
and the generated index:
Gukan-shō 愚管抄 152
As this example shows, tags whose names contain UPPER CASE characters (or the character "&"), and which appear inside text enclosed by indexing tags (here, <title> and </title>), are preserved in the generated index. You can thus use upper-case style tags, and you can also insert Unicode entities written as tags, like <&#12345;>.
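The rule above can be sketched in Python. This is only an illustration, not the code of indexation.pl: the function name keep_style_tags is made up, and <I> is a hypothetical upper-case tag chosen for the example.

```python
import re

# A tag is "<", an optional "/", a name, optional further content, then ">".
TAG = re.compile(r'<(/?)([^\s<>="]+)[^<>]*>')

def keep_style_tags(text):
    """Drop tags inside an index entry unless the tag name contains
    an upper-case letter or "&" (assumed reading of the rule)."""
    def replace(match):
        name = match.group(2)
        # Keep the whole tag when its name has an upper-case letter or "&".
        return match.group(0) if re.search(r"[A-Z&]", name) else ""
    return TAG.sub(replace, text)

# <I> is a hypothetical style tag; <fn> is one of the non-indexing tags.
print(keep_style_tags("<I>Gukan-shō</I> 愚管抄<fn>"))
```

Only the tag name is tested, so an upper-case letter inside a tag value such as <name="Fujiwara ..."> does not cause that tag to be kept.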
I used the Nisus Writer macro named "free-tagging" to insert these tags. It is easy to use: first, assign a keyboard shortcut to this macro (I use "CMD + TG"). To mark up a text, select it, then run the macro. A dialog will ask you to enter a "tag abbreviation". A number of abbreviations are already defined:
bk    book
ag    age
nm    name
pg    paragr
pl    place
qt    quote
rt    ritual
tt    title
dt    date
ps    person
tg    thing
rb    RUBI
ct    chapter_title
st    sect
pr    period_name
at    author
rf    ref_book
kw    keyword
You can add or remove any pair of tag and tag abbreviation, and customize the macro as you want. At the dialog asking for a tag abbreviation, type, for example, "nm", and the selected text will be marked up as <name>xxxx</name>. This macro works with multiple non-contiguous selections, so you can select several pieces of text and mark them all up at once. You can also select more than one paragraph at once.
If you have forgotten the abbreviations, you can enter any tag name preceded by "n-". For example, you can enter "n-blockquote" to insert the tags "<blockquote>" and "</blockquote>".
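The behaviour of the tagging dialog can be sketched as follows. This is a Python illustration under stated assumptions, not the Nisus macro itself: the names resolve_tag and wrap are hypothetical, and the real macro may store its abbreviation table differently.

```python
# Abbreviation table copied from the list above.
ABBREVIATIONS = {
    "bk": "book", "ag": "age", "nm": "name", "pg": "paragr", "pl": "place",
    "qt": "quote", "rt": "ritual", "tt": "title", "dt": "date", "ps": "person",
    "tg": "thing", "rb": "RUBI", "ct": "chapter_title", "st": "sect",
    "pr": "period_name", "at": "author", "rf": "ref_book", "kw": "keyword",
}

def resolve_tag(entry):
    """Turn what was typed in the dialog into a tag name."""
    if entry.startswith("n-"):
        return entry[2:]            # "n-blockquote" gives "blockquote"
    return ABBREVIATIONS[entry]     # KeyError for an unknown abbreviation

def wrap(selection, entry):
    """Enclose the selected text in the opening and closing tag."""
    tag = resolve_tag(entry)
    return f"<{tag}>{selection}</{tag}>"

print(wrap("Jien 慈圓", "nm"))
print(wrap("some quoted text", "n-blockquote"))
```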

I wrote another small AppleScript macro for Nisus Writer Express: "insert_free_tag.scpt". To use it, please put it in the "Macros" folder, inside the "Nisus Writer" folder, inside your ~/Library/Application Support/ folder:
/Users/[your_account]/Library/Application Support/Nisus Writer/Macros/insert_free_tag.scpt
Unfortunately, it is not as good as the macro for Classic Nisus Writer: there is no "abbreviation" feature, so you must enter the complete tag name at the dialog, and it does not support non-contiguous selections.

You can insert multi-level indexing tags. Example:
<ref_book><author>Taga Munehaya 多賀宗隼</author>, 1959</ref_book>
This will be indexed as:
<author>:
Taga Munehaya 多賀宗隼 180
....
<ref_book>:
Taga Munehaya 多賀宗隼, 1959 180
...
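One way to picture multi-level tagging is a stack of open tags, where each piece of plain text belongs to every tag that is currently open. The Python sketch below is an assumption about how this could work, not the logic of indexation.pl; collect_entries is a hypothetical name, the page number is passed in as a fixed value, and tag values (<tag="...">) are left out for brevity.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"<(/?)([a-z_]+)>")

def collect_entries(text, index_tags, page):
    """Return {tag: [(entry_text, page), ...]} for nested indexing tags."""
    entries = defaultdict(list)
    open_tags = []                    # stack of [tag_name, text_parts]
    pos = 0
    for m in TOKEN.finditer(text):
        chunk = text[pos:m.start()]
        for item in open_tags:
            item[1].append(chunk)     # plain text belongs to every open tag
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if name not in index_tags:
            continue                  # non-indexing tags are skipped
        if not closing:
            open_tags.append([name, []])
        elif open_tags and open_tags[-1][0] == name:
            _, parts = open_tags.pop()
            entries[name].append(("".join(parts), page))
    return dict(entries)

sample = "<ref_book><author>Taga Munehaya 多賀宗隼</author>, 1959</ref_book>"
print(collect_entries(sample, {"ref_book", "author"}, 180))
```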

Page number tags may be inserted inside a tagged text. Example:
<keyword="pouvoir*crise de, à la fin Heian">époque extrêmement
<p. 151>
critique</keyword>
This will be indexed as:
<keyword>:
pouvoir
 crise de, à la fin Heian 150-151
Page numbers may be arbitrary. After <p. 153>, you can have <p. 180>, then <p. 153> again, etc. BUT if page number tags are inserted inside a tagged text, make sure that the page numbers there run in a logical order. For example,
<keyword="pouvoir*crise de, à la fin Heian"><p. 158>époque extrêmement
<p. 151>
critique</keyword>
would be nonsense (this would generate "crise de, à la fin Heian 157-151").
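The page range of an entry can be understood as "the page in force at the opening tag" to "the page in force at the closing tag". The Python sketch below illustrates this reading; it is an assumption, not the code of indexation.pl. The names page_spans and format_span are hypothetical, only the <keyword> tag is handled, and the sample text is given a leading <p. 150> or <p. 157> tag so that a current page exists.

```python
import re

# Either a page tag such as <p. 151>, or an opening/closing <keyword> tag.
TOKEN = re.compile(r'<p\. (\d+)>|<(/?)keyword(?:="[^"]*")?>')

def page_spans(text):
    """Return (first_page, last_page) for each <keyword> element."""
    spans, page, start = [], None, None
    for m in TOKEN.finditer(text):
        if m.group(1):            # page tag: the text that follows is on page N
            page = int(m.group(1))
        elif m.group(2):          # closing tag: the entry ends on the current page
            spans.append((start, page))
        else:                     # opening tag: the entry starts on the current page
            start = page
    return spans

def format_span(first, last):
    return str(first) if first == last else f"{first}-{last}"

good = '<p. 150><keyword="pouvoir*crise de, à la fin Heian">époque extrêmement\n<p. 151>\ncritique</keyword>'
bad = '<p. 157><keyword="pouvoir*crise de, à la fin Heian"><p. 158>époque extrêmement\n<p. 151>\ncritique</keyword>'

for text in (good, bad):
    for first, last in page_spans(text):
        note = "" if first <= last else "  (pages run backwards)"
        print(format_span(first, last) + note)
```

A check like first <= last is one simple way to spot the non-logical case before generating the index.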

Return characters have no effect on the generated index.

This system supports some special features:
1. Index As feature. An example:
<name="Fujiwara no Tadamichi 藤原忠通">Tadamichi 忠通</name>
This will be indexed as:
<name>:
Fujiwara no Tadamichi 藤原忠通 150
2. Sub-entry feature. There are two types:
a: type 1 which uses "*" as the key character. An example:
<keyword="aristocratie*pouvoir de">aristocratique</keyword>
This will be indexed as:
aristocratie
 pouvoir de 151
As you see, the first part of the string in double-quotes, before "*", will be used as the main-entry, while the last part will be used as the sub-entry.
b: type 2 uses ":" as the key character. An example:
<name=":et l’ésotérisme Tendai">Jien 慈圓</name>
This will be indexed as:
Jien 慈圓 152
 et l’ésotérisme Tendai 152
Here, the main-entry will be the text enclosed by the indexing tag, and the text after ":" in <name=":et l’ésotérisme Tendai"> will be used as the sub-entry.
Another possibility:
<keyword="histoire:théologie chrétienne de l’">théologie historique chrétienne</keyword>
This will be indexed as:
histoire 152
 théologie chrétienne de l’ 152, 152

These two types may seem similar, but internally they are quite different. In the first example of type 2, "<name=":et l’ésotérisme Tendai">Jien 慈圓</name>", the script uses "Jien 慈圓:et l’ésotérisme Tendai" as the entry (i.e. the text enclosed by the tag, "Jien 慈圓", concatenated with the tag's value, ":et l’ésotérisme Tendai"). At output time, the part "Jien 慈圓:" is replaced with a tab; this is why, through Perl's sort function, the sub-entry comes right after the main entry "Jien 慈圓". This means that if there is no other "<name>Jien 慈圓</name>" in the file, the sub-entry " et l’ésotérisme Tendai" will become an "orphan" entry.
The same applies to the second example of type 2: "<keyword="histoire:théologie chrétienne de l’">théologie historique chrétienne</keyword>". Here the script takes "histoire:théologie chrétienne de l’" as the entry (the text enclosed by the tag is not taken into account); at output time, the part "histoire:" is replaced with a tab. If there were no other main entry "histoire", this sub-entry would be an orphan entry.

By contrast, in type 1 the main entry is supplied by the tag's value itself (in the example given above, "aristocratie*pouvoir de", it is the part before "*", "aristocratie"); the character "*" is replaced with a return and a tab, and the sub-entry "pouvoir de" is printed after it. But if "aristocratie" occurs more than once, the main entry, without any page number, will be repeated each time.
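The difference between the two sub-entry types can be sketched in Python as follows. This is an illustration of the description above, not the code of indexation.pl; entry_key and render are hypothetical names, and page numbers are left out.

```python
def entry_key(enclosed, value=None):
    """Build the internal entry from the enclosed text and the tag value."""
    if value is None:
        return enclosed               # plain tag: index the enclosed text
    if value.startswith(":"):
        return enclosed + value       # type 2, first form: "Jien 慈圓:et ..."
    return value                      # Index As, type 1, or type 2 second form

def render(key):
    """Turn an internal entry into the text printed in the index."""
    if "*" in key:                    # type 1: main entry, return, tab, sub-entry
        main, sub = key.split("*", 1)
        return f"{main}\n\t{sub}"
    if ":" in key:                    # type 2: the "main:" part becomes a tab
        return "\t" + key.split(":", 1)[1]
    return key

keys = [
    entry_key("Jien 慈圓", ":et l’ésotérisme Tendai"),
    entry_key("Jien 慈圓"),
    entry_key("aristocratique", "aristocratie*pouvoir de"),
]
# Sorting puts "Jien 慈圓" directly before "Jien 慈圓:et ...", so the
# sub-entry is printed under its main entry.
for key in sorted(keys):
    print(render(key))
```

If the plain "Jien 慈圓" key is removed from the list, the type 2 line is still printed with its leading tab but has no main entry above it, which is the orphan case described above.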

Important limitation: you must not use certain characters inside indexing tag values, i.e. the text xxxxxx in <a_tag="xxxxxx"> (apart from "*" and ":" in their role as key characters). Those characters are:
the double-quote and the characters "<", ">", "*", ":" and "="
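A tag value could be checked before indexing with something like the Python sketch below. It rests on an assumption that is not stated in this ReadMe: that a single "*" or ":" is acceptable because it serves as the sub-entry key character, and that any further occurrence is a problem. The name problems is hypothetical.

```python
FORBIDDEN = set('"<>*:=')
KEY_CHARS = set("*:")

def problems(value):
    """Return the characters in a tag value that are likely to cause trouble."""
    found = [c for c in value if c in FORBIDDEN]
    # Assumption: the first "*" or ":" is the sub-entry key character, so allow it.
    for i, c in enumerate(found):
        if c in KEY_CHARS:
            del found[i]
            break
    return found

print(problems("aristocratie*pouvoir de"))   # no problem characters
print(problems("a*b:c"))                     # the second key character is reported
```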

Before finishing, you must choose how page numbers are output in the generated index. There are two options:
1. "uniq": if this option is chosen, there will be only one page number even if the same text (in the same index field) occurs several times on a page; and consecutive page numbers will be joined by a hyphen. This is the standard index format, for example:
Jien 慈圓 150-152, 154, 180...
2. "multi": with this option, there will be as many page numbers as there are occurrences of the indexed text; a hyphen between page numbers is used only when the indexed text runs across more than one page. Example:
Jien 慈圓 150, 150, 151, 152, 152, 154, 154, 180...
One or the other option must be on the second line of the file:
uniq
or
multi
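The two output options can be sketched in Python as below. This is an illustration, not the code of indexation.pl: format_uniq and format_multi are hypothetical names, each occurrence is represented as a (first_page, last_page) pair, and sorting the pages numerically in "uniq" mode is an assumption.

```python
def format_uniq(spans):
    """One number per page, with consecutive pages joined by a hyphen."""
    pages = sorted({p for first, last in spans for p in range(first, last + 1)})
    runs = []
    for p in pages:
        if runs and p == runs[-1][1] + 1:
            runs[-1][1] = p           # extend the current run of pages
        else:
            runs.append([p, p])       # start a new run
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)

def format_multi(spans):
    """One item per occurrence; a hyphen only when an occurrence spans pages."""
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in spans)

occurrences = [(150, 150), (150, 150), (151, 151), (152, 152),
               (152, 152), (154, 154), (154, 154), (180, 180)]
print(format_uniq(occurrences))   # 150-152, 154, 180
print(format_multi(occurrences))  # 150, 150, 151, 152, 152, 154, 154, 180
```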

Finally, we have to gather the names of the indexing tags and put them on the first line of the file. Tag names must be separated by ", " (a comma and a space). Be sure that there are no extra spaces, and that no non-indexing tags are listed (for example "fn" or "paragr" in our example).
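The two header lines can be checked with a few lines of Python before running the scripts. This is a sketch under assumptions: read_header is a hypothetical name, and the tag names on the first line are assumed to be written without angle brackets.

```python
def read_header(lines):
    """Parse line 1 (indexing tag names) and line 2 ("uniq" or "multi")."""
    tags = lines[0].rstrip("\r\n").split(", ")
    mode = lines[1].strip()
    for tag in tags:
        # An empty name, stray spaces or a leftover comma all point to a
        # separator other than exactly ", ".
        if not tag or tag != tag.strip() or "," in tag:
            raise ValueError(f"badly separated tag name: {tag!r}")
    if mode not in ("uniq", "multi"):
        raise ValueError(f"second line must be 'uniq' or 'multi', got {mode!r}")
    return tags, mode

header = ["author, date, keyword, name, ref_book, sect, title\n", "uniq\n"]
print(read_header(header))
```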

The final two tasks, i.e. choosing one or the other of "uniq" or "multi" options, and gathering the tags, can be done with a Nisus Writer macro named "gather_tags", included in the macro file "free_tagging macro". BUT this macro cannot know if a tag is an indexing tag or not; so you must edit the list made by the macro.

When you have finished all this tagging work, you will save and close the file, and drag-and-drop the file, either on the icon of the MacPerl droplet "indexation.dp", or on that of the AppleScript droplet "indexation.app" (which runs the Perl script "indexation.pl"). The two droplets work exactly the same way, except that the MacPerl droplet generates the index file with the extension ".idx", while the AppleScript droplet generates the index file with "-idx.txt" extension.

You can use, in principle, any encoding for the tagged file: MacRoman, Shift-JIS or UTF-8. Both the MacPerl and Perl scripts use a lot of memory when the file is large. If MacPerl runs out of memory, please allocate more memory to it.

I hope that this tool will be of some help in your research work.

Download
Please download the package "Indexation_pl" from here (76 KB).

This page was last built with Frontier on a Macintosh on Mon, Aug 18, 2003 at 12:24:53 AM. Thanks for checking it out! Nobumi Iyanaga