Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 12/29/03.

multiformat_nisus

© Nobumi Iyanaga July 2003
New version: v. 0.7.1 Uploaded Mon, Dec 15, 2003
Even newer version: v.0.7.1.1 Uploaded Mon, Dec 29, 2003 with a minor bug fix
A minor upgrade version, compatible with Panther

Introduction

I have been using Nisus/Nisus Writer for about 10 years to write all my word-processor documents. I am a Japanese researcher in the field of Buddhist studies, and I spent some time in France when I was young. My main language is Japanese, but I sometimes use French and English, and I have had to use some special fonts for the transliteration of Asian languages (Japanese, Sanskrit, Chinese, etc.). -- By the way, I created my own font for this purpose (ITimesSkRom, that you can download from this link).

When I moved to OS X, many things changed. One of the most important changes is the use of Unicode. In many regards, Unicode is certainly better than the many older encodings -- especially for users like me, who need to write multilingual text, and who had to use private-encoding fonts. But converting my older files to Unicode is not an easy task; moreover, I realized that, as Unicode files are plain text files, my older files lose all their formatting information if I simply convert them. I am particularly sensitive to the need for formatting information, because the Classic Nisus file format had (and still has) the wonderful and unique feature of being at the same time "plain text" and "styled text," so that it is easy to search in Classic Nisus files using plain text-file search utilities like MgrepApp, and easy to recognize visually (or not...) the meaning of particular parts of my documents (for example, italics used to emphasize a meaning or to indicate the title of a book, or a "Contents" marker to indicate that certain parts of a document are chapter titles).

This is why I thought that it would be important to preserve at least a minimum of the style/structure information of my Classic Nisus files when I convert them to Unicode. Once converted to Unicode, my documents should be saved in (at least) two formats: one with nothing but the plain text, for searching, and another that is marked up to make the documents reusable for other purposes, for example for database use, html format, TeX format for typesetting, etc. The best mark-up language is certainly xml, and more specifically for users such as myself, the TEI (Text Encoding Initiative) markup system. But before coming to that step, I have to find a way to convert my files to Unicode with some basic style/structure information.

This is the main purpose of this package. The method I propose here is a very simple marking-up system. Based on marked-up Classic Nisus files, it is easy to convert them to marked-up Unicode files, Unicode files without any tags, and also to other formats, like TeX, html, and Cocoa-rtf.

New version info

I continued to develop this package. There are some changes and improvements:
As of version 0.7 (uploaded on 30.7.2003)

  1.   The new version no longer uses Perl 5.8; it uses instead Perl 5.6 (or 5.6.1) installed at /usr/bin (this is the built-in Perl in OS 10.2).
  2.   It no longer uses Perl 5.8's built-in code conversion system, but uses instead a custom conversion system based on the conversion table provided by Apple. The conversion may be slower, but is probably more accurate and more flexible. If it encounters a malformed (or an unknown) Shift-JIS code, it will output a readable hexadecimal code (<sjisXXXX>).
  3.   I rewrote the rtf conversion script, so that it no longer needs the "language tags".
  4.   Conversions to styled formats (to rtf, html and TeX) support a new alignment tag: <bblockquote>xxxx</bblockquote>: this is a "blockquote inside blockquote" style (more indented than <blockquote>).
  5.   Some "structure tags" are introduced:
    •  <title>xxx</title>
    •  <author>xxx</author>
    •  <section>xxx</section>
    •  <subsection>xxx</subsection>
    With a new command embedded in the tagged document, it is possible to generate a Table of Contents with TeX and html conversions (these new tags are not used in rtf conversion...).  I added corresponding macros to "multiformat_nisus" macros.
  6.   I added support for two Japanese-specific styles:
    •  <rubi="yyy">xxx</rubi>
    •  <bouten>xxx</bouten>
    They are used only in the TeX conversion. [<rubi>xxx</rubi> requires a TeX style file named "furikana.sty"; if you don't have it, please do not use it...]
  7.   I added support for more diacritical characters for TeX conversion.
  8.   The TeX conversion script tries to open the converted TeX file with "mi". Even if mi is not installed, it will no longer generate an error message.
  9.   I added some more macros, and changed the titles of the macros so that it is easier to use them.
  10.   I tried to improve scripts in general.
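
The fallback described in item 2 can be illustrated with a minimal sketch. This is not the code of the package: the two-entry table below is a hypothetical excerpt (the real scripts use the full table provided by Apple), the function name is invented, and single bytes above ASCII (for example half-width katakana) are simply flagged instead of being converted.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a Shift-JIS -> Unicode table; the real scripts
# are said to use the full conversion table provided by Apple.
my %sjis2uni = (
    0x82A0 => 0x3042,    # HIRAGANA LETTER A
    0x82A2 => 0x3044,    # HIRAGANA LETTER I
);

# Convert a Shift-JIS byte string to a Perl character string.
# Unknown or malformed codes become a readable <sjisXXXX> marker.
sub sjis_to_unicode {
    my ($bytes) = @_;
    my $out = '';
    my @b   = unpack('C*', $bytes);
    while (@b) {
        my $c = shift @b;
        my $is_lead = ($c >= 0x81 && $c <= 0x9F) || ($c >= 0xE0 && $c <= 0xFC);
        if (!$is_lead) {
            # ASCII passes through; other single bytes are flagged (a simplification).
            $out .= $c < 0x80 ? chr($c) : sprintf('<sjis%04X>', $c);
            next;
        }
        # A lead byte at the very end of the input is malformed: pad with 00.
        my $trail = @b ? shift @b : 0;
        my $code  = ($c << 8) | $trail;
        $out .= exists $sjis2uni{$code}
            ? chr($sjis2uni{$code})
            : sprintf('<sjis%04X>', $code);
    }
    return $out;
}
```

With this sketch, the bytes 0x82 0xFF, which are not in the table, come out as the text "<sjis82FF>", so the problem spot stays visible and searchable in the converted file.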

As of version 0.7.1: this is a minor upgrade version, mainly for the compatibility with Mac OS 10.3 Panther (uploaded on Sat. 13 Dec. 2003):

  1. The default Perl installed with Panther is Perl 5.8.1-RC3. Some of the scripts used in multiformat_nisus 0.7 no longer work in this environment. So, I will upload two packages, one compatible with Jaguar, and another, compatible with Panther.
  2. I added a little new option in the conversion to rtf format: now, if you choose "Footnotes" in the first dialog, another dialog will ask you "Do you want to set the Japanese font to MS P Mincho?", with three options: "Cancel", "MS P Mincho", "Hira-Mincho", the default being "Hira-Mincho". If you choose here "MS P Mincho", the Japanese font will be set to MS P Mincho, a default font in Windows. The converted rtf file with this setting will be "friendlier" in the Windows environment. -- Note that this will work with or without Footnotes in the text, so that if you want to use the rtf file in Windows, it will be better to choose "Footnotes", then "MS P Mincho", even if your file does not contain any notes.
  3. An unintended extra line used to be added at the beginning of the rtf text. This line has been removed.
  4. For the conversion to TeX, the default TeX editor that will open automatically after the conversion is no longer "mi", but TeXShop (download TeXShop from http://www.uoregon.edu/~koch/texshop/texshop.html, and set it up for Japanese following the instruction found at http://home.att.ne.jp/alpha/z123/texshop-j.html).
  5. A new option for TeX conversion is added: a dialog will ask you if you want the "no leading space" option. If you click on "Yes", a new line "\parindent=0pt" will be added after "\begin{document}", and there will be no automatic leading space at the beginning of paragraphs. If you use this option, you will no longer need to run the Nisus macro "jp-delete_leading_space".
  6. I added the support of Unicode gaiji, of format "&U+hex_value;". If the converting scripts find this sequence, they will try to convert it to its Unicode equivalent.
    • In the utf-8 conversion and utf-8 without tag conversion, the scripts will try to convert to the Unicode equivalent every sequence of format "&U+hex_value;".
    • In the conversion to rtf format, the sequence of format "&U+hex_value;" will be converted to its Unicode equivalent, with the Japanese font (Hiragino Mincho Pro).
    • In the conversion to TeX format, the sequence of format "&U+hex_value;" will be converted to "\UTF{hex_value}", and the line "\usepackage[deluxe, expert, multi]{otf}" will be added to the preamble. This assumes that the otf package is installed. For the installation of the otf package, see below.
  7. Otherwise, nothing is changed for the conversions to html, utf-8 text, and utf-8 text without tags (except that a version compatible with Panther is added).
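
The two gaiji rewrites of item 6 might look roughly like the following sketch. The function names are invented for illustration, and the limit of 4 to 6 hexadecimal digits is an assumption; the actual scripts may accept other lengths.

```perl
use strict;
use warnings;

# UTF-8 conversions: replace every "&U+hex_value;" sequence with the
# character it names. (4 to 6 hex digits is an assumption.)
sub gaiji_to_char {
    my ($text) = @_;
    $text =~ s/&U\+([0-9A-Fa-f]{4,6});/chr(hex($1))/ge;
    return $text;
}

# TeX conversion: replace every "&U+hex_value;" sequence with
# \UTF{hex_value}, which the otf package then typesets.
sub gaiji_to_tex {
    my ($text) = @_;
    $text =~ s/&U\+([0-9A-Fa-f]{4,6});/sprintf('\\UTF{%s}', $1)/ge;
    return $text;
}
```

For example, gaiji_to_tex('&U+57F5;') gives \UTF{57F5}, while gaiji_to_char returns the single character U+57F5; the result still has to be written out as UTF-8 (for instance through an ":encoding(UTF-8)" output layer).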

I hope that this package will be more useful than the initial package.


The package contains:


Requirements:

OS 10.2x or OS 10.3x

Optionally (if you use the TeX conversion droplet):
pLaTeX (a special distribution of LaTeX enabled to deal with Japanese text):

The package I installed is "pTeX(sjis) + JMacoros package for MacOSX" for OS X, distributed from:
http://www2.kumagaku.ac.jp/teacher/herogw/
The latest version as of today is http://www2.kumagaku.ac.jp/teacher/herogw/archive/bigptex031208.dmg
and
"ESP Ghostscript 7.07.1 for MacOSX 10.2/10.3" distributed from:
http://www2.kumagaku.ac.jp/teacher/herogw/
The latest version as of today: http://www2.kumagaku.ac.jp/teacher/herogw/archive/gs20031102.dmg

You may need also "Mxdvi.app", that you can get from:
http://macptex.appi.keio.ac.jp/~uchiyama/macptex.html
The latest version as of today: http://macptex.appi.keio.ac.jp/~uchiyama/mxdvi0260.tar.gz

The script "sjis_htmlEntity2TeX.app" launches the editor "TeXShop".
Please download TeXShop from http://www.uoregon.edu/~koch/texshop/texshop.html, and set it up for Japanese following the instruction found at http://home.att.ne.jp/alpha/z123/texshop-j.html.

The tag <rubi>xxx</rubi> requires the TeX style file named "furikana.sty" that you can download from:
<http://imt.chem.kit.ac.jp/fujita/fujitas2/texlatex/>
<http://imt.chem.kit.ac.jp/fujita/fujitas2/texlatex/tategumi/furikana.sty>
I installed it at:
/usr/local/share/texmf/ptex/platex/misc/jtate/furikana.sty

To use the Unicode gaiji in TeX, we have to install the "OTF.sty for MacOSX". You can find the latest version at http://www2.kumagaku.ac.jp/teacher/herogw/ (the latest version for now is http://www2.kumagaku.ac.jp/teacher/herogw/archive/otfs20031209.dmg).
It was not easy for me to install this package correctly. If you have problems, you may ask questions at http://www.r.dendai.ac.jp/cgi-bin/ptex/treebbs.cgi; you may also write to me if you want...

After installing a new package in TeX, you have to issue the following command in Terminal (sudo will then ask for your password):
% sudo mktexlsr
Password: xxxx


Installation:

Put the five AppleScript scripts on the Desktop, or anywhere you can access them easily.
They are droplets (they accept only files; please don't drop folders on them!)

Put the folder "bin" in your HOME folder (if you have already a folder named "bin" in your HOME folder, put the contents of "bin" inside that folder).

The main Nisus macro file to be used is "multiformat_nisus"; the two other macro files are added only as appendices.


Supported styles

Style tags:

Italic: <i>xxxx</i>
Bold: <b>xxxx</b>
Superscript: <sup>xxxx</sup>
Subscript: <sub>xxxx</sub>

Font size: <font size="n">xxxx</font>

Supported alignments:

center: <align="center">xxxx</align>
right: <align="right">xxxx</align>
left: <align="left">xxxx</align>
justified: <align="justified">xxxx</align>
blockquote: <blockquote>xxxx</blockquote>
bblockquote: <bblockquote>xxxx</bblockquote> ("quotation inside quotation" format)

Supported footnote format:

<fn>xxxx</fn> in the main text
You can use macros contained in the macro file "NoteMacros" to make these notes.

Some "structure tags":

<title>xxx</title>
<author>xxx</author>
<section>xxx</section>
<subsection>xxx</subsection>

<section> and <subsection> are used in the TeX and html conversions.  Combined with the tagged command '<html/TeX-command="makeTOC">', it is possible to generate a Table of Contents (in the case of html conversion, it will be a cross-linked table of contents) at the top of the document.

Japanese specific tags:

<rubi>xxx</rubi>
<bouten>xxx</bouten>

These two tags are used only in the TeX conversion.  The tag <rubi>xxx</rubi> requires a TeX style file named "furikana.sty".

Supported languages:

Japanese
MacRoman (7 bit): all the "higher ASCII" characters will be converted to html entity format
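
To make the "7 bit" idea concrete, here is a small sketch of such a rewrite, using Perl's standard Encode module. In the package this job is done by a Nisus macro, not by Perl, so the function below is purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn MacRoman bytes into 7-bit text: ASCII characters stay as they are,
# every other character becomes a decimal HTML entity.
sub macroman_to_entities {
    my ($bytes) = @_;
    my $text = decode('MacRoman', $bytes);
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}
```

The MacRoman byte 0x8E is "é" (U+00E9), so the bytes "caf\x8E" become "caf&#233;".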

Supported Unicode format:

HTML entity format in decimal:  "&#1234;" format, and (for Japanese "gaiji", as of version 0.7.1:) "&U+hex_value;".
In the conversion to TeX format, the "gaiji" in the format "&U+hex_value;" requires the otf package installed.

To find out a gaiji code in Unicode, there may be several ways:

  1.   Use OS X Japanese input method's Character map feature.  With EGBridge for example, you can use a hand-writing pad to enter a character shape, and find the Unicode code.
  2.   Use Radical-Stroke Index of the Unicode web site: <http://www.unicode.org/charts/unihanrsindex.html>
    For example, to find the character "ta" (Ch. duo3) of Kongō-satta (Vajrasattva), you will point your browser to:
    1.  <http://www.unicode.org/charts/unihanrsindex.html>
    2. Click on the number 3 (stroke number of the radical "Earth"): <http://www.unicode.org/cgi-bin/UnihanRadicalIndex.pl?strokes=3>
    3. Press the radio button of the radical "Earth", and enter in the additional strokes field "8" in minimum and "8" in maximum (stroke number of the additional strokes of the character "ta"), and press the button "Submit": <http://www.unicode.org/cgi-bin/UnihanRSIndex.pl?minstrokes=8&maxstrokes=8&submit=Submit&radical=32>
    4. Find in the list of characters the character "ta" (duo) (it is the 32nd character in the list): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=57F5&useutf8=false
    5. In the new page for the character "ta", you will find its number: decimal 22517, or in hex value, U+57F5.
  3. You can use UnicodeChecker (<http://www.earthlingsoft.net/UnicodeChecker/>) to get the decimal value of a given hexadecimal value, or vice versa.
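
If you prefer to do the hexadecimal/decimal conversion yourself, it is a one-liner in Perl. The following sketch (with invented function names) rewrites between the two supported notations:

```perl
use strict;
use warnings;

# "&U+hex_value;" -> "&#decimal;"
sub gaiji_to_decimal_entity {
    my ($s) = @_;
    $s =~ s/&U\+([0-9A-Fa-f]+);/'&#' . hex($1) . ';'/ge;
    return $s;
}

# "&#decimal;" -> "&U+HEX;" (at least four upper-case hex digits)
sub decimal_entity_to_gaiji {
    my ($s) = @_;
    $s =~ s/&#([0-9]+);/sprintf('&U+%04X;', $1)/ge;
    return $s;
}
```

For the character "ta" of the example above, "&U+57F5;" becomes "&#22517;" and vice versa.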

Supported "special diacritical fonts":

ITimesSkRom

You can add other fonts using macros contained in the macro file "font<->HTML_Entities macros" (for these macros, please see my other web page: East Asian Diacritical Fonts and Unicode).
Please copy one of the macros that you will find in the macro file "font<->HTML_Entities macros" and paste it in the macro file "multiformat_nisus".
List of fonts: 

Special note about "MacRoman2HtmlEntity":
In this particular macro, the font attributes are all set to "Any Font/Any Size/Plain Text/+Any Style/Any More Defined Style/Any Language/Any Color" -- so you MUST NOT run this macro unless you are SURE that all your Roman text is written in MacRoman encoding.
DON'T use this macro if you have text written in any of the special diacritical fonts.


Supported conversions:

rtf (with Unicode characters...):

If the file contains footnotes in the format "<fn>xxxx</fn>", they can be converted into two formats (or rather, three formats, as of version 0.7.1):

a. Endnote type: the main text will have note numbers (1), (2), (3)... and at the end of the text, there will be notes with their numbers.
This type can be read by any (Cocoa-)rtf savvy editor/word-processor: TextEdit, Nisus Writer Express, etc. [But if you edit notes, the note numbers will not be updated automatically!]
b. Footnote type: use the normal rtf footnote format. Can be read by MS Word...
A dialog will ask you the note format that you want (even if the file doesn't contain any notes...); if you choose the "Footnote" format, the file will have "-f.rtf" extension; otherwise, it will have only ".rtf" extension.

c. Another dialog asks you "Do you want to set the Japanese font to MS P Mincho?", with three options: "Cancel", "MS P Mincho", "Hira-Mincho", the default being "Hira-Mincho". If you choose here "MS P Mincho", the Japanese font will be set to MS P Mincho, a default font in Windows. The converted rtf file with this setting will be "friendlier" in the Windows environment. -- Note that this will work with or without Footnotes in the text, so that if you want to use the rtf file in Windows, it will be better to choose "Footnotes", then "MS P Mincho", even if your file does not contain any notes.
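
The endnote type (a) amounts to numbering the notes and moving their text to the end. A minimal sketch of that idea, leaving out all rtf markup (the function name and the exact layout of the note list are assumptions):

```perl
use strict;
use warnings;

# Replace each <fn>...</fn> with "(n)" and append the collected notes,
# numbered in the same way, after the main text.
sub fn_to_endnotes {
    my ($text) = @_;
    my @notes;
    my $number = sub { push @notes, $_[0]; return '(' . scalar(@notes) . ')' };
    $text =~ s{<fn>(.*?)</fn>}{$number->($1)}gse;
    return $text unless @notes;

    my $list = '';
    for my $i (0 .. $#notes) {
        $list .= sprintf("(%d) %s\n", $i + 1, $notes[$i]);
    }
    return "$text\n\n$list";
}
```

Because the numbers are written into the text as ordinary characters, they are not renumbered when you later add or delete a note, which is the limitation mentioned under (a).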

The rtf conversion uses Hiragino Mincho Pro-W3 as default Japanese font (or MS P Mincho);
Gandhari Unicode Roman as default Roman font;
Gandhari Unicode Italic as default Italic font;
Gandhari Unicode Bold as default Bold font.
You can download Gandhari fonts from http://depts.washington.edu/ebmp/software.htm
Extension added to the original file name:
".rtf" or "-f.rtf"

UTF-8 with tags

The "html title" and "html/TeX-command" will be removed; and Unicode HTML entities will be converted to Unicode characters, but all other tags will be preserved.
Extension added:
"-utf8.txt"

UTF-8 without tags

All the tags (except <fn>xxx</fn> tags) will be removed (this is mainly for the searching facility).
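
As a rough sketch of this step (not the script of the package), a single substitution can drop every tag except the note tags. It assumes that "<" and ">" appear in the text only as tag delimiters:

```perl
use strict;
use warnings;

# Remove every <...> tag except <fn> and </fn>.
# Assumption: "<" and ">" occur in the text only as tag delimiters.
sub strip_tags_except_fn {
    my ($text) = @_;
    $text =~ s{<(?!/?fn>)[^<>]+>}{}g;
    return $text;
}
```

For example, '<i>x</i><fn>n</fn> <align="center">c</align>' is reduced to 'x<fn>n</fn> c'.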
Extension added:
"-utf8-woTag.txt"

HTML in UTF-8

The title of the html page will be provided by the special tag that you will add at the beginning of the file
<htmlTitle="xxxxx">
Notes in <fn>xxx</fn> format will be placed at the end of the document, as cross-linked notes.
If you add the command:
<html/TeX-command="MakeTOC">
a cross-linked table of contents will be generated for text tagged <section>xxx</section> and <subsection>xxx</subsection>.
You will have to add manually links, images, lists, tables, etc....
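
How a cross-linked table of contents can be derived from the structure tags is sketched below. The anchor names (sec1, sec2, ...), the heading levels and the class names are assumptions chosen for the example, not the markup that the droplet really writes, and the check for the MakeTOC command tag is left out.

```perl
use strict;
use warnings;

# Turn <section>/<subsection> tags into headings and put a list of links
# to them in front of the text. Each heading links back to the list.
sub add_html_toc {
    my ($text) = @_;
    my (@toc, $n);
    my $heading = sub {
        my ($tag, $title) = @_;
        $n++;
        my $level = $tag eq 'section' ? 2 : 3;
        push @toc, qq{<li class="toc$level"><a href="#sec$n">$title</a></li>};
        return qq{<h$level><a name="sec$n" href="#toc">$title</a></h$level>};
    };
    $text =~ s{<(section|subsection)>(.*?)</\1>}{$heading->($1, $2)}gse;
    return $text unless @toc;
    return qq{<ul id="toc">\n} . join("\n", @toc) . "\n</ul>\n" . $text;
}
```

A text without any section tags is returned unchanged.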
Extension added:
"-utf8.html"

(Japanese) TeX format:

The LaTeX preamble used is:

\documentclass[a4j]{jarticle}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{textcomp,amsfonts}
\begin{document}

You can change the document class to (for example)
\documentclass{article}
if your file is mainly in English (or a Roman language).

If you use the <rubi="xxx">yyy</rubi> tag, another package:
\usepackage{furikana}
will be added to the preamble.

And if you use the gaiji format &U+hex_value;, another package:
\usepackage[deluxe, expert, multi]{otf}
will be added to the preamble.

If you choose "No leading space" option at the dialog when you drag-&-drop your file onto the droplet, the command:
\parindent=0pt
will be added after
\begin{document}
and there will be no automatic leading space at the beginning of paragraphs. If you use this option, you will no longer need to run the Nisus macro "jp-delete_leading_space".
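
Put together, the rules above could be expressed as in the following sketch. The function name is invented, $no_indent stands for the answer given in the "No leading space" dialog, and the position of the optional \usepackage lines relative to the fixed ones is an assumption.

```perl
use strict;
use warnings;

# Assemble the preamble for a tagged document, following the rules above.
sub build_preamble {
    my ($text, $no_indent) = @_;
    my @lines = (
        '\documentclass[a4j]{jarticle}',
        '\usepackage{times}',
        '\usepackage[T1]{fontenc}',
        '\usepackage{textcomp,amsfonts}',
    );
    # Optional packages, added only when the document needs them.
    push @lines, '\usepackage{furikana}' if $text =~ /<rubi\b/;
    push @lines, '\usepackage[deluxe, expert, multi]{otf}'
        if $text =~ /&U\+[0-9A-Fa-f]+;/;
    push @lines, '\begin{document}';
    push @lines, '\parindent=0pt' if $no_indent;
    return join("\n", @lines) . "\n";
}
```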

The optional "title" and "author" commands will be added if you use the <title>xxx</title> and <author>xxx</author> tags.  They don't do anything by themselves, but you can use them with the "maketitle" command:
\title{xxx}
\author{xxx}
\date{xxx}
\maketitle
[You would add manually \date and \maketitle commands]

The conversion supports currently 264 diacritical characters (you will find a list in the file "htmlEntity2TeX.txt").
I will add more if I am able and will have time.

Notes in <fn>xxx</fn> tags will be converted to footnotes.

Text in <section>xxx</section> and <subsection>xxx</subsection> tags will be used as section and subsection titles.  Sections and subsections will be numbered in the style "1, 1.1, 1.2; 2, 2.1, 2.2...".  If your section and subsection titles are already numbered, you will have to remove the numbers manually; or, you can add "*" (asterisk) after the section and subsection commands [but then, these section and subsection titles will not be included in the Table of Contents].
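
The mapping from tags to LaTeX commands can be pictured with a short sketch. It covers only the section, subsection and note tags, and the $unnumbered switch (which writes the starred commands mentioned above) is a hypothetical addition; the droplet itself has no such option.

```perl
use strict;
use warnings;

# Tag name => LaTeX command name.
my %tex_command = (
    section    => 'section',
    subsection => 'subsection',
    fn         => 'footnote',
);

# Rewrite <tag>xxx</tag> as \command{xxx}. With a true $unnumbered,
# sections and subsections get the starred (unnumbered) form.
sub tags_to_tex {
    my ($text, $unnumbered) = @_;
    for my $tag (sort keys %tex_command) {
        my $star = ($unnumbered && $tag ne 'fn') ? '*' : '';
        $text =~ s!<$tag>(.*?)</$tag>!'\\' . $tex_command{$tag} . $star . '{' . $1 . '}'!gse;
    }
    return $text;
}
```

For example, "<section>A</section> x<fn>n</fn>" becomes "\section{A} x\footnote{n}".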

If you place the special "html/TeX command" tag:
<html/TeX-command="makeTOC">
at the beginning of document (just after html title tag), a Table of Contents will be generated at the beginning of the TeX document.  You will have to compile the TeX source at least twice to have the table of contents.

After the conversion is done, the AppleScript script "sjis_htmlEntity2TeX" will activate the editor "TeXShop" (as of version 0.7.1; earlier versions opened "mi" if it was installed).
Extension added:
".tex"

======================

The conversions to: 

UTF-8 with tags, UTF-8 without tags

are normally supposed to work well;

Other conversions, to:

rtf, html and TeX

will do only the minimum. You will almost certainly need to tweak the output to get the results you want.

For TeX especially, if the compiling does not work, please quit the compilation by entering "X" in TeXShop's console window ("X" for "exit"?), then examine and edit the source to get better results.

If there is already a file of the same name (with the respective extension) in the same folder, every conversion will overwrite that file.


How to work:

All the conversions themselves are done by dragging and dropping the tagged files onto the AppleScript droplets.

Tagging the original files requires careful work.

Here are some points to which you should pay special attention:


Order of the macros to run:

First of all, save the file with a different name!! Then...

0.  If the file contains footnotes, first, run one of the macros of the macro file "NoteMacros" (for example "0-Footnotes->fnTagNotes") to set the notes in <fn>xxx</fn> tags. -- See my other page, Migrating from Classic Nisus Writer to OS X NW-Express for more details on the "NoteMacros".

1. "1Pre-process": it sets all the return characters and <fn>/</fn> tags in Plain Text/Times.

2. Optionally: different macros inserting "structure tags":

2Insert_author_tag
2Insert_section_tag
2Insert_subsection_tag
2Insert_title_tag

3. "3Insert_italic-etc_tag": it inserts <i>, <b>, <sup> and <sub> tags.

4. Optionally: different macros inserting font size tags:

4FontSize10
4FontSize14
4FontSize18
4FontSize24
4FontSize9
4FontSize_automatic
4FontSize_to_specify

The font sizes will be reflected exactly in the rtf conversion.
In the TeX conversion, the following sub-routine will be used:


	if ($size < 7) {
		return ("scriptsize");
	}
	elsif ($size >= 7 && $size <= 10) {
		return ("small");
	}
	elsif ($size >= 13 && $size <= 18) {
		return ("Large");
	}
	elsif ($size > 18 && $size < 25) {
		return ("huge");
	}

In the html conversion, the following sub-routine will be used:


	if ($size > 12 && $size <= 14) {
		return ("+1");
	}
	elsif ($size > 14 && $size <= 18) {
		return ("+2");
	}
	elsif ($size > 18 && $size <= 24) {
		return ("+3");
	}
	elsif ($size < 12 && $size >= 9) {
		return ("-1");
	}
	else  {
		return ("");
	}
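
The two excerpts above are only the bodies of the sub-routines. Wrapped as complete functions they might read as follows; the function names are invented, and returning an empty string for sizes that the TeX excerpt does not cover (for example 11 or 12 points, or 25 points and above) is an assumption.

```perl
use strict;
use warnings;

# Point size -> LaTeX size command name, with the ranges quoted above.
sub tex_size {
    my ($size) = @_;
    return 'scriptsize' if $size < 7;
    return 'small'      if $size <= 10;                  # 7 to 10
    return 'Large'      if $size >= 13 && $size <= 18;
    return 'huge'       if $size > 18 && $size < 25;
    return '';                                           # not covered (assumption)
}

# Point size -> relative html font size, with the ranges quoted above.
sub html_size {
    my ($size) = @_;
    return '+1' if $size > 12 && $size <= 14;
    return '+2' if $size > 14 && $size <= 18;
    return '+3' if $size > 18 && $size <= 24;
    return '-1' if $size >= 9 && $size < 12;
    return '';
}
```

So a 14-point passage would come out as "Large" in TeX and as "+1" in html, while 12-point text is left at the default size in both.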

5. Different macros inserting "align" tags:

5Insert_align-center
5Insert_align-justified
5Insert_align-left
5Insert_align-right
5Insert_bblockquote
5Insert_blockquote
Don't forget to add <align="justified">xxxx</align> tags to the paragraph that follows a paragraph in which you inserted other <align...> tags. [You must locate and select manually/visually the paragraphs in which these tags must be inserted]
For the <bblockquote> tag, please see above.

6. One of the <font>2HTMLEntity macros. (There must be no selection before running this macro.) Of course, if you don't use any "higher ASCII" characters in any of the "special fonts", you don't need to run this macro. -- This macro uses a lot of memory if the file is large and contains many diacritical characters. I recommend saving the file once before running this macro.

7. Optionally: one or both of the following two macros:

7htmlTitle
7MakeTOC

8. "8Clear Invalid characters", which will remove all invalid characters.

Finally, you can use the two "verify tags" macros:

[verify_tag_util1]
[verify_tag_util2]
to verify the tags.
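
If you want a second check outside Nisus, the kind of test involved can be sketched in a few lines of Perl. This is a hypothetical stand-alone checker, not a port of the two macros: it only verifies that every opening tag has a matching closing tag and that tags are properly nested.

```perl
use strict;
use warnings;

# Return a list of problems: closing tags that do not match the most
# recently opened tag, and tags that are never closed.
# Stand-alone tags (<htmlTitle="...">, <html/TeX-command="...">) are skipped.
sub check_tags {
    my ($text) = @_;
    my (@stack, @errors);
    while ($text =~ m{<(/?)([A-Za-z]+)(?:\s+[A-Za-z]+)?(?:="[^"]*")?>}g) {
        my ($closing, $name) = ($1, $2);
        next if $name eq 'htmlTitle';
        if (!$closing) {
            push @stack, $name;
            next;
        }
        my $open = pop @stack;
        if (!defined $open) {
            push @errors, "</$name> has no opening tag";
        }
        elsif ($open ne $name) {
            push @errors, "</$name> closes <$open>";
        }
    }
    push @errors, "<$_> is never closed" for @stack;
    return @errors;
}
```

An empty result means that the tags are balanced; check_tags('<fn>x') returns the single message "<fn> is never closed".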


Japanese specific macros:

Macros with "jp-" are macros specific for Japanese text, and the TeX conversion.

jp-delete_leading_space:
When you compile a Japanese text with pLaTeX, with the preamble:
\documentclass[a4j]{jarticle}
a leading zenkaku-space is added automatically at the beginning of each paragraph, so that the leading zenkaku-spaces that you have added manually would be redundant.  The macro "jp-delete_leading_space" will remove all the leading zenkaku-spaces at the beginning of paragraphs (in non-tagged text, and tagged text). [If you use the option "No leading space", this macro is not needed...]
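
What this macro does can be approximated in Perl as follows. This is only a sketch of the idea: it assumes that paragraphs are separated by "\n" and that a "tagged" paragraph begins with one or more opening tags directly followed by the zenkaku-space (U+3000).

```perl
use strict;
use warnings;

# Delete zenkaku-spaces (U+3000) at the beginning of each paragraph,
# also when the paragraph begins with one or more opening tags.
# Assumption: paragraphs are separated by "\n".
sub delete_leading_zenkaku_space {
    my ($text) = @_;
    $text =~ s/^((?:<[^<>\/][^<>]*>)*)\x{3000}+/$1/mg;
    return $text;
}
```

A zenkaku-space in the middle of a line is left alone; only the ones at the start of a paragraph are removed.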

jp-set_rubi/bouten:
jp-lw-underline2bouten:
It is difficult to explain these two macros in a html document (most of the browsers being unable to display "ruby"...). Please look at the ReadMe file in Nisus format for details.


examples folder:

This folder contains two example files, a portion of an article in French with Japanese text and many diacritical characters, and a portion of an article in Japanese with "rubi" and "bouten". These files are converted to every possible format. Two files, "steps_dakini" and "steps_orientalism" describe all the steps I followed to markup these articles, and convert them to different formats.


Download

Please download the package "multiformat_nisus" for Jaguar from here (744 KB).

And please download the package "multiformat_nisus" for Panther from here (744 KB).


This page was last built with Frontier on a Macintosh on Mon, Dec 29, 2003 at 4:20:47 PM. Thanks for checking it out! Nobumi Iyanaga