Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 10/29/06.


rtf to UTF-8 text and pTeX

version 0.71
2006-10-05

Introduction

Mac OS X is a very good and stable OS, but we still have no good word-processing program for this environment. MS Word 2004 is too big and not easy to use; Nisus Writer Express is still unstable, and its feature set is not sufficient. (There are, however, two interesting editors/word-processors which support features such as vertical writing, footnotes/endnotes, etc.: iText Express and LightWayText [the latter also supports rubi, but not footnotes/endnotes]. I have never used them extensively, so I cannot say anything definite about them.) I still use Classic Nisus Writer for much of my daily work, but with the release of the Intel Macs, the future of the Classic environment on OS X seems practically dead. Under these conditions, one of the best options is LaTeX -- especially pTeX for Japanese and other languages. Of course, pTeX is not easy to use either, but potentially it can do practically everything we need to write an article or a book. Here are some of the things that can be achieved relatively easily with pTeX:

And above all, TeX typesetting is so beautiful that no other word-processor or DTP program can be compared in this regard.

Of course, we have to download and install pTeX and other packages, and this is not always simple. I myself have installed the package distributed by Ogawa Hirokazu: http://www2.kumagaku.ac.jp/teacher/herogw/index.html (bigptex060120.dmg, otfc20050519.dmg); I have also installed the Mojikyo package: http://www.mojikyo.org/html/download/tex/tex_list.html (cf. http://homepage.mac.com/dokuroryokan/hihou/labo04.html); the otfcjk package: http://psitau.at.infoseek.co.jp/experiment.html (cf. Okamura's TeXWiki); the furikana and kanbun packages by Fujita Shinsaku: http://imt.chem.kit.ac.jp/fujita/fujitas2/texlatex/index.html, etc.

But writing TeX code is not easy -- and of course, it is not a WYSIWYG environment. The ideal would be to write in a WYSIWYG environment and have the needed TeX code generated directly. In the absence of such a system, the next best thing is to write in a WYSIWYG program and generate the needed TeX code with some macro or script. This is what I tried to achieve with the scripts I present here. The WYSIWYG environment I use is any Cocoa rtf-savvy program: this may be TextEdit or Jedit X, and also Nisus Writer Express. Note that the rtf format of Nisus Writer Express is not the same as the plain Cocoa rtf format used by TextEdit or Jedit X.

With rtf savvy applications, you can write styled text, with Italic, Bold, etc. styles; you can set the indents of your text, or points of your fonts. There are other features, such as the footnotes/endnotes, etc. which are only supported by some applications (Nisus Writer Express). If you use MS Word, many more features are supported -- but the rtf format for Word 2004 is excessively complicated, and I am totally unable to deal with it.

On the other hand, the rtf format has features which are meaningless for TeX, for example the line height, the tab setting, the page margins, font size of footnotes, etc. These attributes are generally defined automatically in TeX, by the document class that we choose. There are also some features which are lacking in the Cocoa rtf format (TextEdit format or Nisus Writer Express format) which can be achieved automatically in TeX (for example the first line indent...), or easily with some packages (for example rubi, using the furikana package)... For these latter features, I will use some simple tags, which will be converted to TeX commands by script.

Note that converters from rtf files to TeX/LaTeX2e files already exist: see for example http://ctan.unsw.edu.au/support/rtf2tex/, and http://sourceforge.net/projects/rtf2latex2e/. Unfortunately, they do not seem to support Japanese or other double-byte or Unicode scripts.

This new version uploaded Sun, Oct 29, 2006 contains some bug fixes and some new features.

Bug fix: I tried to fix some bugs in NWE_rtf2utf8.pl and Cocoa_rtf2utf8.pl; I hope they are a little better.

New features:

and additionally,

See below for details.


Package contents

The basic scripts consist of three Perl scripts:
  1. NWE_rtf2utf8.pl and Cocoa_rtf2utf8.pl
  2. to_shiftjis.pl
  3. to_TeX.pl

As you can see, there are three steps in the process of converting from rtf to pTeX:

  1. first, we convert the original rtf file into a UTF-8 text file in which style attributes are described in tagged keywords;
  2. then, we convert that UTF-8 file to a Shift-JIS file, in which every character that cannot be represented in the JIS character set will be represented in html entity format (i.e. &#[decimal number of Unicode code point];);
  3. finally, we generate a valid pTeX file, converting tagged style attributes to appropriate TeX codes, and html entities to TeX representations of the corresponding characters.

    You will find here a list of all the Latin Unicode characters that will be converted (some of the combined characters will be displayed in wrong shapes in the browser, but in TeX typesetting, they should be in right shapes).
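
The second step above can be illustrated with a short sketch. This is not the code of to_shiftjis.pl (which is a Perl script); it is a minimal Python illustration, assuming that Python's standard "shift_jis" codec is an acceptable stand-in for the JIS character set used by the script. The function name is hypothetical.

```python
def to_sjis_with_entities(text: str) -> bytes:
    """Encode text as Shift-JIS.

    Characters that the codec cannot represent are replaced by decimal
    html entities of the form &#NNNN; (Python's built-in
    'xmlcharrefreplace' error handler does exactly this).
    """
    return text.encode("shift_jis", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # The kanji are kept as Shift-JIS bytes; the accented Latin letters,
    # which are outside the JIS set, become numeric entities.
    encoded = to_sjis_with_entities("般若 prajñā")
    print(encoded.decode("shift_jis"))
```

The third step would then turn such entities into the TeX representations of the corresponding characters.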

I added to these three basic scripts two series of AppleScript scripts, in order to facilitate their use:
One is for Nisus Writer Express:

And the other is for the plain Cocoa rtf applications (TextEdit and Jedit X):

The first one (NWE_rtf2TeX.app and cocoa_rtf2TeX.app) is a droplet; you can drag and drop Nisus Writer Express rtf files, or plain Cocoa rtf files, on the icon of one of the droplets, and generate automatically the UTF-8 text files, Shift-JIS text files, and the final TeX files.

The second one (NWE_rtf2TeX.scpt and cocoa_rtf2TeX.scpt) is a script that can be used as a macro either in Nisus Writer Express or in Jedit X. Running NWE_rtf2TeX.scpt from Nisus Writer Express' Macro menu, you can generate the three kinds of files automatically; similarly, running cocoa_rtf2TeX.scpt from Jedit X's Macro menu, you can generate the three kinds of files automatically.


Features

There are mainly two kinds of features:
  1. Style attributes of rtf format which are "translated" to tagged keywords in UTF-8 files. This is done automatically by the script NWE_rtf2utf8.pl or Cocoa_rtf2utf8.pl. You do not need to do anything about them. If you want, you can also use these style attributes to convert the UTF-8 files to other formats, e.g. html files...
  2. Some tagged keywords that will be "translated" to appropriate TeX code: you will have to insert these keywords manually in your rtf files. These keywords can be divided into two or three categories:
    1. Style/character attributes which are not supported by the Cocoa rtf format. These are for example "rubi", "bouten", kanbun style, etc. We can include here the footnotes and endnotes, which are supported in Nisus Writer Express rtf format, but not in plain Cocoa rtf format.
    2. Characters which are not supported in JIS or Unicode: principally Mojikyo characters.
    3. Cross-references, Table of Contents, sections and sub-sections, etc.: formattings that can be achieved easily using TeX commands.


Here is the list of the rtf style attributes that will be "translated" into tagged style keywords in the UTF-8 text files:

In Nisus Writer Express, you can use footnotes and endnotes; they will be "translated" to appropriate TeX code. -- For endnotes, you will need the endnotesj.sty package, included in Ogawa's distribution (cf. above).


You can use the following tagged keywords in your rtf documents to get the effects which are impossible in the Cocoa rtf format:

<fn>xxx</fn>
footnotes: example: main_text main_text<fn>footnote_text footnote_text</fn>main_text main_text
<en>xxx</en>
endnotes: example: main_text main_text<en>endnote_text endnote_text</en>main_text main_text.

The above two are for plain Cocoa rtf applications. You can of course use the footnote/endnote feature in Nisus Writer Express.
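
As an illustration of how these two tags could be turned into TeX, here is a minimal Python sketch. It is not the code of to_TeX.pl; the function name is hypothetical, and the target commands \footnote{...} and \endnote{...} are assumptions (the standard LaTeX footnote command, and the command usually provided by endnote packages).

```python
import re

# Assumed mapping from tag name to TeX command name.
TAG_TO_COMMAND = {"fn": "footnote", "en": "endnote"}

# Matches <fn>...</fn> or <en>...</en>; the closing tag must match the opening one.
NOTE_RE = re.compile(r"<(fn|en)>(.*?)</\1>", re.DOTALL)


def convert_notes(text: str) -> str:
    """Replace <fn>/<en> tagged keywords with the assumed TeX commands."""
    def repl(match):
        command = TAG_TO_COMMAND[match.group(1)]
        return "\\%s{%s}" % (command, match.group(2))
    return NOTE_RE.sub(repl, text)


if __name__ == "__main__":
    print(convert_notes("main_text<fn>footnote_text</fn>main_text"))
```

Nested notes are not handled by this sketch.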

The following keywords are for all the Cocoa rtf applications (TextEdit, Jedit X and Nisus Writer Express):

<rubi="xxxx">yyyy</rubi>
rubi (or furigana): example: <rubi="はんにゃ">般若</rubi>
<bouten>xxxx</bouten>
bouten: example: <bouten>これが</bouten>
<kanbun>xxxx</kanbun>
kanbun: this must be used in combination with the superscript and subscript styles of rtf format, and the <rubi> tagged keyword. Here is an example:
<kanbun>提婆論にも、婆羅門の古説を<rubi="シル">載</rubi>して、那羅延天。従臍生梵天。梵天衆生。大地其戒場タリ。一切衆生。と云ふ。</kanbun>
This will be "translated" to the following TeX code:
\usepackage{sfkanbun}
\begin{document}
提婆論にも、婆羅門の古説を\kana{載}{シル}して、那羅延天。\kundoku{従}{}{リ}{レ}臍\kundoku{生}{}{ス}{二}梵\kundoku{天}{}{ヲ}{一}。梵\kundoku{天}{}{ハ}{}\kundoku{為}{}{}{二}衆\kundoku{生}{}{ノ}{}\kundoku{祖}{}{}{一}。大\kundoku{地}{}{ハ}{}\kundoku{是}{}{レ}{}其戒\kundoku{場}{}{タリ}{}。と云ふ。\par
This makes an output like the following:
kanbun_test picture
As you see, you need the sfkanbun.sty package to use this keyword.

Kanbun formatting is complicated, but it can be analyzed into four elements:

  1. kanji
  2. furigana => <rubi> tag
  3. okurigana => superscript
  4. kaeriten => subscript
You can understand the order with the following image:
kanbun_order picture

That is: \kundoku{1}{2}{3}{4}.

There is one important rule when you use the <rubi> tag in the <kanbun> environment. If the furigana applies to only one kanji, there is no problem, but if it applies to more than one kanji, you must separate each part of the furigana that applies to one kanji by a "・", that is a nakaguro. Example:

<kanbun><rubi="な・ら・えん・てん">那羅延天</rubi>...</kanbun>
This will be "translated" to the following TeX code:
\kundoku{那}{な}{}{}\kundoku{羅}{ら}{}{}\kundoku{延}{えん}{}{}\kundoku{天}{てん}{ハ}{}...
As you see, the separator "・" must divide the furigana into exactly as many parts as there are kanji (so there is one separator fewer than kanji) -- if you omit the "・", this would generate the following code:
\kana{那羅延}{ならえんて}\kundoku{天}{ん}{ハ}{}...
which is not logical...
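
The nakaguro rule can be sketched as follows. This is only a Python illustration of the splitting logic for the multi-kanji case, not the code of to_TeX.pl; the function name is hypothetical, and the okurigana and kaeriten slots (which the real script fills from superscript and subscript styles) are simply left empty here.

```python
NAKAGURO = "・"


def rubi_to_kundoku(reading: str, kanji: str) -> str:
    """Pair each kanji with its own part of the furigana.

    `reading` is the furigana with one part per kanji, separated by "・".
    The result uses \\kundoku{kanji}{furigana}{okurigana}{kaeriten},
    with the last two arguments left empty in this sketch.
    """
    parts = reading.split(NAKAGURO)
    if len(parts) != len(kanji):
        raise ValueError(
            "furigana has %d part(s) but there are %d kanji"
            % (len(parts), len(kanji))
        )
    return "".join(
        "\\kundoku{%s}{%s}{}{}" % (k, r) for k, r in zip(kanji, parts)
    )


if __name__ == "__main__":
    print(rubi_to_kundoku("な・ら・えん・てん", "那羅延天"))
```

Raising an error when the counts differ is a design choice of this sketch; it makes a forgotten "・" visible instead of silently producing illogical code.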

Related to kanbun, I added a small feature in version 0.71: in the <kanbun> environment, a sequence "云々" or "云云" will be rendered as shown in the unnun_yoko picture (horizontal mode) or in the unnun_tate picture (vertical mode) (thanks to Kino-san!).


You can use the following two kinds of html entities to get the output of special characters:


The following tagged keywords will have special effects in TeX:

<page_ref_label="xxx">
This is to be used with the next tagged keyword <label="xxx">
<label="xxx">
These two are paired tags. You will use them to make a cross-reference in your text. Insert the tag <label="xxx"> at the place to be referred to, and the tag <page_ref_label="xxx"> at the place that refers to it. The label name xxx must be the same in both tags of a pair. Example:
At the place to be referred to, you will write:
... We can say that the Meiji Restoration destroyed the traditional Japanese religious system<label="MeijiRestauration">...
Then, in a later passage, you will refer to the page where you wrote about the destruction of the traditional Japanese religious system:
... As we have written above (see p. <page_ref_label="MeijiRestauration">), the Meiji Restoration caused the destruction of the traditional Japanese religious system...

<fn_ref="yyy">
This is to be used with the next tagged keyword <fn_label="yyy">
<fn_label="yyy">
These two are paired tags. You will use them to make a cross-reference in your text to a footnote (or endnote). Insert the tag <fn_label="yyy"> in the footnote (endnote) to be referred to, and the tag <fn_ref="yyy"> at the place that refers to that footnote (endnote). The label name yyy must be the same in both tags of a pair. Example:
In a footnote, you have this text:
<fn>On the Meiji Restoration, see Yasumaru Yoshio, Kamigami no meiji ishin, Tokyo, 2006<fn_label="Yasumaru">.</fn>
And in a later passage, you refer to that footnote:
As Yasumaru showed in his book (see n. <fn_ref="Yasumaru">), the Meiji Restoration in Japanese history is...
I learned the TeX code for this cross-referencing system for footnotes from Kino-san.

You must compile the TeX code twice to get the cross-references resolved.
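
For the page cross-references, the conversion could look like the following Python sketch. It is not the code of to_TeX.pl; the function name is hypothetical, and the target commands \label{...} and \pageref{...} are an assumption (they are the standard LaTeX commands for this purpose). The footnote cross-references use special TeX code that is not reproduced here.

```python
import re


def convert_page_references(text: str) -> str:
    """Turn <label="xxx"> and <page_ref_label="xxx"> into LaTeX commands."""
    # <page_ref_label="xxx"> -> \pageref{xxx} (assumed target command)
    text = re.sub(
        r'<page_ref_label="([^"]*)">',
        lambda m: "\\pageref{%s}" % m.group(1),
        text,
    )
    # <label="xxx"> -> \label{xxx} (assumed target command)
    text = re.sub(
        r'<label="([^"]*)">',
        lambda m: "\\label{%s}" % m.group(1),
        text,
    )
    return text


if __name__ == "__main__":
    print(convert_page_references('system<label="MeijiRestauration">...'))
    print(convert_page_references('(see p. <page_ref_label="MeijiRestauration">)'))
```

With \label and \pageref, the first compilation records the page numbers and the second one inserts them, which is why two runs are needed.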


The following keywords will be "translated" to simple TeX commands:

<title>xxxx</title>
This will be "translated" to TeX code: \title{xxx}, inserted before the body of the main text.
<author>xxxx</author>
This will be "translated" to TeX code: \author{xxx}, inserted before the body of the main text.
If you have both a <title> tag and an <author> tag, the script will insert before the body of the main text the TeX code % \maketitle; please uncomment the line yourself to make the command effective.

<section>xxxx</section>
This will be "translated" to TeX code: \section{xxx}
<subsection>xxxx</subsection>
This will be "translated" to TeX code: \subsection{xxx}

The next keyword must be used in combination with <section>xxxx</section> and <subsection>xxxx</subsection>:

<TeX-command="makeTOC">
This will be "translated" to the TeX command \tableofcontents, inserted before the body of the main text. -- This will insert a table of contents according to the <section> and <subsection> keywords. -- You will have to compile the TeX code twice to get the effect of this command.
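
The handling of these structural keywords can be summarized in a small Python sketch. It is not the code of to_TeX.pl, only an illustration of the behaviour described above; the function name and the returned (front matter, body) pair are hypothetical.

```python
import re

TOC_TAG = '<TeX-command="makeTOC">'


def convert_structure(text: str):
    """Return (front_matter_lines, body).

    <title>/<author> and the makeTOC keyword are moved to the front
    matter; <section>/<subsection> are converted in place.
    """
    front = []
    for tag in ("title", "author"):
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.DOTALL)
        if match:
            front.append("\\%s{%s}" % (tag, match.group(1)))
            text = text[:match.start()] + text[match.end():]
    if len(front) == 2:
        # Left commented out, as described above; the user uncomments it.
        front.append("% \\maketitle")
    if TOC_TAG in text:
        front.append("\\tableofcontents")
        text = text.replace(TOC_TAG, "")
    body = re.sub(
        r"<(section|subsection)>(.*?)</\1>",
        lambda m: "\\%s{%s}" % (m.group(1), m.group(2)),
        text,
        flags=re.DOTALL,
    )
    return front, body


if __name__ == "__main__":
    front, body = convert_structure(
        "<title>T</title><author>A</author><section>One</section>text"
    )
    print("\n".join(front))
    print(body)
```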

Finally, for people who have installed the otfcjk package (see above), I added the next keywords:

<Tex_command=otfcjk:"Traditional Chinese/Simplified Chinese/Korean begins">
You will have to insert this command, in a separate line, just before the part of the text which will be in Traditional Chinese, or Simplified Chinese, or Korean.

<Tex_command=otfcjk:"Traditional Chinese/Simplified Chinese/Korean ends">
You will have to insert this command, in a separate line, just after the part of the text which is in Traditional Chinese, or Simplified Chinese, or Korean.

<TeX_commented_out>xxx</TeX_commented_out>
TeX has seven plus one special characters which must be escaped. These are $, %, {, }, &, _, # and \. If you type these characters in your rtf documents, they will be properly escaped in the TeX output. But if you need to use the character % to comment out some TeX code in your text, you cannot use this character. Instead, please use the tag <TeX_commented_out>xxx</TeX_commented_out> to comment out TeX code in your text.

Similarly,
<U+003C> and <U+003E>
You cannot freely use the characters "<" or ">" in your text. They may be used only for the tags described here, or for the html link tag <a href="xxx">xxx</a>, the only html tag allowed. If you want to use "<" or ">" in your text, please use the following tags instead:
<U+003C> and
<U+003E>
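
The escaping of the TeX special characters and the role of the <TeX_commented_out> tag can be sketched in Python as follows. This is not the code of to_TeX.pl; the function names are hypothetical, and the exact replacements (in particular \textbackslash{} for the backslash) are assumptions.

```python
import re

# Assumed replacements for the eight characters listed above.
TEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "$": r"\$", "%": r"\%", "{": r"\{", "}": r"\}",
    "&": r"\&", "_": r"\_", "#": r"\#",
}
SPECIAL_RE = re.compile(r"[\\$%{}&_#]")
COMMENT_RE = re.compile(
    r"<TeX_commented_out>(.*?)</TeX_commented_out>", re.DOTALL
)


def escape_tex(text: str) -> str:
    """Escape the special characters in a single pass, so that the braces
    of \\textbackslash{} are not escaped a second time."""
    return SPECIAL_RE.sub(lambda m: TEX_ESCAPES[m.group(0)], text)


def convert(text: str) -> str:
    """Escape ordinary text; turn <TeX_commented_out> spans into % comments."""
    out = []
    pos = 0
    for match in COMMENT_RE.finditer(text):
        out.append(escape_tex(text[pos:match.start()]))
        # The tagged content is left unescaped and prefixed with "% "
        # on every line.
        out.append("% " + match.group(1).replace("\n", "\n% "))
        pos = match.end()
    out.append(escape_tex(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(convert("50% of $x_1<TeX_commented_out>\\newpage</TeX_commented_out>"))
```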


I added also some little features.


How to install and how to use...

The package will contain two folders: NWE_rtf2TeX and Cocoa_rtf2TeX.

NWE_rtf2TeX contains:

- NWE_rtf2utf8.pl
- to_shiftjis.pl
- to_TeX.pl

These are the basic Perl scripts. They must be in the same folder. You can use them in the following way:

  1. Open Terminal and change the directory to the folder containing the three scripts using cd command;
  2. Write the following command in Terminal
    perl NWE_rtf2utf8.pl "[path of the Nisus Writer Express rtf file you want to convert to TeX]"
  3. This will generate three files in the same folder as the original file. One will be named "[file_name]_utf8.txt", another "[file_name]_sjis.txt", and the third "[file_name].tex". This final file will open in the default TeX editor (in my case, TeXShop.app). If files with these names already exist, the generated files will have a number added: for example, "[file_name]_utf8_1.txt", "[file_name]_sjis_1.txt" and "[file_name]_1.tex", or "[file_name]_utf8_2.txt", etc.
  4. You can try to compile this TeX file right away -- but I recommend examining it carefully first, to see if it is correct. Rtf code is complicated, and my scripts often fail to generate correct TeX code... You will perhaps have to correct it by hand, but TeX error messages are usually helpful.
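
The naming scheme of step 3 can be sketched in Python as follows. This is only an illustration, not the code of the Perl scripts; the function name is hypothetical, and it assumes that the three generated files always share the same number, as the examples in step 3 suggest.

```python
from pathlib import Path


def output_paths(rtf_path):
    """Return the (utf8, sjis, tex) output paths for an rtf file.

    If any of the three names is already taken, try the suffixes
    _1, _2, ... until all three names are free.
    """
    src = Path(rtf_path)
    folder, stem = src.parent, src.stem
    n = 0
    while True:
        suffix = "" if n == 0 else "_%d" % n
        candidates = (
            folder / ("%s_utf8%s.txt" % (stem, suffix)),
            folder / ("%s_sjis%s.txt" % (stem, suffix)),
            folder / ("%s%s.tex" % (stem, suffix)),
        )
        if not any(path.exists() for path in candidates):
            return candidates
        n += 1


if __name__ == "__main__":
    for path in output_paths("my_article.rtf"):
        print(path)
```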

But of course, you can avoid using Terminal. These three scripts are provided in this form mainly so that you can open them and examine them yourself.
Instead of using Terminal, you will probably prefer to use an AppleScript droplet or a Nisus Writer Express macro...

NWE_rtf2TeX.app
This droplet contains the three Perl scripts in its Resources folder.

You can simply drag and drop Nisus Writer Express rtf files on it, and make it generate the same three kinds of files, that is, the utf-8 text file, the sjis text file, and the TeX file.

NWE_rtf2TeX.scpt
This is an AppleScript script which will work as a Nisus Writer Express macro. You will have to put it in your Nisus Writer macro folder:
/Users/[your_account]/Library/Application Support/Nisus Writer/Macros/

macro_helper: a folder
The above NWE_rtf2TeX.scpt cannot work alone; this folder, macro_helper, contains a version of the three Perl scripts a little different from those which are at the top level of the folder. You will have to put this folder in:
/Users/[your_account]/Library/Application Support/Nisus Writer/
The Nisus Writer Express macro NWE_rtf2TeX.scpt calls these Perl scripts.

When you have installed these files at the right places, you can open Nisus Writer Express, edit your rtf document, save it when you have finished your editing, and choose the macro NWE_rtf2TeX from the Macro menu. This will generate the three kinds of files, as indicated above, and will open the generated TeX file in your TeX editor.

Examples: a folder
This folder contains four files:
  • basic_example.rtf, and
  • basic_example_result.pdf
You can examine both files, one with Nisus Writer Express, and the other with some pdf viewer. You can try to convert the file basic_example.rtf to a TeX file using the droplet or the Nisus Writer Express macro as explained above (I added "_result" to the name of the example pdf file, in order to avoid overwriting it with the resulting file). The TeX file should compile without any special package in your TeX installation (the only package needed is endnotesj.sty, which comes with Ogawa's distribution).

The same folder contains also:

  • special_features_example.rtf, and
  • special_features_example_result.pdf
You can examine both files, one with Nisus Writer Express, and the other with some pdf viewer, and try to convert the file special_features_example.rtf to a TeX file using the droplet or the Nisus Writer Express macro. The TeX file requires some special packages: the furikana package for <rubi>, sfkanbun for kanbun typesetting, the mojikyo package for Mojikyo characters, the otf package for Unicode kanji characters, and the otfcjk package for Simplified Chinese rendering. -- Mojikyo characters are disabled in the rtf file, because of a copyright problem.


Cocoa_rtf2TeX contains:

- Cocoa_rtf2utf8.pl
- to_shiftjis.pl
- to_TeX.pl

These are the basic Perl scripts. Only the first is different from NWE_rtf2utf8.pl; the two others are the same. They must be in the same folder. You can use them in the following way:

  1. Open Terminal and change the directory to the folder containing the three scripts using cd command;
  2. Write the following command in Terminal
    perl Cocoa_rtf2utf8.pl "[path of the Cocoa rtf file you want to convert to TeX]"
  3. This will generate three files in the same folder as the original file. One will be named "[file_name]_utf8.txt", another "[file_name]_sjis.txt", and the third "[file_name].tex". This final file will open in the default TeX editor (in my case, TeXShop.app). If files with these names already exist, the generated files will have a number added: for example, "[file_name]_utf8_1.txt", "[file_name]_sjis_1.txt" and "[file_name]_1.tex", or "[file_name]_utf8_2.txt", etc.
  4. You can try to compile this TeX file right away -- but I recommend examining it carefully first, to see if it is correct. Rtf code is complicated, and my scripts often fail to generate correct TeX code... You will perhaps have to correct it by hand, but TeX error messages are usually helpful.

But you can avoid using Terminal. These three scripts are provided in this form mainly so that you can open them and examine them yourself.
Instead of using Terminal, you will probably prefer to use an AppleScript droplet or a Jedit X macro...

Cocoa_rtf2TeX.app
This droplet contains the three Perl scripts in its Resources folder.

You can simply drag and drop Cocoa rtf files on it, and make it generate the same three kinds of files, that is, the utf-8 text file, the sjis text file, and the TeX file.

JeditX_rtf2TeX.scpt
This is an AppleScript script which will work as a Jedit X macro. You will have to save it as Jedit X macro, choosing Macro > Show Scripts Window, and pressing the "+" button.

script_helper
The above JeditX_rtf2TeX.scpt cannot work alone; this folder, script_helper, contains a version of the three Perl scripts a little different from those which are at the top level of the folder. You will have to put this folder in:
/Users/[your_account]/Library/Application Support/Jedit X/
The Jedit X macro JeditX_rtf2TeX.scpt calls these Perl scripts.

When you have installed these files at the right places, you can open Jedit X, edit your rtf document, save it when you have finished your editing, and choose the macro JeditX_rtf2TeX from the Macro menu. This will generate the three kinds of files, as indicated above, and will open the generated TeX file in your TeX editor.

Examples: a folder
This folder contains four files:
  • basic_example.rtf, and
  • basic_example_result.pdf

The same folder contains also:

  • special_features_example.rtf, and
  • special_features_example_result.pdf
These four files being the same as those which are in the folder NWE_rtf2TeX, I ask the users to refer to the above explanation.


The package will contain also:

  • ReadMe.rtfd -- this file
  • Quick_Reference.txt -- a little file listing the tagged keywords to be used with this package.
  • Tex_yoko2tate.pl -- a little Perl script that may be useful when you want to change the writing direction of a Japanese TeX file from horizontal to vertical: it will replace some characters which have no vertical shapes with those characters which have vertical shapes (e.g. "<" with "〈" and ">" with "〉", etc.).
    (New in the version 0.71:)
    1. The repetition mark ku no ji ten (くの字点) that you will have entered with the substitution mark "/\" or "/″\" will be replaced with the correct forms:
      kunojiten1 picture and kunojiten2 picture
    2. The "FULLWIDTH HYPHEN-MINUS" or "ダーシ" (-) will be rendered as a vertical rule: |.
  • Tex_yoko2tate.app -- AppleScript droplet which works as a droplet interface for the Perl script of the same name.
  • Example files.

New in the version 0.71:

  • to_html -- a folder containing:
    • to_html.pl -- Perl script which converts the utf-8 text file generated by NWE_rtf2utf8.pl or Cocoa_rtf2utf8.pl into an html file.
    • to_html.app -- AppleScript droplet which works as a droplet interface for the Perl script of the same name.
    • basic_example_utf8.txt (same as the file generated by NWE_rtf2utf8.pl or Cocoa_rtf2utf8.pl) and basic_example_utf8.html
    • special_features_example_utf8.txt (same as the file generated by NWE_rtf2utf8.pl or Cocoa_rtf2utf8.pl) and special_features_example_utf8.html
  • sfkanbun2kanbun -- a folder containing:
    • sfkanbun2kanbun.pl -- Perl script which will convert kanbun TeX files using sfkanbun.sty into kanbun TeX files using kanbun.sty v. 1.1.
    • sfkanbun2kanbun.app -- AppleScript droplet which works as a droplet interface for the Perl script of the same name.
    • Example files.


Download

Please download the package from this link (646K to download).

I would appreciate any feedback, comments, bug reports or requests.

Thank you!



This page was last built with Frontier on a Macintosh on Sun, Oct 29, 2006 at 3:39:50 PM. Thanks for checking it out! Nobumi Iyanaga