Part of Nobumi Iyanaga's website. n-iyanag@ppp.bekkoame.ne.jp. 10/18/05.


Convert mbox files to UTF-8 files

v. 0.7 October 15, 2005
New version compatible with Tiger's Mail.app mail file format

I wrote this script set because of a limitation of the search function of Mail.app (v. 1.3.9 running on Mac OS 10.3.5 -- this was the version when I wrote this page the first time, in September 2004; now, Mail.app's version is 2.0.3 and OS version is OS X 10.4.2): it is unable to do a complete/accurate search of Japanese (and certainly other "complicated" character sets) words in "Entire Message" search (this situation has not changed even with the introduction of Spotlight, new with Tiger...!!). But the conversion of mbox files to UTF-8 files may be useful for other purposes as well: it generates mail files that are readable for human beings in (hopefully) every case.

This new version (0.7) will be compatible with the new file format of mail files in Mail.app v. 2.0 and later -- but I will continue to provide the scripts for previous versions of Mail.app. On the other hand, I upgraded all the scripts: they are now application bundles, including the core Perl script(s) in them (in Contents/Resources/ folder), so that you will not have to install the Perl scripts in ~/bin/ folder.

It will be good to describe here the differences between the old mbox file format and the new .emlx file format.

By the way, if you had a mail box in the older version format, and you have upgraded to the new 2.x version, the old mbox file will be preserved in xxx.mbox folder.

And it is good to know that there is a utility which converts new .emlx files to older mbox files: see emlx2mbox.

The task of converting these files to UTF-8 was not easy at all: the mbox files (and .emlx files) are very complicated files. Each mail may be written in some character set, encoded in several possible formats (quoted-printable, base64, etc.), and each mail may contain several "parts" in plain text, in html, attachment in base64, etc.

In my effort to get as much "readable" text as possible, I tried to decode not only the character sets and transfer-encodings but also html entities, and I decided to delete the attachments.
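The decoding steps described above can be illustrated with a minimal sketch. It is written in Python with the standard `email` module, not in the Perl used by the actual scripts, and it does not claim to reproduce their logic; the function name `readable_text` and the crude tag stripping are assumptions made for illustration only.

```python
import html
import re
from email import message_from_bytes


def readable_text(raw: bytes) -> str:
    """Return the readable text of one raw mail as a Python string."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_disposition() == "attachment":
            continue  # attachments are dropped
        if part.get_content_maintype() != "text":
            continue  # images and other binary parts are dropped too
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a lossless byte mapping.
            text = payload.decode("latin-1")
        if part.get_content_subtype() == "html":
            text = re.sub(r"<[^>]+>", "", text)  # crude tag stripping (assumption)
            text = html.unescape(text)           # &amp; -> &, and so on
        chunks.append(text)
    return "\n".join(chunks)
```

Writing the returned string to a file opened with `encoding="utf-8"` would give the UTF-8 output; real mails have many more irregular cases than this sketch handles.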

If the mbox file is not VERY big -- or, in the new format, if a Messages folder does not contain VERY many files -- it is good to convert it to one text file. But if it is VERY large, the converted file will also be very big and not easy to handle (editors like TextEdit would take a long time to open it, etc.). In such cases, it is handier to generate one file for each mail. This makes a great many files, but each file can be opened easily.

To search in these files, a utility like my unix_grep can be used (see my other page Unix grep for OS X and Nisus Writer Express). For searching in big single files, I wrote a simple utility named "mail_grep_in_utf8.app" using Platypus 2.5, containing a utility named CocoaDialog 1.1.3 and a Perl script (these are in the "mail_grep_in_utf8.app" package, in mail_grep_in_utf8.app/Contents/Resources/CocoaDialog.app/ and unix_grep.app/Contents/Resources/script). I had to use these utilities because AppleScript in pre-Tiger OSs was unable to accept Unicode characters as input in dialogs. From OS 10.4 onward, AppleScript can be used for these tasks, so I added an AppleScript droplet named mail_grep_in_utf8_Tiger.app to do the same thing in Tiger. Both utilities are slow, but they can do an "AND search", i.e. a search using two or more key words.

Package contents

You will find two packages, one for "pre-Tiger_version", and another for "Tiger_version". Pre-Tiger_version contains three script applications and one ReadMe file; Tiger_version contains three script applications with the same functionality, plus another script application to gather multiple converted mail files into one big file, and one ReadMe file.
  1. Pre-Tiger_version
    • convert_mbox2utf8_bundle.app
    • convert_mbox2utf8_separ_mail_bundle.app
    • mail_grep_in_utf8.app
    • ReadMe.rtf -- same as this web page

  2. Tiger_version
    • convert_mbox2utf8_Tiger.app
    • convert_mbox2utf8_separ_mail_Tiger.app
    • gather_multi2one_Tiger.app
    • mail_grep_in_utf8_Tiger.app
    • ReadMe.rtf -- same as this web page

The two packages have the same functionality. convert_mbox2utf8... makes one big UTF-8 text file containing the text of all the mails of an mbox file, or of the many .emlx files in a Messages folder; convert_mbox2utf8_separ_mail... makes a folder full of UTF-8 text files, each of them being the converted text of one mail.

The little application "mail_grep_in_utf8..." can be used to search in large single files which are the results of conversion done by convert_mbox2utf8.app.

You can put these files anywhere you want (perhaps your /Applications/Utilities folder?).

How to have access to mbox files (for pre-Tiger version)

In OS versions earlier than 10.4.x, "mbox" files are the files in which all your mails are stored. They are invisible files inside "packages" having the extension ".mbox".

For example, I have a file named "Down.mbox" at:

/Users/[my_account]/Library/Mail/Mailboxes/MacScripting/Down.mbox

This is in fact not a file, but a "package", containing several files. To open the package, click it while pressing the Control key: this gives you access to the Context menu, in which you will find the menu-item "Show Package Contents". Select this menu-item, and a new folder will open, showing the contents of the package. You will find there a file named simply "mbox" (with a blank icon): this is the mbox file containing all the mail data. The full path is:

/Users/[my_account]/Library/Mail/Mailboxes/MacScripting/Down.mbox/mbox

Now, drag this file while pressing the Option key to the Desktop: this will make a copy of that file on the Desktop (don't move the file itself! Mail.app will be unable to find it). When you have your mbox file on the Desktop, you can close the package folder, and all other folders. Since all the mbox files are simply named "mbox", it is a good idea to rename the copied file with a more informative name, for example "MacScripting_Down_mbox"; but please keep the name "mbox" at the end of the new name.
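For readers who prefer a script to drag-and-drop, the same copy-and-rename step might be sketched as follows in Python. The function name `copy_mbox` and the naming pattern (parent folder name, package name, then "mbox") are assumptions that merely follow the example above; they are not part of the packages.

```python
import shutil
from pathlib import Path


def copy_mbox(package: Path, dest_dir: Path) -> Path:
    """Copy <package>/mbox into dest_dir under a name that still ends in 'mbox'."""
    source = package / "mbox"
    if not source.is_file():
        raise FileNotFoundError(f"no mbox file inside {package}")
    # e.g. .../MacScripting/Down.mbox  ->  MacScripting_Down_mbox
    new_name = f"{package.parent.name}_{package.stem}_mbox"
    target = dest_dir / new_name
    # Copy, never move: Mail.app still needs the original file.
    shutil.copy2(source, target)
    return target
```

For example, `copy_mbox(Path.home() / "Library/Mail/Mailboxes/MacScripting/Down.mbox", Path.home() / "Desktop")` would leave the original in place and put a copy on the Desktop.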

How to convert an mbox file

1-a. To generate one converted file (pre-Tiger version):

Drag and drop an mbox file onto the icon of the AppleScript droplet named "convert_mbox2utf8_bundle.app".
If the dropped file's name does not end with "mbox", the droplet displays a dialog saying "Please drag-&-drop a mbox file..." and stops.
If it is an mbox file, a file choosing dialog will ask you to select a folder in which the new file -- the converted file [i.e. output file] -- will be created. You will enter the file name in the same dialog. The file must have the extension ".txt".
A final dialog will ask you to choose either "From:" or "To:" field which will be used in the index file.
The index file will be generated in the same folder as the output file, with the name of that file plus "-mail_index.txt": for example, if your output file is named "MacScripting_Down.txt", the index file will be named "MacScripting_Down-mail_index.txt". Each line of the index file will contain information on a mail that was converted:

year-month-day_hour_minute_second_[name from "From:" field or "To:" field]
For example:
Processing: 2004-6-11-10_46_17_Iyanaga_Nobumi at Thu Sep 9 00:30:54 2004
which means a mail from [or to] Iyanaga Nobumi dated June 11, 2004 at 10:46:17 (the time after "at" is when the mail was processed).
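The naming rules above can be made concrete with a small Python sketch. The helper names `index_path` and `index_entry` are hypothetical, and the padding rules (no leading zeros for month, day and hour; two digits for minutes and seconds) are guessed from the single example line, so treat them as assumptions.

```python
from datetime import datetime
from pathlib import Path


def index_path(output_file: Path) -> Path:
    """MacScripting_Down.txt -> MacScripting_Down-mail_index.txt (same folder)."""
    return output_file.with_name(output_file.stem + "-mail_index.txt")


def index_entry(sent: datetime, name: str) -> str:
    """Build the date-and-name part of one index line."""
    # Padding is an assumption inferred from "2004-6-11-10_46_17".
    stamp = (
        f"{sent.year}-{sent.month}-{sent.day}-"
        f"{sent.hour}_{sent.minute:02d}_{sent.second:02d}"
    )
    return f"{stamp}_{name.replace(' ', '_')}"
```

With these assumptions, `index_entry(datetime(2004, 6, 11, 10, 46, 17), "Iyanaga Nobumi")` gives the stamp shown in the example line above, without the "Processing:" prefix and the processing time.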

1-b. To generate one converted file (Tiger version):

Drag and drop a "Messages" folder onto the icon of the AppleScript droplet named convert_mbox2utf8_Tiger.app. The rest of the process is the same as above.

Unfortunately, the file generated by this script has an important problem: as I indicated above, the numbers used for the file names of .emlx files do not seem to correspond to any apparent order (other than a very rough chronological order). I think there will be no problem if you have the habit of distributing your received mails into different mail boxes immediately after you receive them. I, for one, have a different habit: I keep my daily received mails in my Inbox mail box for a while, and after some time, using a script, I distribute them all at once into different mail boxes. In this latter case, at least, the numbers in .emlx file names do not correspond to the precise chronological order. When converting these .emlx files into one big file, as the script reads the files in order of the file names, the order of mails in the converted file becomes almost random, and this is very inconvenient when you try to read the converted file.

To avoid this problem, the only solution that I could find is to do the operation in two steps. First, I convert mail files into multiple converted files, using convert_mbox2utf8_separ_mail_Tiger.app; this script names each file with the date-hour-minute of the original mail (see below). When I have a folder full of converted files, I can gather these files into one big file, in which mail texts will be written in the chronological order. For this step, I wrote another script application named gather_multi2one_Tiger.app.

Drag and drop a folder created with convert_mbox2utf8_separ_mail_Tiger.app, containing multiple converted email files, onto the icon of gather_multi2one_Tiger.app. A folder choosing dialog will ask you to select a folder in which you will create a new folder. This will be the "parent folder".
Another dialog will ask you to enter the name of the new folder in which the script will gather the multiple files into one big mail file.
A final dialog will ask you to enter the name of the final big file.

After some time, the script application will quit, having generated a new folder in which you will find the gathered big file, with an Index file.
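The idea behind the gathering step can be sketched in Python. This is only an illustration, not the logic of gather_multi2one_Tiger.app: it assumes that each converted file name starts with a stamp like "2004-6-11-10_46_17" (as in the index example), that the files end in ".txt", and that a line of "=" characters is an acceptable separator. The stamp is compared as numbers, because a plain alphabetical sort would put "2004-10-..." before "2004-6-...".

```python
import re
from pathlib import Path

# year-month-day-hour_minute_second at the start of the file name (assumed layout)
STAMP = re.compile(r"^(\d+)-(\d+)-(\d+)-(\d+)_(\d+)_(\d+)")


def stamp_of(path: Path):
    """Return the stamp as a tuple of ints, or None if the name has no stamp."""
    m = STAMP.match(path.name)
    return tuple(int(g) for g in m.groups()) if m else None


def gather(folder: Path, big_file: Path, separator: str = "=" * 40) -> int:
    """Concatenate the stamped files of folder into big_file, oldest first."""
    stamped = [(stamp_of(p), p.name, p) for p in folder.glob("*.txt")]
    # Files without a stamp (for example an index file) are skipped.
    ordered = sorted(item for item in stamped if item[0] is not None)
    with big_file.open("w", encoding="utf-8") as out:
        for _, _, path in ordered:
            out.write(path.read_text(encoding="utf-8"))
            out.write("\n" + separator + "\n")
    return len(ordered)
```

The output file should live outside the source folder, as it does in the procedure described above, so that it is not picked up as an input on a later run.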

2-a. To generate a folder full of converted files, each one representing one mail (pre-Tiger version):

Drag and drop an mbox file onto the icon of the AppleScript droplet named "convert_mbox2utf8_separ_mail_bundle.app".
If the dropped file's name does not end with "mbox", the droplet displays a dialog saying "Please drag-&-drop a mbox file..." and stops.
If it is an mbox file, a folder choosing dialog will ask you to select a folder in which a new folder will be created -- the one in which the script will generate the converted mail files. The folder you select will be the "parent folder".
Another dialog will ask you to enter the name of the new folder in which the script will generate the converted mail files.
A final dialog will ask you to choose either "From:" or "To:" field which will be used in the index file.
The index file will be generated in the same folder as the converted files, and will be named simply "-mail_index.txt". Each line of the index file will contain information on a mail that was converted:

year-month-day_hour_minute_second_[name from "From:" field or "To:" field]
For example:
Processing: 2004-6-11-10_46_17_Iyanaga_Nobumi at Thu Sep 9 00:30:54 2004
which means a mail from [or to] Iyanaga Nobumi dated June 11, 2004 at 10:46:17 (the time after "at" is when the mail was processed).

2-b. To generate a folder full of converted files, each one representing one mail (Tiger version):

Drag and drop a "Messages" folder onto the icon of the AppleScript droplet named convert_mbox2utf8_separ_mail_Tiger.app. The rest of the process is the same as above.

How to search in a single big converted file

To search for a word in a folder full of mail files in UTF-8, you can use a utility like my unix_grep (I am in the process of updating this utility also [as of October 2005]). For a single mail file, the best way is to open it with an editor, such as TextEdit or SubEthaEdit, and use the Find dialog. But if the file is VERY big, it may take time to open it in an editor; in such cases, my utility "mail_grep_in_utf8.app" (or, if you use OS 10.4.x or later, the utility called "mail_grep_in_utf8_Tiger.app"), included in this package, may be useful.

To activate mail_grep_in_utf8.app, you will simply double-click on its icon.

A dialog will ask you to enter the words to be searched. You can enter a single word, or several words delimited by "," (a comma). If you enter more than one word delimited by a comma, the script will search for each of them in each of the mails (an "AND search"). For example, you can search for mails sent from one of your friends, containing one word (in that case, it is probably better to use your friend's email address as one of the key words). -- Caveat: this means that you cannot use a word containing a comma as a search word. Also, if you enter many words in this dialog, the search will be much slower, since the script repeats the search for each word. Note that you cannot use "Paste" in this dialog: this is due to a limitation of CocoaDialog.app (this is not the case for "mail_grep_in_utf8_Tiger.app").
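The comma-delimited "AND search" can be sketched in a few lines of Python. This is not the Perl script inside the application; the function names are hypothetical, and the sketch assumes that the mails in the converted file are separated by a line made only of "=" characters, which this page states only for the result file.

```python
import re


def split_mails(text: str) -> list[str]:
    """Split a big converted file into mails at lines of six or more '=' (assumed)."""
    parts = re.split(r"(?m)^={6,}[ \t]*$", text)
    return [p.strip() for p in parts if p.strip()]


def and_search(text: str, query: str) -> list[str]:
    """Return the mails that contain every comma-delimited key word of query."""
    words = [w.strip() for w in query.split(",") if w.strip()]
    if not words:
        return []
    return [mail for mail in split_mails(text) if all(w in mail for w in words)]
```

Because the comma is the delimiter, a key word that itself contains a comma cannot be expressed, which is the caveat mentioned above.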

When you have entered your word(s) to be searched, a file selecting dialog will ask you to select the file to be searched. Select a mail file which has been converted from an mbox file, using convert_mbox2utf8_bundle.app or convert_mbox2utf8_Tiger.app (note that you can select only a file having the extension ".txt").

To activate mail_grep_in_utf8_Tiger.app, you will drag and drop a converted mail text file with the extension ".txt" onto its icon; a dialog will ask you to enter the word(s) to be searched. -- Thus, as this is a droplet, the file selecting dialog is omitted. The rest of the functionality is the same as that of mail_grep_in_utf8.app.

After a while, a file will open in TextEdit: it will contain every mail containing the word(s) that you have entered in the first dialog box. Each mail will be delimited by "======..." The file is named "mail-grep_res.txt", "mail-grep_res1.txt", "mail-grep_res2.txt", etc., and will be on your Desktop.
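The numbering of the result files suggests that an existing result is never overwritten. One way such a name could be chosen is sketched below in Python; the function name `next_result_path` is hypothetical, and whether the utilities pick the name exactly this way is an assumption.

```python
from pathlib import Path


def next_result_path(folder: Path) -> Path:
    """Return the first unused name among mail-grep_res.txt, mail-grep_res1.txt, ..."""
    candidate = folder / "mail-grep_res.txt"
    n = 0
    while candidate.exists():
        n += 1
        candidate = folder / f"mail-grep_res{n}.txt"
    return candidate
```

Calling `next_result_path(Path.home() / "Desktop")` would return a path on the Desktop that does not yet exist.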


Download

Download the package "convert_mbox_pre_tiger.zip" from here (336 KB to download).

Download the package "convert_mbox_tiger.zip" from here (132 KB to download).


All the scripts may contain many bugs. Please use these utilities with caution. In particular, if the conversion takes too long (for example more than several minutes for an mbox file of several MB), there may be an infinite loop bug. In such cases, you should probably force quit the droplet.

I would appreciate any feedback, bug report or suggestions. Thank you in advance.

Have fun!







This page was last built with Frontier on a Macintosh on Tue, Oct 18, 2005 at 11:31:22 PM. Thanks for checking it out! Nobumi Iyanaga