unix_grep on OS X

Part of Nobumi Iyanaga's website. n-iyanag@nifty.com. 9/26/07.

This page replaces the old page entitled Unix grep for OS X and
Nisus Writer Express, which is now obsolete. (2007-09-26)
Version 0.7.2 (2007-09-26)

Introduction

As I wrote elsewhere (cf. my page rtf to UTF-8 text and pTeX), Mac OS X lacks a good word-processing program -- a problem that can be solved in part by using LaTeX -- but it also lacks a good program for searching the text contents of files. The best search program for Classic Mac OS, MgrepApp, still works in the Classic environment of OS X (MgrepApp shows a warning at start-up saying that it has expired, but you can get rid of it by hitting the Return key and clicking the window's close box); however, it will not work on Intel Macs. Mac OS X comes with a very powerful and fast search utility, GNU grep, but you have to use it from Terminal, and this is not easy at all. Moreover, GNU grep can only search UTF-8 text files; since many files for our studies, such as the CBETA Taisho files or the SAT Taisho files, are in either Big5-Eten or Shift_JIS, this is very inconvenient (although CBETA is beginning to distribute UTF-8 files as well; be aware that the CBETA UTF-8 files use Windows line ending characters, which may cause problems in OS X applications).
Therefore, I wrote an AppleScript droplet which converts all the files in a folder into UTF-8 files (please see my other page Batch Convert Files to UTF-8), and another AppleScript droplet which can be used as an interface for GNU grep. Used in combination with TextWrangler and/or Jedit X, this latter droplet can simulate to a certain extent the behavior of MgrepApp: you get a list of results; when you double-click on one item, TextWrangler or Jedit X opens the target file and selects the matched word. It is this droplet, named unix_grep.app, that I would like to present on this page. The files MUST be in UTF-8 encoding; files with Mac line ending characters cannot be searched, but those with Windows line ending characters can be used (of course, those with Unix line ending characters are preferable).
TextWrangler is a free text editor, very powerful, very fast, and very useful. Unfortunately, it cannot handle styled text.
Jedit X is a shareware text editor (2940 yen or $28); it is also very powerful; it is perhaps a little slower than TextWrangler, but it can handle styled text as well as TextEdit.
I assume in this ReadMe that we are using unix_grep.app to search for words in Taisho text files, and I will take my examples from this kind of use. To follow these examples, you should first have converted at least some folders of CBETA files to UTF-8 text files, using my other utility, "batch_conv2utf8_encoding_check.app". -- But of course, you can use unix_grep.app for any other UTF-8 text files or folders containing UTF-8 text files.
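As a rough illustration of the two requirements stated above (UTF-8 encoding, and no classic Mac line endings), here is a minimal Python sketch that inspects one file. It is not part of unix_grep.app or of the conversion droplet; the function name check_text_file and its return values are made up for this example.

```python
from pathlib import Path

def check_text_file(path):
    """Return (is_utf8, line_ending) for the file at `path`.

    line_ending is "unix", "windows", "mac", "mixed" or "none".
    (Hypothetical helper, for illustration only.)
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    crlf = data.count(b"\r\n")        # Windows line endings
    cr = data.count(b"\r") - crlf     # bare CR: classic Mac line endings
    lf = data.count(b"\n") - crlf     # bare LF: Unix line endings
    kinds = [name for name, n in (("windows", crlf), ("mac", cr), ("unix", lf)) if n]
    if not kinds:
        return is_utf8, "none"
    return is_utf8, kinds[0] if len(kinds) == 1 else "mixed"
```

A file reported as (True, "unix") or (True, "windows") should be searchable; one reported as "mac", or as not UTF-8, would first need to be converted.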

Notes on the new version 0.7.1:
After I released the first version of my unix_grep.app, I asked a friend, Hamid Haji, to test it. He discovered a serious problem with Arabic text files -- in many cases, my droplet failed to find the searched words, and often it crashed. I tried to find the culprit, and discovered that AppleScript's choose from list command is unable to handle long strings of Arabic text. Moreover, AppleScript's droplet mechanism cannot deal correctly with Arabic file names. I think the same is true for other languages (scripts) using ligatures, e.g. Hebrew or Devanagari -- and perhaps other languages/scripts.
To avoid the problem with AppleScript's choose from list, I wrote a new version in which I added a new option, special_scripts. If you set this option to 1, the list selecting window will be skipped, and the result of the search will be opened right away in your default application. -- This option can be used for other languages/scripts as well, if you don't need the list selecting window. It is certainly faster, and more robust than when using the list selecting window. So, if you think that the search result will be very large, it will probably be better to use this option.
As to the problem of file names in Arabic or other scripts using ligatures, the only way to avoid it is either to rename the files with Roman names, or to put the files in a folder having a Roman name (and to set the option recursive to 1 if the files are in sub-folders [I think the sub-folders can have Arabic names...]).
The new version is improved in other parts also: it can now accept theoretically any number of files or folders (so that perhaps the save symlink mode is a little less useful in this version [see below]).
I changed also the format of the result file, in which the first line will summarize the search result.
Finally, I changed the two scripts for TextWrangler and Jedit X, named open_file_fromGrepRes.scpt, so that now, it will not only open the target file and select the target line, but select the target word.
I rewrote the following documentation to fit with the new version.
End of notes for version 0.7.1.
Notes on the new version 0.7.2:
I fixed one bug in the interface: once you had entered "*" in the ext field (standing for "all files"), it was impossible to get rid of it. This bug, reported by John McRae, has been fixed.
End of notes for version 0.7.2.

Requirements, Contents of the package and How to install:

Requirements:

OS X 10.4 and later
TextWrangler
Jedit X -- This is optional (the demo version, working for one month, can be downloaded free of charge...)
a bunch of folders containing UTF-8 text files

Contents:
When expanded, the package that you will download from this page (see at the bottom of the page) will contain:

unix_grep_AppleScript/
    (Don't change this file) settings.txt
    ReadMe.rtf -- this file
    grep_symlinks_folder -- an empty folder
    unix_grep.app
    unix_grep_res.txt -- an empty UTF-8 file
    Put_in_App_script_folder/
        for_Jedit_X/
            open_file_fromGrepRes.scpt
        for_TextWrangler/
            open_file_fromGrepRes.scpt

How to install:
To use the two scripts named open_file_fromGrepRes.scpt for two applications, one for TextWrangler and the other for Jedit X, you have to copy them in their respective "Scripts" folder.
For TextWrangler, it is easy:

Locate the Scripts folder for TextWrangler:
/Users/[your_account]/Library/Application Support/TextWrangler/Scripts/
Click on the script open_file_fromGrepRes.scpt in the folder for_TextWrangler, press the Option key, and drag the script into that folder.

At this point, you should also set the default file encoding of TextWrangler to Unicode (UTF-8, no BOM), if it is not set already:
Launch TextWrangler, and choose the menu-item TextWrangler > Preferences; in the left side pane, click Text encodings. At the bottom of the window, set the popup menu under If file's encoding can't be guessed, use: to Unicode (UTF-8, no BOM).
This is all for TextWrangler.
For Jedit X:

Launch Jedit X, and select Window > Script Window or Macros > Show Script Window in Jedit X to display the Script window.
Click on the Macro Menu tab of the Script window;
Drag the script open_file_fromGrepRes.scpt in the folder for_Jedit_X from the Finder to the desired location in the Script window to save it there.
You will be asked whether you want to copy the script file or the alias file. Click on Copy, and the script file will automatically be saved in the following scripts folder:
/Users/[your_account]/Library/Application Support/Jedit X/scripts/

For Jedit X too, you should set the default file encoding to Unicode (UTF-8) at this moment:
Choose the menu-item Jedit X > Preferences; press the icon Encoding at the top of the window; set the pop-up menu under Default Encoding and Line Endings for Plain Text to Unicode (UTF-8) (and the Line Endings to Unix (lf)).

For details, you can refer to Jedit X's help: Chapter 11.2: "Script Window", and Chapter 2.4: "Encoding".
After you have installed these two scripts, you can place the folder Unix_grep_AppleScript anywhere you want (preferably on your desktop?), but you should NOT change the structure of this folder. Especially, unix_grep.app, the folder grep_symlinks_folder and the text file unix_grep_res.txt should be in the same folder.

How to use:

To see how unix_grep.app works, first, make sure that you have all the needed pieces:

TextWrangler
Jedit X (although this is optional, I recommend downloading it, if you don't have it already...)
one or a bunch of folders full of text files in UTF-8 (that you may have created with my other utility batch_convert2utf8.app [see the page Batch Convert Files to UTF-8]...) -- for example, a folder named "T01", containing all the files of volume 1 of the Taisho Canon.
-- Hereafter, I will take folders of the Taisho Canon as example.
and of course

unix_grep.app

Here are the basic steps:

Drag and drop your folder "T01" onto the icon of unix_grep.app.
A dialog will appear, asking you to enter a search word.
Enter for example "大自在" (without quotes)
You will see almost immediately a list selecting dialog, with the title:
Found 2 matches...
with the following prompt:
Choose one to open the file with the default_app... or... Press OK with no selection to save the result.
And in the list selecting window, you will see two lines: one beginning with T01n0022.txt and the other with T01n0081.txt (here, I use the CBETA files as example...).

This time, select the first item, the one beginning with T01n0022.txt, and press OK (you can do this by hitting the Down Arrow key once, then the Return key; alternatively, you can select the item with the mouse and double-click on it).

TextWrangler will launch, open the file T01n0022.txt and select the word "大自在" on line 403 (if the line contains more than one occurrence of the target word, only the FIRST one will be selected):
T01n0022_p0275b07(03)||其行平等。尊大自在。心念無畏。以一身化無數身。

The same list selecting window of unix_grep.app will appear again, at the front. This would repeat indefinitely if you did not take one of the following steps. So, to get rid of this list selecting window and return to TextWrangler, click either the Cancel button or the OK button.

If you click the Cancel button, unix_grep.app will quit without doing anything (and the result of the grep search will be discarded);
If you click the OK button, the result of the search will be written in a file, the file named unix_grep_res.txt (located in the same folder as unix_grep.app), and this file will be opened by TextWrangler.

If you don't select any line at the first list selecting window, and press the OK button (or hit the Return key), the file unix_grep_res.txt will be opened by TextWrangler, and unix_grep.app will quit.

If you press the Cancel button at the first list selecting window, whether or not a line is selected, unix_grep.app will quit, discarding the search result.

Be warned that the result of the search will be overwritten each time in the file unix_grep_res.txt, so that you should close this window each time. If you want to save the result, you will have to save it in another file.
You can use the result in the file unix_grep_res.txt to open the file and select the word of your search.

Select one line (the WHOLE line) of the result file that you want, and run the script open_file_fromGrepRes from the AppleScript menu of TextWrangler (or the Macro menu if you use Jedit X).
The target file will open, and the target word will be selected.

This is the basic use of the application.
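To make clearer what such a search does, here is a minimal sketch in Python. It is only an approximation under stated assumptions: unix_grep.app itself calls egrep, whereas this sketch uses Python's re module, and the "file:line:text" result format below imitates the output of grep -n over several files, not necessarily the exact layout of unix_grep_res.txt. The name search_folder is hypothetical.

```python
import re
from pathlib import Path

def search_folder(folder, pattern, ext="txt"):
    """Search the first-level *.ext files of `folder` for `pattern`.

    Returns a list of "file_name:line_number:line_text" strings.
    (Hypothetical helper; a stand-in for what egrep does.)
    """
    regex = re.compile(pattern)
    results = []
    for path in sorted(Path(folder).glob(f"*.{ext}")):
        # Text mode translates Windows and Unix line endings alike.
        with path.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                line = line.rstrip("\n")
                if regex.search(line):
                    results.append(f"{path.name}:{number}:{line}")
    return results
```

Calling search_folder("T01", "大自在") on a folder of converted files would return one string per matching line, which is the information the list selecting window presents.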

How to configure the settings:

To see different possible settings, please open the file named (Don't change this file) settings.txt with TextEdit or any other text editor. You will see the following default settings:

ignore_case: 0
recursive: 0
ext: txt
----------------
default_app: TextWrangler
----------------
save_symlink: 0
add_to_symlink: 0
----------------
special_scripts: 0

The file (Don't change this file) settings.txt is there only to show you the default setting of the droplet. If you don't need it, you can put it anywhere.
You will see the same list if you double-click on the icon of unix_grep.app. You can change this default setting:
Double-click on the icon of unix_grep.app: you will see a list selecting window showing the current setting. You can simply hit the Return key, without selecting any item, -- or click the Cancel button -- to not change the setting.

If you select an item and hit the Return key, a new dialog will ask you to enter the value you want for the selected item (see below for possible values for each item, and some explanation).

When you click the OK button, another dialog will ask you: Have you finished your changes? It has three buttons: Cancel, Finished and Not yet... (the default button).

If you press Not yet..., the same list selecting window will appear, asking to select one item, and this will repeat until you press Finished (or Cancel -- in which case, all the changes made will be discarded...).

If you press Finished, a new list selecting window will appear: it is simply to confirm or not the changes made. You will either press the OK button, to save the changes, or press the Cancel button to discard any changes.
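The default settings listed above are plain "name: value" entries separated by dashed lines. As a small illustration, and assuming that each entry sits on its own line, they could be read into a dictionary as follows. This is only a sketch: the page does not say how unix_grep.app stores its settings internally, and parse_settings is a hypothetical name.

```python
def parse_settings(text):
    """Turn "name: value" lines into a dict, skipping dashed separators.

    Values are kept as strings, exactly as written.
    (Hypothetical helper, for illustration only.)
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            continue  # blank line or "----------------" separator
        name, _, value = line.partition(":")
        settings[name.strip()] = value.strip()
    return settings
```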

Now, here are some words for each option:

ignore_case: 0 for a case-sensitive search, or 1 for a case-insensitive search (note that for kanji searches, ignore_case has no meaning).
recursive: 0, that is the search will be done only on the first level files in the folder dropped on unix_grep.app, or 1, that is the search will be done in all the files in nested folders in the folder dropped on the application.
ext: extension of the files to be searched. It can be for example txt, html, xml, or pl [for Perl source code files], etc., or "*". The last one, "*", means all the extensions. Note that the search will not be done if the files have no extension at all. It is *possible* to search in other kinds of files, for example "doc" files or "rtf" files, but the result will be totally garbled and meaningless. You should always specify an extension of text files in UTF-8 encoding (with preferably the Unix line ending characters).

default_app: This can be either TextWrangler or Jedit X. Jedit X will behave exactly the same way as TextWrangler, although Jedit X is slower to open large files. If you don't have Jedit X, and you set the option default_app to it, the application will quit, with a warning (but I could not test this situation...). -- It seems that Jedit X sometimes fails to open the target file. In such cases, I would recommend using TextWrangler instead...
Latest note added: -- I think I have fixed this problem...
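For readers who know Terminal, the three search options described above (ignore_case, recursive and ext) correspond naturally to GNU grep flags. The Python sketch below builds such a command line. It is one plausible mapping, given as an assumption: the actual command that unix_grep.app passes to egrep is not shown on this page, and build_grep_command is a hypothetical name. The flags themselves (-E, -n, -H, -i, -r, --include) are standard GNU grep options.

```python
from pathlib import Path

def build_grep_command(pattern, folder, ignore_case=0, recursive=0, ext="txt"):
    """Return a GNU grep argument list reflecting the three options.

    (Hypothetical helper; not the droplet's real command line.)
    """
    cmd = ["grep", "-E", "-n", "-H"]  # extended regex, line numbers, file names
    if ignore_case:
        cmd.append("-i")
    if recursive:
        # Let grep walk the nested folders, keeping only files named *.ext.
        cmd += ["-r", f"--include=*.{ext}", "-e", pattern, "--", str(folder)]
    else:
        # First-level files only: list them ourselves.
        files = sorted(str(p) for p in Path(folder).glob(f"*.{ext}") if p.is_file())
        cmd += ["-e", pattern, "--"] + files
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True). Note that with ext set to "*" the file name pattern becomes "*.*", so files with no extension at all are skipped, in line with the description of the ext option.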

The two options, save_symlink and add_to_symlink, are somewhat special, and need to be explained. I use egrep as the search engine for my application, which can perform "OR" searches.
For example, if you want to search for lines which contain "尸棄" OR "光明" in T09, you would...:

Drag & drop the folder T09 onto the icon of unix_grep.app
Type "尸棄|光明" in the dialog asking you to enter the term to search, and you will get a list of 1253 matched lines, which contain either "尸棄" or "光明", or both at the same time. It is the operator "|" which means "OR" search.
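The effect of the "|" operator can be tried out on a few strings. The sketch below uses Python's re module, whose "|" behaves like egrep's for plain words such as these; the three sample strings are invented for the illustration, and lines_matching is a hypothetical name.

```python
import re

def lines_matching(pattern, lines):
    """Return the lines in which `pattern` matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Three short sample strings, invented for this illustration.
sample = ["尸棄", "光明", "如來"]
print(lines_matching("尸棄|光明", sample))  # prints ['尸棄', '光明']
```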

But there is no corresponding "AND" operator in grep or egrep. For example, you might want to find the files which contain both "尸棄" AND "光明"; this is impossible with a single simple grep or egrep search. To achieve this goal, you first have to find the files containing (for example) "尸棄", and then, among the files found, those containing the word "光明". It is for such cases that the save_symlink option is useful.

First, you will set the option save_symlink to 1 (double-click on unix_grep.app, select the save_symlink option, enter 1, press Finished, press OK...); then
You will drag and drop the same T09 onto the icon of unix_grep.app
Type (for example) "尸棄" in the first dialog.
You will see a list window showing 10 lines matching the word "尸棄" in T09; the title of the window will display:
Save_symlink mode: Found 10 match(es) in 3 file(s)...
and the Prompt of the window will say:
Press OK to save the symlink files (existing symlink file[s] will be deleted...)
-- So, there are only 3 files in T09 in which the word "尸棄" occurs.
Hitting the Return key, you will save the symbolic linked files of the matched files in your folder grep_symlinks_folder (selecting an item in the list has no meaning in Save_symlink mode!).
Opening the grep_symlinks_folder, you will find 3 files, named T09n0262.txt, T09n0264.txt, and T09n0278.txt -- each of them having a little arrow at the lower left corner of the icon, indicating that they are symbolic linked files (a symbolic linked file is a kind of alias file used in Unix; it is very small in size [only 4 KB each]; double-clicking on its icon will open the original file linked to it).

Now, set the option save_symlink to 0;
Drag and drop the folder grep_symlinks_folder onto unix_grep.app
Enter the word "光明" in the first dialog;
You will get a list of 1085 matched lines...

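These 1085 lines all come from the 3 files that also contain "尸棄", whereas the "OR" search above returned 1253 lines from the whole of T09. In other words, the "OR" search is extensive, while the "AND" search is restrictive.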
Now, for the other option, that is add_to_symlink:
This option is meaningful only when the option save_symlink is set to 1. If the option add_to_symlink is set to 0, all the symbolic linked files that are in the folder grep_symlinks_folder will be deleted at each search in the Save_symlink mode, but if you set this option to 1, the symbolic linked files that are already in the grep_symlinks_folder will not be deleted.
This can be useful when you want to gather symbolic linked files satisfying some condition from one search session to another (with the previous version, which accepted only one folder, this was more crucial...).
For example, you have gathered in the previous example symbolic linked files containing the word "尸棄" that were in the folder T09. If you want to add to these files symbolic linked files satisfying the same condition from the folder T10, you will set the option add_to_symlink to 1, and drop the folder T10 on unix_grep.app, and perform the same search. You will get then 3 more files in the grep_symlinks_folder: T10n0279.txt, T10n0293.txt and T10n0294.txt. You can do any other searches on these files if you drop the grep_symlinks_folder onto unix_grep.app (you probably should set the options save_symlink and add_to_symlink to 0).
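The two-pass "AND" technique and the add_to_symlink option can be summarized in a short Python sketch. It assumes, as a simplification, that the first pass links every first-level file containing the term into a folder of symbolic links; the real droplet does this through AppleScript and egrep, and the name save_symlinks and its parameters are hypothetical.

```python
import os
import re
from pathlib import Path

def save_symlinks(source_folder, pattern, link_folder, ext="txt", add_to_symlink=0):
    """Link each *.ext file of source_folder containing `pattern` into link_folder.

    With add_to_symlink=0, existing links are removed first.
    Returns the names of the files that matched.
    (Hypothetical helper, for illustration only.)
    """
    regex = re.compile(pattern)
    link_folder = Path(link_folder)
    link_folder.mkdir(exist_ok=True)
    if not add_to_symlink:
        for old in link_folder.iterdir():
            if old.is_symlink():
                old.unlink()
    linked = []
    for path in sorted(Path(source_folder).glob(f"*.{ext}")):
        if regex.search(path.read_text(encoding="utf-8")):
            link = link_folder / path.name
            if not os.path.lexists(link):
                os.symlink(path.resolve(), link)
            linked.append(path.name)
    return linked
```

A second, ordinary search for the other term, run on the folder of links, then only sees the files kept by the first pass, which is what makes the combined result an "AND" search.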
The last option, special_scripts, was explained above, in the "Notes on the new version 0.7.1".
If you want to perform searches in Arabic (or certainly Hebrew or probably Devanagari or other languages/scripts using ligatures), you have to set this option to 1. The list selecting window will be skipped, and the search result file will be opened directly in your default application. You will have to select one line of this result file, and run the script open_file_fromGrepRes, to open the target file, and select the target term. -- This is due to a bug in AppleScript, and this was the only way I could work around it.
Note that you can set this option to 1 for other languages/scripts, if you don't need the list selecting window. It is certainly faster, and more robust than when using the list selecting window. So, if you think that the search result will be very large, it will probably be better to use this option.

Supplementary notes:

A. You can use the recursive option to search files in nested folders inside one folder. For example, if you have a folder named "Taisho", in which you have folders such as T01, T02, T03... T85, you can search all the files in these sub-folders with the option recursive set to 1 (a search for the term "摩訶迦羅天" in all the CBETA Taisho files -- which finds 10 matched lines -- takes less than one minute on my machine, a now rather slow PowerPC G4 Dual 867 MHz. The time needed for the search seems to depend more on the number of hits than the number of files to be parsed...).
You can also drop more than one folder or file onto unix_grep.app. But you can perform more sophisticated searches if you use symlinked folders, and for that, you can use another utility of mine, named make_symlink.app, which you will find on my page Make Symlink. For example, you can do something like the following:

Make a new empty folder where you want, and name it, for example, "agama";
Locate your folders T01 and T02, and drag and drop them onto the icon of make_symlink.app;
A folder choosing dialog will ask you to select the folder you want: you would select the newly created folder "agama".
That's all: you will have symbolic linked folders of your T01 and T02 folders inside your folder "agama"; you would drag and drop this folder, "agama", onto unix_grep.app, to search all the files in your original T01 and T02 folders (the option recursive must be set to 1).
You can use the same technique to perform other kind of searches: for example, you would locate all the files whose translator is 鳩摩羅什, gather symbolic linked files of these files in a folder named "translations_kumarajiva", and search terms in these files, etc.
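The idea of a collection folder made of symbolic links to other folders can be sketched in Python as follows. This is not how make_symlink.app is implemented (it is an AppleScript utility whose code is not shown here); the names link_folders_into and list_text_files are hypothetical.

```python
import os
from pathlib import Path

def link_folders_into(collection, folders):
    """Create, inside `collection`, a symbolic link to each of `folders`."""
    collection = Path(collection)
    collection.mkdir(exist_ok=True)
    for folder in folders:
        folder = Path(folder).resolve()
        link = collection / folder.name
        if not os.path.lexists(link):
            os.symlink(folder, link, target_is_directory=True)

def list_text_files(root, ext="txt"):
    """List the *.ext files under `root`, descending into linked folders."""
    found = []
    # followlinks=True is needed: os.walk skips symlinked folders by default.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            if name.endswith(f".{ext}"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

For instance, link_folders_into("agama", ["T01", "T02"]) followed by list_text_files("agama") would list the files of both original folders, which is the set of files a recursive search on "agama" has to cover.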
B. I recommend verifying the settings of unix_grep.app each time before you use it. To do this, double-click on its icon; you will see the list selecting window showing the current settings. Simply hit the Return key if you are satisfied with the settings; otherwise, select an item to change it.
C. You should learn also how egrep works, and what wildcard characters can be used. Please have a look at (for example):
http://www.wellho.net/regex/grep.html
D. Due to a bug in AppleScript's droplet mechanism, file or folder names in Arabic (or other "special" languages) will not be recognized. In such cases, the best is simply to change these file/folder names into Roman names. But the search itself can be done if you put your files/folders with "special" language names in a folder with a Roman name. You can put symlinked files/folders in a folder with a Roman name as well (don't forget to set the option recursive to 1 if the text files are inside sub-folders...).
E. A final note of warning: I think unix_grep.app is rather robust, but it is a simple AppleScript utility: you should NEVER search for words which may occur more than one or two thousand times. For example, NEVER try to search for "佛" in the whole Taisho Canon! That would certainly crash the application, and perhaps even the system!!

Download

Please download the package from this link (171K to download).
I would appreciate any feedback, comments, bug reports or requests.
Thank you!



This page was last built with Frontier on a Macintosh on Wed, Sep 26, 2007 at 1:39:34 PM. Thanks for checking it out! Nobumi Iyanaga