Moman

Description

Moman was meant to be a suite of tools for an orthographic/grammatical checker, along with the checker itself. The project is now mostly dead, but I encourage you to look through the code and use it as inspiration or reference. The tools are currently written in Python; a while back I started rewriting them in Lisp, but that rewrite will never be finished. The suite consists of the following tools:

FineNight is the FSA library.
An FST library (not yet implemented).
ZSpell is the orthographic checker.

The only part of the suite really worth mentioning is the "Fast String Correction" implementation, which is used by Lucene's FuzzyQuery. Michael McCandless's article describes how this project came to be included in Lucene.

FineNight

The FineNight library contains many algorithms for finite-state automata (FSAs), including:

Union of two FSAs
Intersection of two FSAs
Complement of an FSA
Difference of two FSAs
Reversal of an FSA
Closure of an FSA
Concatenation of two FSAs
Determinization of an NFA
Equivalence test
Minimization algorithm
Construction of an IADFA from a sorted dictionary
Graphviz support
Error-Tolerant IADFA (featured in Michael McCandless's article: http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html)
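
To illustrate one entry from the list above, the intersection of two deterministic automata is commonly computed with the product construction: run both automata in lockstep and accept only when both accept. The sketch below is not FineNight's code. The tuple representation `(transitions, start, finals)` and the names `intersect`, `dfa_accepts`, `even_a` and `ends_b` are assumptions made up for this example.

```python
def intersect(left, right):
    """Product construction over the state pairs reachable from the start."""
    l_trans, l_start, l_finals = left
    r_trans, r_start, r_finals = right
    # Only symbols known to both automata can appear in a common word.
    alphabet = {s for (_, s) in l_trans} & {s for (_, s) in r_trans}
    start = (l_start, r_start)
    transitions = {}
    seen = {start}
    todo = [start]
    while todo:
        pair = todo.pop()
        l_state, r_state = pair
        for symbol in alphabet:
            l_next = l_trans.get((l_state, symbol))
            r_next = r_trans.get((r_state, symbol))
            if l_next is None or r_next is None:
                continue  # one side is stuck, so the pair is stuck too
            target = (l_next, r_next)
            transitions[(pair, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A pair is final only when both of its components are final.
    finals = {(l, r) for (l, r) in seen if l in l_finals and r in r_finals}
    return transitions, start, finals


def dfa_accepts(dfa, word):
    """Follow the word through the DFA; a missing transition rejects."""
    transitions, state, finals = dfa
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


# Hypothetical example automata over the alphabet {a, b}.
# even_a accepts words with an even number of "a".
even_a = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# ends_b accepts words whose last symbol is "b".
ends_b = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
both = intersect(even_a, ends_b)
```

With these two automata, `dfa_accepts(both, "aab")` is true, while `"ab"` (odd number of "a") and `"aa"` (does not end in "b") are rejected. Union can be obtained from the same construction by making a pair final when either component is final, provided both automata are complete.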

Almost all of the algorithms were taken from the book Introduction to Automata Theory, Languages, and Computation. The minimization algorithm is an implementation of Brzozowski's method: the (possibly non-deterministic) automaton is reversed, determinized, reversed again and determinized again. I'll eventually add Hopcroft's n log n minimization algorithm.
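
The reverse/determinize/reverse/determinize sequence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FineNight implementation: an automaton is assumed to be a tuple `(transitions, starts, finals)` where `transitions` maps `(state, symbol)` to a set of states, and the names `reverse`, `determinize`, `minimize`, `accepts`, `count_states` and `trie` are hypothetical. The dead state is left implicit.

```python
from collections import defaultdict


def reverse(nfa):
    """Flip every transition and swap the start and final state sets."""
    transitions, starts, finals = nfa
    flipped = defaultdict(set)
    for (src, symbol), targets in transitions.items():
        for dst in targets:
            flipped[(dst, symbol)].add(src)
    return dict(flipped), set(finals), set(starts)


def determinize(nfa):
    """Subset construction that only builds reachable subsets."""
    transitions, starts, finals = nfa
    alphabet = {symbol for (_, symbol) in transitions}
    start = frozenset(starts)
    dfa_transitions = {}
    seen = {start}
    todo = [start]
    while todo:
        subset = todo.pop()
        for symbol in alphabet:
            target = frozenset(
                dst
                for src in subset
                for dst in transitions.get((src, symbol), ())
            )
            if not target:
                continue  # no move on this symbol: implicit dead state
            dfa_transitions[(subset, symbol)] = {target}
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is final when it contains at least one original final state.
    dfa_finals = {subset for subset in seen if subset & finals}
    return dfa_transitions, {start}, dfa_finals


def minimize(nfa):
    """Brzozowski's method: reverse, determinize, reverse, determinize."""
    return determinize(reverse(determinize(reverse(nfa))))


def accepts(nfa, word):
    """Simulate the automaton by tracking the set of active states."""
    transitions, starts, finals = nfa
    current = set(starts)
    for symbol in word:
        current = {
            dst
            for src in current
            for dst in transitions.get((src, symbol), ())
        }
    return bool(current & finals)


def count_states(nfa):
    """Count every state mentioned anywhere in the automaton."""
    transitions, starts, finals = nfa
    states = set(starts) | set(finals)
    for (src, _), targets in transitions.items():
        states.add(src)
        states |= targets
    return len(states)


# Hypothetical input: a five-state trie for the words "ab" and "cb".
trie = (
    {(0, "a"): {1}, (0, "c"): {2}, (1, "b"): {3}, (2, "b"): {4}},
    {0},
    {3, 4},
)
minimal = minimize(trie)
```

Here the two branches of the trie share the suffix "b", so `count_states(minimal)` is 3 instead of 5, and `minimal` still accepts exactly "ab" and "cb". The method is simple to implement on top of existing reversal and determinization routines, but the intermediate subset construction can blow up exponentially, which is one reason to prefer Hopcroft's algorithm for large automata.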

ZSpell

ZSpell is meant to be a competitor to aspell, which was written by Kevin Atkinson. At this time, ZSpell can suggest words at a Levenshtein distance of one. We used to rely on Kemal Oflazer's algorithm, which is very slow; we now use a faster one (Schulz and Mihov's algorithm). However, the faster algorithm only handles substitution, removal and insertion, which means that transposition errors such as "ehllo" -> "hello" count as two operations.
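
The effect of leaving out transpositions can be checked with a plain dynamic-programming edit distance. Note that this is the textbook table-based computation, not Schulz and Mihov's automaton-based algorithm and not ZSpell's code; the function name `edit_distance` and its `transpositions` flag are assumptions made for this example. With the flag enabled, the function computes the restricted variant in which two adjacent characters may be swapped once.

```python
def edit_distance(source, target, transpositions=False):
    """Edit distance with substitution, removal and insertion.

    When transpositions is True, swapping two adjacent characters
    also counts as a single operation.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # remove every character of the prefix
    for j in range(cols):
        dist[0][j] = j  # insert every character of the prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # removal
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                # adjacent characters swapped
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

For example, `edit_distance("ehllo", "hello")` is 2, so a checker limited to distance one will not suggest "hello", whereas `edit_distance("ehllo", "hello", transpositions=True)` is 1.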

TODOs include:

Add transposition errors to the Levenshtein-distance algorithm.
Add phonetic errors (spelling by sound).
Add derivation errors.

References

John E. Hopcroft, Rajeev Motwani and Jeffrey D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd edition, Addison-Wesley, 2001.
J. A. Brzozowski, Canonical regular expressions and minimal state graphs for definite events, in Mathematical Theory of Automata, Volume 12 of MRI Symposia Series, pp. 529-561, Polytechnic Press, Polytechnic Institute of Brooklyn, N.Y., 1962.
John E. Hopcroft, An n log n algorithm for minimizing the states in a finite automaton, in The Theory of Machines and Computations, Z. Kohavi (ed.), pp. 189-196, Academic Press, 1971.
Kemal Oflazer, Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction, Computational Linguistics, 22(1), pp. 73--89, March, 1996.
Klaus U. Schulz and Stoyan Mihov, Fast String Correction with Levenshtein-Automata, International Journal of Document Analysis and Recognition, 5(1):67--85, 2002.
Zbigniew J. Czech, George Havas and Bohdan S. Majewski, An Optimal Algorithm for Generating Minimal Perfect Hash Functions, Information Processing Letters, 43(5):257--264, 1992.