Moman
Description
This was supposed to be a suite of tools to be used by an orthographic/grammatical checker and the checker itself. However, the project is mainly dead right now. But I encourage you to look through the code and use it as inspiration/reference. The tools are currently coded in Python, but I started a while back to rewrite it in Lisp (which will never be finished). Moman, the suite itself, consist of the following tools:
-
FineNight is the FSA library.
- A FST library. (Not yet implemented)
-
ZSpell is the orthographic checker.
Mostly, the only part of the tools suite which is worthwhile mentioning is the "Fast String Correction" which is used by Lucene's FuzzyQuery. You can read about the inclusion of this project in Lucene by reading Michael McCandless's article.
FineNight
The FineNight library contains many algorithms for Finite State Automatons. That includes:
- Union of two FSAs
- Intersection of two FSAs
- Complement of a FSAs
- Difference of two FSAs
- Reversal of a FSA
- Closure of a FSA
- Concatenation of two FSAs
- Determination of a NFA
- Equivalence test
- Minimization algorithm
- Construction of an IADFA from a sorted dictionary
- Graphviz support
- Error-Tolerant IADFA (starred in Michael McCandless's Mike MChttp://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
Almost all algorithms were taken from the book Introduction to Automata Theory, Languages, and Computation. The minimization algorithm is an implementation of Brzozowski's method. In this method, the (possibly non-deterministic) automaton is reversed, determinized, reversed and determinized. I'll eventually add the Hopcroft's nlog(n) minimization algorithm.
ZSpell
ZSpell is meant to be a concurrent of aspell, made by Kevin Atkinson. At this time, ZSpell can suggest words with a Levenshtein-distance of one. Before we were using Kemal Oflazer's algorithm. This algorithm is very slow, but now we use a faster algorithm (Schulz's and Mihov's algorithm). However, only substitution, removal and insertion are used for the faster algorithm. It means that transpositions errors, like "ehllo" -> "hello", are considered as two operations.
TODOs includes:
- Add transposition errors for Levenshtein-distance algorithm.
- Add phonetic errors (spelling by sound).
- Add derivation errors.
References
-
John E. Hopcroft, Rajeev Motwani and Jefferey D. Ullman, Introduction to Automata Theory, Languages and Computation, 2nd edition, Adison-Wesley, 2001.
-
J. A. Brzozowski,
Canonical regular expressions and minimal state graphs for definite events,
in Mathematical Theory of Automata, Volume 12 of MRI Symposia Series,
pp. 529-561, Polytechnic Press, Polytechnic Institute of Brooklyn, N.Y.,
1962.
-
John E. Hopcroft
,
An n log n algorithm for minimizing the states in a finite automaton
,
in The Theory of Machines and Computations, Z. Kohavi (ed.), pp. 189-196,
Academic Press, 1971.
-
Kemal Oflazer,
Error-tolerant Finite State Recognition with Applications to
Morphological Analysis and Spelling Correction
,
Computational Linguistics, 22(1), pp. 73--89, March, 1996.
-
Klaus U. Schulz
and
Stoyan Mihov,
Fast String Correction with Levenshtein-Automata,
International Journal of Document Analysis and Recognition, 5(1):67--85, 2002.
-
Zbigniew J. Czech
,
George Havas
and
Bohdan S. Majewski,
An Optimal Algorithm for Generating Minimal Perfect Hash Functions
, Information Processing Letters, 43(5):257--264, 1992.