Why should I use String Search Algorithms?
Available algorithms (single string):
- Horspool (or Boyer-Moore-Horspool)
- Shift-And
- Shift-Or
- Knuth-Morris-Pratt
- Sunday
- BNDM (Backward Nondet. Dawg Matching)
- BOM (Backward Oracle Matching)
Available algorithms (multiple strings):
- Set Backward Oracle Matching
- Aho-Corasick
- Set Horspool
- Wu-Manber
String searching is about finding a pattern in a longer text. The support of the Java Standard API for searching strings is limited - one may get an impression here. There exist many algorithms that process text more efficiently. Yet the majority of these algorithms is only described in papers or books or listed on websites (and the language they are specified is usually C) - and not available as library (in Java).
StringSearchAlgorithms provides a Java library of various algorithms to support you in string searching - supporting single patterns and multi patterns. While most algorithms outperform the naive algorithm provided by
the Java-API (via String.indexOf
), the performance of the algorithms differs in the dimensions of:
- size of the alphabet (e.g. {A,C,G,T} vs [A-Za-z])
- size of the pattern (e.g. CG vs ACGCTACGACGCTTTTTACGAC)
- the number of patterns to match in one pass
Some heuristics how to choose the best algorithm for your scenario will be presented in Choosing the Best Algorithm.
How does StringSearchAlgorithms compare to other libraries or handcoded algorithms? Pretty well, but note that the main design goals of StringSearchAlgorithms were:
- correctness of all algorithms
- common api for all algorithms
- common features for all algorithms
Together with StringBench StringSearchAlgorithm is the best tested library known to us. Many other libraries do not even deliver correct results in all tested scenarios. Yet there are two libraries known to us that are correct and faster than StringSearchAlgorithms (actually their outperformance is limited to certain scenarios and the feature set is not as rich).
Usage
Available StringFinderOptions:
- MatchOption.LONGEST_MATCH:
only the longest match starting at one char will be reported
- MatchOption.NON_OVERLAP:
overlapping matches are not reported
- MatchOption.NON_EMPTY:
empty matches (strings of length 0) are not reported
Searching Strings:
To search a single string:
- Select a search algorithm, e.g. Horspool
- Create a Finder
- Find the next match with:
- Find all matches (may overlap):
Horspool stringSearch = new Horspool("word");
CharProvider text = new StringCharProvider("text with word in it", 0);
StringFinder finder = stringSearch.createFinder(text);
StringMatch next = finder.findNext();
List<StringMatch> all = finder.findAll();
Restricting to Non-Overlapping Results:
The basic string finder reports all matches, i.e. a match may overlap or subsume another. Of course this is only relevant for algorithms searching multiple strings. To restrict their matches to the longest non-overlapping ones:
- Select the multi-string search algorithm, e.g. AhoCorasick
- Create a Finder
- Finding matches is the same as above
AhoCorasick stringSearch = new AhoCorasick(asList("word", "longer word"));
CharProvider text = new StringCharProvider("text with longer word and word in it", 0);
StringFinder finder = stringSearch.createFinder(text, LONGEST_MATCH, NON_OVERLAP);
Choosing the Best Algorithm
For Searching a Single String
We compared each single-word algorithm in a benchmark, characterized by
- pattern size (the number of chars in the pattern)
- alphabet size (the number of different chars used in the text)
- ShiftOr for smaller patterns (length <= 16) with small alphabet (<= 4 chars)
- BNDM for middle-sized patterns (16 <= length <= 64) with small to mid-sized alphabet (<= 16 chars)
- Horspool/Sunday is most efficient for large alphabets and declines with greater patterns
- BOM is most efficient for large patterns and declines with greater alphabets
For Searching Multiple Strings
We compared each multi-word algorithm in a benchmark, characterized by
- pattern number (the number of words to search)
- pattern size (the number of chars in the pattern)
- alphabet size (the number of different chars used in the text)
- searching for a small pattern number prefers the naive implementation, yet we cannot explain why (benchmark problem?)
- small patterns with small alphabet are dominated by AhoCorasick
- large patterns are dominated by SetBackwardOracleMatching
- large alphabets are dominated by SetHorspool
- for middle sized pattern number, middle sized alphabets and middle-sized patterns WuManber is slightly more efficient than SetHorspool and SetBackwardOracleMatching
The Java SDK method to find strings, String.indexOf(String)
, uses naive string search. This is sufficient if patterns are small and few.
Yet this method does not scale for any dimension.
Yet the Java SDK offers a hidden string search in its Regexp package. java.util.Pattern
with a constant pattern (i.e no regex control characters) will create
java.util.Matcher
s that use the algorithm of Boyer-Moore. This is very similar to ours Horspool and Sunday - in implementation and performance.
Benchmarks
To keep track of the performance of each algorithm we created the benchmark project StringBench.
StringBench compares all available algorithms (of String Search Algorithms and other libraries) in different scenarios (differing in number of patterns, size of patterns and size of alphabet).
The benchmark results can be found here.