StringSearchAlgorithms - Efficient String Search Algorithms in Java

Why should I use String Search Algorithms?

Available algorithms (single string):

Horspool (or Boyer-Moore-Horspool)
Shift-And
Shift-Or
Knuth-Morris-Pratt
Sunday
BNDM (Backward Nondet. Dawg Matching)
BOM (Backward Oracle Matching)

Available algorithms (multiple strings):

Set Backward Oracle Matching
Aho-Corasick
Set Horspool
Wu-Manber

String searching is about finding a pattern in a longer text. The support of the Java Standard API for searching strings is limited - one may get an impression here. There exist many algorithms that process text more efficiently. Yet the majority of these algorithms is only described in papers or books or listed on websites (and the language they are specified is usually C) - and not available as library (in Java).

StringSearchAlgorithms provides a Java library of various algorithms to support you in string searching - supporting single patterns and multi patterns. While most algorithms outperform the naive algorithm provided by the Java-API (via String.indexOf), the performance of the algorithms differs in the dimensions of:

size of the alphabet (e.g. {A,C,G,T} vs [A-Za-z])
size of the pattern (e.g. CG vs ACGCTACGACGCTTTTTACGAC)
the number of patterns to match in one pass

Some heuristics how to choose the best algorithm for your scenario will be presented in Choosing the Best Algorithm.

How does StringSearchAlgorithms compare to other libraries or handcoded algorithms? Pretty well, but note that the main design goals of StringSearchAlgorithms were:

correctness of all algorithms
common api for all algorithms
common features for all algorithms

Together with StringBench StringSearchAlgorithm is the best tested library known to us. Many other libraries do not even deliver correct results in all tested scenarios. Yet there are two libraries known to us that are correct and faster than StringSearchAlgorithms (actually their outperformance is limited to certain scenarios and the feature set is not as rich).

Usage

Available StringFinderOptions:

MatchOption.LONGEST_MATCH:
only the longest match starting at one char will be reported
MatchOption.NON_OVERLAP:
overlapping matches are not reported
MatchOption.NON_EMPTY:
empty matches (strings of length 0) are not reported

Searching Strings:

To search a single string:

Select a search algorithm, e.g. Horspool
Create a Finder

Horspool stringSearch = new Horspool("word");
CharProvider text = new StringCharProvider("text with word in it", 0);
StringFinder finder = stringSearch.createFinder(text);

Find the next match with:

StringMatch next = finder.findNext();

Find all matches (may overlap):

List<StringMatch> all = finder.findAll();

Restricting to Non-Overlapping Results:

The basic string finder reports all matches, i.e. a match may overlap or subsume another. Of course this is only relevant for algorithms searching multiple strings. To restrict their matches to the longest non-overlapping ones:

Select the multi-string search algorithm, e.g. AhoCorasick
Create a Finder

AhoCorasick stringSearch = new AhoCorasick(asList("word", "longer word"));
CharProvider text = new StringCharProvider("text with longer word and word in it", 0);
StringFinder finder = stringSearch.createFinder(text, LONGEST_MATCH, NON_OVERLAP);

Finding matches is the same as above

Choosing the Best Algorithm

For Searching a Single String

We compared each single-word algorithm in a benchmark, characterized by

pattern size (the number of chars in the pattern)
alphabet size (the number of different chars used in the text)

From experimental benchmarks following heuristics could be derived:

ShiftOr for smaller patterns (length <= 16) with small alphabet (<= 4 chars)
BNDM for middle-sized patterns (16 <= length <= 64) with small to mid-sized alphabet (<= 16 chars)
Horspool/Sunday is most efficient for large alphabets and declines with greater patterns
BOM is most efficient for large patterns and declines with greater alphabets

For Searching Multiple Strings

We compared each multi-word algorithm in a benchmark, characterized by

pattern number (the number of words to search)
pattern size (the number of chars in the pattern)
alphabet size (the number of different chars used in the text)

From experimental benchmarks following heuristics could be derived:

searching for a small pattern number prefers the naive implementation, yet we cannot explain why (benchmark problem?)
small patterns with small alphabet are dominated by AhoCorasick
large patterns are dominated by SetBackwardOracleMatching
large alphabets are dominated by SetHorspool
for middle sized pattern number, middle sized alphabets and middle-sized patterns WuManber is slightly more efficient than SetHorspool and SetBackwardOracleMatching

String searching with Java SDK:

The Java SDK method to find strings, String.indexOf(String), uses naive string search. This is sufficient if patterns are small and few. Yet this method does not scale for any dimension.

Yet the Java SDK offers a hidden string search in its Regexp package. java.util.Pattern with a constant pattern (i.e no regex control characters) will create java.util.Matchers that use the algorithm of Boyer-Moore. This is very similar to ours Horspool and Sunday - in implementation and performance.

Benchmarks

To keep track of the performance of each algorithm we created the benchmark project StringBench.

StringBench compares all available algorithms (of String Search Algorithms and other libraries) in different scenarios (differing in number of patterns, size of patterns and size of alphabet).

The benchmark results can be found here.

Authors

Stefan Mandel