String Search Algorithms

StringSearchAlgorithms provides highly efficient algorithms for (multiple) string search (string matching), e.g. Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick. StringSearchAlgorithms supports search in Strings, char-arrays, byte-arrays, InputStreams, Readers and Buffers.

Get String Search Algorithms Github

Why should I use String Search Algorithms?

Available algorithms (single string):

  • Horspool (or Boyer-Moore-Horspool)
  • Shift-And
  • Shift-Or
  • Knuth-Morris-Pratt
  • Sunday
  • BNDM (Backward Nondet. Dawg Matching)
  • BOM (Backward Oracle Matching)

Available algorithms (multiple strings):

  • Set Backward Oracle Matching
  • Aho-Corasick
  • Set Horspool
  • Wu-Manber

String searching is about finding a pattern in a longer text. The support of the Java Standard API for searching strings is limited - one may get an impression here. There exist many algorithms that process text more efficiently. Yet the majority of these algorithms is only described in papers or books or listed on websites (and the language they are specified is usually C) - and not available as library (in Java).

StringSearchAlgorithms provides a Java library of various algorithms to support you in string searching - supporting single patterns and multi patterns. While most algorithms outperform the naive algorithm provided by the Java-API (via String.indexOf), the performance of the algorithms differs in the dimensions of:

  • size of the alphabet (e.g. {A,C,G,T} vs [A-Za-z])
  • size of the pattern (e.g. CG vs ACGCTACGACGCTTTTTACGAC)
  • the number of patterns to match in one pass

Some heuristics how to choose the best algorithm for your scenario will be presented in Choosing the Best Algorithm.

How does StringSearchAlgorithms compare to other libraries or handcoded algorithms? Pretty well, but note that the main design goals of StringSearchAlgorithms were:

  • correctness of all algorithms
  • common api for all algorithms
  • common features for all algorithms

Together with StringBench StringSearchAlgorithm is the best tested library known to us. Many other libraries do not even deliver correct results in all tested scenarios. Yet there are two libraries known to us that are correct and faster than StringSearchAlgorithms (actually their outperformance is limited to certain scenarios and the feature set is not as rich).

Usage

Available StringFinderOptions:

  • MatchOption.LONGEST_MATCH:

    only the longest match starting at one char will be reported

  • MatchOption.NON_OVERLAP:

    overlapping matches are not reported

  • MatchOption.NON_EMPTY:

    empty matches (strings of length 0) are not reported

Searching Strings:

To search a single string:

  • Select a search algorithm, e.g. Horspool
  • Create a Finder
  • Horspool stringSearch = new Horspool("word");
    CharProvider text = new StringCharProvider("text with word in it", 0);
    StringFinder finder = stringSearch.createFinder(text);
  • Find the next match with:
  • StringMatch next = finder.findNext();
  • Find all matches (may overlap):
  • List<StringMatch> all = finder.findAll();

Restricting to Non-Overlapping Results:

The basic string finder reports all matches, i.e. a match may overlap or subsume another. Of course this is only relevant for algorithms searching multiple strings. To restrict their matches to the longest non-overlapping ones:

  • Select the multi-string search algorithm, e.g. AhoCorasick
  • Create a Finder
  • AhoCorasick stringSearch = new AhoCorasick(asList("word", "longer word"));
    CharProvider text = new StringCharProvider("text with longer word and word in it", 0);
    StringFinder finder = stringSearch.createFinder(text, LONGEST_MATCH, NON_OVERLAP);
  • Finding matches is the same as above

Choosing the Best Algorithm

For Searching a Single String

We compared each single-word algorithm in a benchmark, characterized by

  • pattern size (the number of chars in the pattern)
  • alphabet size (the number of different chars used in the text)
From experimental benchmarks following heuristics could be derived:
  • ShiftOr for smaller patterns (length <= 16) with small alphabet (<= 4 chars)
  • BNDM for middle-sized patterns (16 <= length <= 64) with small to mid-sized alphabet (<= 16 chars)
  • Horspool/Sunday is most efficient for large alphabets and declines with greater patterns
  • BOM is most efficient for large patterns and declines with greater alphabets

For Searching Multiple Strings

We compared each multi-word algorithm in a benchmark, characterized by

  • pattern number (the number of words to search)
  • pattern size (the number of chars in the pattern)
  • alphabet size (the number of different chars used in the text)
From experimental benchmarks following heuristics could be derived:
  • searching for a small pattern number prefers the naive implementation, yet we cannot explain why (benchmark problem?)
  • small patterns with small alphabet are dominated by AhoCorasick
  • large patterns are dominated by SetBackwardOracleMatching
  • large alphabets are dominated by SetHorspool
  • for middle sized pattern number, middle sized alphabets and middle-sized patterns WuManber is slightly more efficient than SetHorspool and SetBackwardOracleMatching

String searching with Java SDK:

The Java SDK method to find strings, String.indexOf(String), uses naive string search. This is sufficient if patterns are small and few. Yet this method does not scale for any dimension.

Yet the Java SDK offers a hidden string search in its Regexp package. java.util.Pattern with a constant pattern (i.e no regex control characters) will create java.util.Matchers that use the algorithm of Boyer-Moore. This is very similar to ours Horspool and Sunday - in implementation and performance.

Benchmarks

To keep track of the performance of each algorithm we created the benchmark project StringBench.

StringBench compares all available algorithms (of String Search Algorithms and other libraries) in different scenarios (differing in number of patterns, size of patterns and size of alphabet).

The benchmark results can be found here.

Authors

Stefan Mandel