String Search Algorithms

Efficient String Search Algorithms in Java

Get String Search Algorithms Github

Why should I use String Search Algorithms?

Available algorithms (single string):

  • Horspool (or Boyer-Moore-Horspool)
  • Shift-And
  • Shift-Or
  • Knuth-Morris-Pratt
  • Sunday
  • BNDM (Backward Nondet. Dawg Matching)
  • BOM (Backward Oracle Matching)

Available algorithms (multiple strings):

  • Set Backward Oracle Matching
  • Aho-Corasick
  • Set Horspool
  • Wu-Manber

String searching is about finding a pattern in a longer text. The support of the Java Standard API for searching strings is limited - one may get an impression here . There exist many algorithms that process text more efficiently. Yet the majority of these algorithms is only described in papers or books or listed on websites (and the language they are specified is usually C) - and not available as library (in Java).

StringSearchAlgorithms provides a Java library of various algorithms to support you in string searching - supporting single patterns and multi patterns. While most algorithms outperform the naive algorithm provided by the Java-API (via String.indexOf), the performance of the algorithms differs in the dimensions of:

  • size of the alphabet (e.g. {A,C,G,T} vs [A-Za-z])
  • size of the pattern (e.g. CG vs ACGCTACGACGCTTTTTACGAC)
  • the number of patterns to match in one pass
Some heuristics how to choose the best algorithm for your scenario will be presented in a later chapter .

Usage

Available StringFinderOptions:

  • MatchOption.LONGEST_MATCH:

    only the longest match starting at one char will be reported

  • MatchOption.NON_OVERLAP:

    overlapping matches are not reported

  • MatchOption.NON_EMPTY:

    empty matches (strings of length 0) are not reported

Searching Strings:

To search a single string:

  • Select a search algorithm, e.g. Horspool
  • Create a Finder
  • Horspool stringSearch = new Horspool("word");
    CharProvider text = new StringCharProvider("text with word in it", 0);
    StringFinder finder = stringSearch.createFinder(text);
  • Find the next match with:
  • StringMatch next = finder.findNext();
  • Find all matches (may overlap):
  • List<StringMatch> all = finder.findAll();

Restricting to Non-Overlapping Results:

The basic string finder reports all matches, i.e. a match may overlap or subsume another. Of course this is only relevant for algorithms searching multiple strings. To restrict their matches to the longest non-overlapping ones:

  • Select the multi-string search algorithm, e.g. AhoCorasick
  • Create a Finder
  • AhoCorasick stringSearch = new AhoCorasick(asList("word", "longer word"));
    CharProvider text = new StringCharProvider("text with longer word and word in it", 0);
    StringFinder finder = stringSearch.createFinder(text, LONGEST_MATCH, NON_OVERLAP);
  • Finding matches is the same as above

Choosing the Best Algorithm

For Searching a Single String

We compared each single-word algorithm in a benchmark, characterized by

  • pattern size (the number of chars in the pattern)
  • alphabet size (the number of different chars used in the text)
From experimental benchmarks following heuristics could be derived:
  • ShiftOr for smaller patterns (length <= 16) with small alphabet (<= 4 chars)
  • BNDM for middle-sized patterns (16 <= length <= 64) with small to mid-sized alphabet (<= 16 chars)
  • Horspool/Sunday for large patterns with large alphabet
  • BOM for large patterns (length >= 64) with small to mid-sized alphabet

For Searching Multiple Strings

We compared each multi-word algorithm in a benchmark, characterized by

  • pattern number (the number of words to search)
  • pattern size (the number of chars in the pattern)
  • alphabet size (the number of different chars used in the text)
From experimental benchmarks following heuristics could be derived:
  • searching for a small pattern number prefers the naive implementation, yet we cannot explain why (benchmark problem?)
  • large patterns are dominated by SetBackwardOracleMatching
  • small patterns with small alphabet are dominated by AhoCorasick
  • small patterns with large alphabets are dominated by SetHorspool
  • WuManber seems to good for middle-sized pattern number, being competitive between SetHorspool and SetBackwardOracleMatching

String searching with Java SDK:

The Java SDK method to find strings, String.indexOf(String), uses naive string search. This is sufficient if patterns are small and few. Yet this method does not scale for any dimension.

Yet the Java SDK offers a hidden string search in its Regexp package. java.util.Pattern with a constant pattern (i.e no regex control characters) will create java.util.Matchers that use the algorithm of Boyer-Moore. This is very similar to ours Horspool and Sunday - in implementation and performance.

Benchmarks

To keep track of the performance of each algorithm we created the benchmark project StringBench .

StringBench compares all available algorithms (of String Search Algorithms and other libraries) in different scenarios (differing in number of patterns, size of patterns and size of alphabet).

The benchmark results can be found here .

Authors

Stefan Mandel