String searching is about finding a pattern in a longer text. The support of the Java Standard API for searching strings is limited - one may get an impression here . There exist many algorithms that process text more efficiently. Yet the majority of these algorithms is only described in papers or books or listed on websites (and the language they are specified is usually C) - and not available as library (in Java).
StringSearchAlgorithms provides a Java library of various algorithms to support you in string searching - supporting single patterns and multi patterns. While most algorithms outperform the naive algorithm provided by
the Java-API (via
String.indexOf), the performance of the algorithms differs in the dimensions of:
only the longest match starting at one char will be reported
overlapping matches are not reported
empty matches (strings of length 0) are not reported
To search a single string:
Horspool stringSearch = new Horspool("word"); CharProvider text = new StringCharProvider("text with word in it", 0); StringFinder finder = stringSearch.createFinder(text);
StringMatch next = finder.findNext();
List<StringMatch> all = finder.findAll();
The basic string finder reports all matches, i.e. a match may overlap or subsume another. Of course this is only relevant for algorithms searching multiple strings. To restrict their matches to the longest non-overlapping ones:
AhoCorasick stringSearch = new AhoCorasick(asList("word", "longer word")); CharProvider text = new StringCharProvider("text with longer word and word in it", 0); StringFinder finder = stringSearch.createFinder(text, LONGEST_MATCH, NON_OVERLAP);
We compared each single-word algorithm in a benchmark, characterized by
We compared each multi-word algorithm in a benchmark, characterized by
The Java SDK method to find strings,
String.indexOf(String), uses naive string search. This is sufficient if patterns are small and few.
Yet this method does not scale for any dimension.
Yet the Java SDK offers a hidden string search in its Regexp package.
java.util.Pattern with a constant pattern (i.e no regex control characters) will create
java.util.Matchers that use the algorithm of Boyer-Moore. This is very similar to ours Horspool and Sunday - in implementation and performance.
To keep track of the performance of each algorithm we created the benchmark project StringBench .
StringBench compares all available algorithms (of String Search Algorithms and other libraries) in different scenarios (differing in number of patterns, size of patterns and size of alphabet).
The benchmark results can be found here .