Title: 15618 Project Parallel String Matching Algorithms

Team Members: Abhishek Kumar and Runze Wang

Project Proposal

Summary

We are going to work on Parallelelizing String Matching algorithms like Aho-Corasick, KMP, and Rabin Karp algorithms. These string search algorithms are sequential in nature, hence not efficient for large scale tasks like DNA Sequence Matching, Network Intrusion Detection (NID). We wish to explore their parallel implementations on GPUs.

Link to Project Proposal

Background

The Aho-Corasick algorithm is used to find matching substrings for a set of patterns in a large input string. For example, given dictionary of words {a, ab, bab, bc, bca, c, caa} and input text abracadabra, it must output a occurs at indexes 0, 3, 5, 7, 10, ab occurs at 0, 7, bab doesn't occur anywhere,.... and so on.

So it works on multiple patterns. It creates a Finite State Automaton which is similar to trie but with additional links which help in faster transition between states when matching fails. It was used in Unix command fgrep. It is used in various applications like network intrusion detection and bioinformatics for finding several input strings within a given large input string. If n is the length of the text and m be total number of characters in the search dictionary, the Aho-Corasick algorithm works in O(n+m+z) complexity where z is the total number of occurrences in text.

KMP algorithm is a linear time searching algorithm which given a pattern and text, finds all occurrences of the pattern in the text. If the length of the text is n and m is the length of the pattern, then KMP works in Time Complexity O(n+m).

Rabin Karp is another string matching algorithm which uses hashing. It uses a rolling hash to eliminate positions of the text that cannot match the pattern, and then checks for the exact match at the positions where hash of the pattern matches the hash of the rolling window.

All these algorithms are very good for matching patterns sequentially. In this project, we will explore their adaptations in parallel domain especially over large input text size. Network Intrusion Detection Systems (NIDS) use string matching engine to identify network attacks by inspecting packet content against thousands of predefined patterns. Due to the increasing number of attacks, traditional string matching approaches on uni-processors may not be sufficient. Parallel String matching can be very helpful for other applications too like DNA Sequence matching, large scale text mining and digital forensics.

The Challenge

One of the biggest challenges for this project is that there isn’t enough literature around it. The workload will be a large number of very large input text. We would look at different numbers of patterns (small, moderate, large). But, we will assume that size of pattern will be much less than the size of the text. Based on the implementation, there might be less locality in the data. We will get to know more about this once we do our experiments.

Resources

We will be using GPUs on GHC machines mostly for our work. We might consider PSC machines if we end up using MPI or OpenMP. But for now, the plan is to work with CUDA.

We found a parallel implementation of Parallel Failureless Aho-Corasick algorithm at Parallel Failure Less Aho-Corasick. PFAC is a modified version of Aho-Corasick where the links used in case of failed matches are absent which might be beneficial for some applications. We will explore this in our project.

We will try to run and benchmark our implementation against it. We plan to have our own implementation for both parallel and sequential versions of the algorithm. We will be taking help from this paper regarding Parallel Aho-Corasick algorithms (PFAC) Accelerating String Matching Using Multi-threaded Algorithm on GPU

Goals and Deliverables

75% Goals: Sequential and CUDA Parallel implementation of Aho-Corasick, KMP, and Rabin Karp Algorithms

100% Goals: 75% Goals + A very good analysis of the performance of sequential algorithms vs parallel algorithms, analyzing bottlenecks in workload and comparable performance against the reference implementation of PFAC.

125% Goals: 100% Goals + Trying other parallelization techniques like SIMD/Cilk/OpenMP/MPI/Charm OR achieve significantly better performance than reference implementation of PFAC.

Platform Choice

We aim to use GPU in the GHC machines. General Purpose GPUs have been very successful at a lot of parallel tasks and we feel that we should explore their use-case on the parallel string matching problem. A good performance on GPU will mean that many systems like Network Intrusion Detection and DNA Sequence matching can use GPUs for improving performance.

Schedule

Time	Work
13 Nov - 19 Nov	Implement Sequential Versions of Algorithms
20 Nov - 26 Nov	Implement Parallel version of Aho-Corasick
27 Nov - 30 Nov	Work towards Project Milestone
1 Dec - 7 Dec	Work on Parallel version of KMP and Rabin Karp
8 Dec - 14 Dec	Do Analysis over different algorithms
14 Dec - 17 Dec	Work towards Final Project Report and Poster

Milestone Check-in

Link to Milestone Report

Summary of progress

At this point, we have completed implementing sequential versions of Aho-Corasick, KMP, and Rabin Karp pattern matching algorithms. We have also completed implementing parallel versions of Aho-Corasick, KMP, and Rabin Karp via OpenMP. In particular, we parallelized the construction of AC automata by utilizing a lock-free Trie insertion algorithm and parallel BFS traversal for constructing fail links. We have also implemented an initial version of parallel failureless Aho-Corasick using CUDA as described in the project proposal. We have finished initial rounds of benchmarking to make sure the parallel implementations exhibit speedup against the sequential implementation. The current parallel implementations of string matching algorithms via OpenMP and parallel failureless Aho-Corasick via CUDA both show great speedups against the serial implementation.

Compared with the schedule in the project proposal, we believe that we are on the schedule if not being ahead of schedule. We believe that we are still able to finish the deliverables mentioned in the project proposal, namely (1) a CUDA implementation of KMP and Rabin Karp and (2) benchmarking different implementations and reasoning about the performance of different implementations under varying assumptions about the patterns and the text.

Updated Schedule

Time	Task	Who?
1 Dec - 5 Dec	Finish CUDA-based implementations of Rabin Karp and KMP	Both
6 Dec - 10 Dec	Design benchmarks to measure speedup, memory access pattern, and bandwidth utilization	Both
11 Dec - 14 Dec	Perform experiments and iteratively improve implementations	Both
15 Dec - 17 Dec	Finish Report and Poster	Both

Plan for Poster Session

In the poster session, we will present our analysis using graphs and plots and we will show different benchmarks that we use to reason about the performance of different string search implementations on different kinds of queries (patterns).

Potential Issues

We don't foresee any potential problems as of now. However, designing good benchmarks can be nontrivial. We need to analyze the algorithm theoretically, think about common applications of pattern searching algorithms, find relevant corpus for evaluation, conduct the experiment on GHC or even PSC machines, and iteratively improve our implementations based on the feedback of profilers. Our analysis will not be restricted to the speedup. We will also reason about why the data pattern makes a particular algorithm good or bad.

Finals

Link to Final Report

Link to Poster Slides

Link to Source Code

Conclusion

Our experimental results suggest that Rabin-Karp has relatively stable, though poor, performance. This arises from the fact that Rabin-Karp has regular memory access patterns. For each position of the text, Rabin-Karp does a constant amount of work, except if there are matches and we want to use brute-force to verify hashes do not give us false matches. Rabin-Karp is slow because it uses expensive operations like multiplication and taking modulus, which are costly on modern CPU architectures. Over a single pattern, KMP performs better than Rabin-Karp in general, though KMP achieves greater speedup when branching predictors can predict whether we are going to match against the pattern further or if we are not going to match against the pattern further. KMP does not perform well if at each point, we have 50% chance of having a match and 50% chance of not having a match. Even though Aho-Corasick is theoretically equivalent to KMP over a single pattern, the less compact representation leads to a higher cache miss rate than KMP, so the performance is slightly worse.

Over multiple patterns, the sequential and OpenMP implementations of Aho-Corasick have huge advantages over sequential and OpenMP implementations of Rabin-Karp and KMP. This is because Aho-Corasick does not need to iterate through the text for multiple times when there are multiple patterns. Our experimental results suggest that in general, implementing Aho-Corasick on CUDA using the strategy where each thread is responsible for one possible starting position of the match leads to best results, though this strategy seems to suffer when for most positions, we are able to match the pattern for extended length.

Finally, our experiments on parallel construction of Aho-Corasick automata using OpenMP show that lock-free insertion algorithms to Trie and parallel BFS can often lead to some, though not perfect, speedup. When the size of the patterns does not fit into L2 cache or L3 cache, though, the matching phase would slow down significantly due to cache misses and high cost of communication.