Skip to main content

Aho-Corasick Algorithm (Geeks for Geeks)

1. What is the Aho-Corasick Algorithm?​

The Aho-Corasick algorithm is a powerful and efficient string searching algorithm that can locate multiple patterns in a text simultaneously. It constructs a finite state machine that resembles a digital tree with additional links between nodes to allow fast transitions between different patterns.

2. Algorithm for Aho-Corasick​

  1. Build a Trie from the set of patterns.
  2. Construct failure links to allow the algorithm to skip unnecessary comparisons.
  3. Traverse the text using the Trie and the failure links to find all occurrences of the patterns.

3. How does the Aho-Corasick Algorithm work?​

  • Trie Construction: Insert all the patterns into a Trie.
  • Failure Links: Create links that enable skipping sections of the Trie that do not match the current character in the text.
  • Search: Traverse the text using the Trie, following the failure links whenever a mismatch occurs.

4. Problem Description​

Given a text string and multiple pattern strings, implement the Aho−CorasickAho-Corasick algorithm to find all occurrences of the patterns in the text.

5. Examples​

Example 1:

Input: text = "THIS IS A TEST TEXT", patterns = ["TEST", "TEXT"]
Output: Pattern 'TEST' found at index 10
Pattern 'TEXT' found at index 15

Example 2:

Input: text = "AABAACAADAABAABA", patterns = ["AABA", "CAAD"]
Output: Pattern 'AABA' found at index 0
Pattern 'AABA' found at index 9
Pattern 'AABA' found at index 12
Pattern 'CAAD' found at index 5

Explanation of Example 1:

  • The patterns "TEST" and "TEXT" are found in the text "THIS IS A TEST TEXT" starting from index 10 and 15, respectively.

Visual Example​

Visual Example of Aho-Corasick

6. Constraints​

  • The text and patterns can contain any number of characters.
  • All characters are ASCIIASCII characters.

7. Implementation​

Written by @AyushGovil
from collections import deque, defaultdict

class AhoCorasick:
def __init__(self, keywords):
self.trie = {}
self.out = defaultdict(list)
self.fail = {}
self.build_trie(keywords)
self.build_automaton()

def build_trie(self, keywords):
for keyword in keywords:
node = self.trie
for char in keyword:
node = node.setdefault(char, {})
node['#'] = keyword

def build_automaton(self):
queue = deque()
for key in self.trie:
self.fail[key] = self.trie
queue.append(key)
while queue:
current = queue.popleft()
for key in self.trie[current]:
if key == '#':
continue
fail_node = self.fail[current]
while key not in fail_node and fail_node is not self.trie:
fail_node = self.fail[fail_node]
self.fail[key] = fail_node.get(key, self.trie)
queue.append(key)
self.out[current].extend(self.out[self.fail[current]])
if '#' in self.trie[current]:
self.out[current].append(self.trie[current]['#'])

def search(self, text):
node = self.trie
for i, char in enumerate(text):
while char not in node and node is not self.trie:
node = self.fail[node]
node = node.get(char, self.trie)
if '#' in node:
print(f"Pattern '{node['#']}' found at index {i - len(node['#']) + 1}")

# Example usage:
text = "THIS IS A TEST TEXT"
patterns = ["TEST", "TEXT"]
ac = AhoCorasick(patterns)
ac.search(text)

8. Complexity Analysis​

  • Time Complexity:

    • Preprocessing (building Trie and failure links): O(m)O(m) where mm is the total length of all patterns.
    • Searching: O(n+z)O(n + z) where nn is the length of the text and zz is the number of occurrences of the patterns.
    • Overall: Efficient for multiple pattern matching.
  • Space Complexity: O(m)O(m) for the Trie and failure links.

9. Advantages and Disadvantages​

Advantages:

  • Efficient for searching multiple patterns simultaneously.
  • Uses Trie and failure links to minimize unnecessary comparisons.

Disadvantages:

  • More complex to implement compared to simpler algorithms.
  • Higher space complexity due to the construction of the Trie and failure links.

10. References​

  • GFG Problem: GFG Problem
  • Author's Geeks for Geeks Profile: Ayush Govil