# Advanced Algorithm Design and Analysis (Lecture 3)

SW5, fall 2004
Simonas Šaltenis
E1-215b
[email protected]

## Text-Search Algorithms

Goals of the lecture:

- the naïve text-search algorithm and its analysis;
- the Rabin-Karp algorithm and its analysis;
- the ideas of the Knuth-Morris-Pratt algorithm;
- the Boyer-Moore-Horspool algorithm and its analysis;
- a comparison of the advantages and disadvantages of the different text-search algorithms.

AALG, lecture 3, Simonas Šaltenis, 2004

## Text-Search Problem

Input:
- Text T = "at the thought of", n = length(T) = 17
- Pattern P = "the", m = length(P) = 3

Output: the shift s, the smallest integer (0 ≤ s ≤ n − m) such that T[s .. s+m−1] = P[0 .. m−1]. Returns −1 if no such s exists.

For this input the answer is s = 3, since T[3..5] = "the".

## Naïve Text Search

Idea: brute force. Check all values of s from 0 to n − m.

```
Naive-Search(T, P)
01 for s ← 0 to n − m
02   j ← 0
03   // check if T[s..s+m−1] = P[0..m−1]
04   while j < m and T[s+j] = P[j] do
05     j ← j + 1
06   if j = m return s
07 return −1
```

Let T = "at the thought of" and P = "though". What is the number of character comparisons?
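The naïve search above can be sketched in Python as follows (the function name `naive_search` is my own choice):

```python
def naive_search(text, pattern):
    """Return the smallest shift s with text[s:s+m] == pattern, or -1."""
    n, m = len(text), len(pattern)
    for s in range(n - m + 1):
        j = 0
        # compare pattern to text[s..s+m-1] character by character
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            return s
    return -1
```

For the slide's example, `naive_search("at the thought of", "the")` returns 3.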

## Analysis of Naïve Text Search

- Worst-case: the outer loop runs n − m + 1 times and the inner loop up to m times, for a total of O((n − m + 1)m) = O(nm). What input gives this worst-case behavior?
- Best-case: O(n − m), when the first comparison fails at every shift.
- Completely random text and pattern: O(n − m) expected, since the inner loop almost always stops after a constant number of comparisons.

## Fingerprint Idea

Assume:
- We can compute a fingerprint f(P) of P in O(m) time.
- If f(P) ≠ f(T[s .. s+m−1]), then P ≠ T[s .. s+m−1].
- We can compare fingerprints in O(1).
- We can compute f' = f(T[s+1 .. s+m]) from f = f(T[s .. s+m−1]) in O(1).

## Algorithm with Fingerprints

Let the alphabet Σ = {0,1,2,3,4,5,6,7,8,9}, and let the fingerprint be just the decimal number itself, i.e., f(1045) = 1·10³ + 0·10² + 4·10¹ + 5 = 1045.

```
Fingerprint-Search(T, P)
01 fp ← compute f(P)
02 f ← compute f(T[0..m−1])
03 for s ← 0 to n − m do
04   if fp = f return s
05   f ← (f − T[s]·10^(m−1))·10 + T[s+m]
06 return −1
```

Running time: 2·O(m) + O(n − m) = O(n)! Where is the catch?
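A sketch of the fingerprint search in Python, for digit strings (the name `fingerprint_search` is my own). Note that Python's arbitrary-precision integers hide the catch: arithmetic on m-digit numbers is not really O(1).

```python
def fingerprint_search(text, pattern):
    """Rolling decimal-fingerprint search over digit strings; returns the shift or -1."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    fp = int(pattern)        # f(P), computed in O(m)
    f = int(text[:m])        # f(T[0..m-1])
    high = 10 ** (m - 1)     # weight of the leading digit of the window
    for s in range(n - m + 1):
        if f == fp:
            return s
        if s + m < n:        # slide the window one digit to the right
            f = (f - int(text[s]) * high) * 10 + int(text[s + m])
    return -1
```

For example, `fingerprint_search("2531978", "1978")` returns 3 after rolling through the windows 2531, 5319, 3197, 1978.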

## Using a Hash Function

Problem: we cannot assume that we can do arithmetic on m-digit-long numbers in O(1) time.

Solution: use a hash function h = f mod q.
- For example, if q = 7, h(52) = 52 mod 7 = 3.
- h(S1) ≠ h(S2) implies S1 ≠ S2.
- But h(S1) = h(S2) does not imply S1 = S2! For example, if q = 7, h(73) = 3 = h(52), but 73 ≠ 52.

Basic mod q arithmetic:

- (a + b) mod q = ((a mod q) + (b mod q)) mod q
- (a·b) mod q = ((a mod q)·(b mod q)) mod q

## Preprocessing and Stepping

Preprocessing (Horner's rule):
fp = (P[m−1] + 10·(P[m−2] + 10·(P[m−3] + … + 10·(P[1] + 10·P[0])…))) mod q
In the same way, compute ft from T[0..m−1].
Example: P = 2531, q = 7. What is fp?

Stepping:
ft = ((ft − T[s]·(10^(m−1) mod q))·10 + T[s+m]) mod q
10^(m−1) mod q can be computed once in the preprocessing.
Example: let the text window be 5319, q = 7. What is the corresponding ft?

## Rabin-Karp Algorithm

```
Rabin-Karp-Search(T, P)
01 q ← a prime larger than m
02 c ← 10^(m−1) mod q  // computed by a loop, multiplying by 10 mod q
03 fp ← 0; ft ← 0
04 for i ← 0 to m−1  // preprocessing
05   fp ← (10·fp + P[i]) mod q
06   ft ← (10·ft + T[i]) mod q
07 for s ← 0 to n − m  // matching
08   if fp = ft then  // compare the strings with a loop
09     if P[0..m−1] = T[s..s+m−1] return s
10   ft ← ((ft − T[s]·c)·10 + T[s+m]) mod q
11 return −1
```

How many character comparisons are done if T = 2531978 and P = 1978?

## Analysis

If q is a prime, the hash function distributes m-digit strings evenly among the q values. Thus, only every q-th value of the shift s is expected to result in matching fingerprints (each such match requires comparing the strings, with O(m) comparisons).

Expected running time (if q > m):
- Preprocessing: O(m)
- Outer loop: O(n − m)
- All inner loops: at most ((n − m)/q)·m = O(n − m), since q > m
- Total: O(n + m)
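The Rabin-Karp pseudocode can be sketched in Python for digit strings as follows (the function name and the default prime q = 101 are my own choices; Python's `%` conveniently keeps the stepped value non-negative):

```python
def rabin_karp_search(text, pattern, q=101):
    """Rabin-Karp over digit strings; q should be a prime larger than m."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    c = pow(10, m - 1, q)            # 10^(m-1) mod q
    fp = ft = 0
    for i in range(m):               # preprocessing (Horner's rule mod q)
        fp = (10 * fp + int(pattern[i])) % q
        ft = (10 * ft + int(text[i])) % q
    for s in range(n - m + 1):       # matching
        if fp == ft and text[s:s + m] == pattern:
            return s                 # fingerprints and strings match
        if s + m < n:                # step the window's fingerprint
            ft = ((ft - int(text[s]) * c) * 10 + int(text[s + m])) % q
    return -1
```

With q = 7 and T = 2531978, P = 1978, the first window 2531 already collides with P (both hash to 4), which triggers a full — and failing — string comparison before the true match at s = 3 is found.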

Worst-case running time: O(nm).

## Rabin-Karp in Practice

- If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).
- Choosing a prime q > m can be done with a randomized algorithm in O(m) time, or q can be fixed to be the largest prime such that 10·q fits in a computer word.
- Rabin-Karp is simple and can easily be extended to two-dimensional pattern matching.

## Searching in n Comparisons

The goal: each character of the text is compared only once!

Problem with the naïve algorithm: it forgets what was learned from a partial match! Examples:
- T = "Tweedledee and Tweedledum" and P = "Tweedledum"
- T = "pappappappar" and P = "pappar"

## General Situation

State of the algorithm: checking shift s, q characters of P are matched, then we see a non-matching character in T.

Need to find: the largest prefix of P that is a suffix of the matched part P[0..q−1] followed by the mismatched character T[s+q]:

New q = max{k ≤ q | P[0..k−1] = P[q−k+1..q−1] T[s+q]}

## Finite Automaton Search

Algorithm:
- Preprocess: for each q (0 ≤ q ≤ m−1) and each character a ∈ Σ, precompute the new value of q; call it σ(q, a). This fills a table of size m·|Σ|.
- Run through the text. Whenever a mismatch is found (P[q] ≠ T[s+q]): set s ← s + q − σ(q, T[s+q]) + 1 and q ← σ(q, T[s+q]).

Analysis:
- Matching phase in O(n).
- Too much memory: O(m·|Σ|); too much preprocessing: at best O(m·|Σ|).

## Prefix Function

Idea: forget the unmatched character! (The mismatched position in T is simply compared again at the new shift.)

State of the algorithm: checking shift s, q characters of P are matched, then we see a non-matching character in T.

Need to find:

the largest prefix of P that is a suffix of the matched part P[0..q−1]:

New q = π[q] = max{k < q | P[0..k−1] = P[q−k..q−1]}

## Prefix Table

We can pre-compute a prefix table of size m + 1 to store the values of π[q] (0 ≤ q ≤ m). For P = pappar:

| q      | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|---|
| P[q−1] | – | p | a | p | p | a | r |
| π[q]   | 0 | 0 | 0 | 1 | 1 | 2 | 0 |
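The table can be computed with the standard prefix-function preprocessing; a Python sketch (the function name is my own):

```python
def compute_prefix_table(pattern):
    """pi[q] = length of the longest proper prefix of P that is a suffix of P[0..q-1]."""
    m = len(pattern)
    pi = [0] * (m + 1)               # pi[0] and pi[1] are always 0
    k = 0                            # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k]                # fall back to a shorter prefix
        if pattern[k] == pattern[q]:
            k += 1                   # extend the matched prefix
        pi[q + 1] = k
    return pi
```

For example, `compute_prefix_table("pappar")` returns `[0, 0, 0, 1, 1, 2, 0]`, matching the table above.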

Compute a prefix table for: P = dadadu.

## Knuth-Morris-Pratt Algorithm

```
KMP-Search(T, P)
01 Compute-Prefix-Table(P)
02 q ← 0  // number of characters matched
03 for i ← 0 to n−1  // scan the text from left to right
04   while q > 0 and P[q] ≠ T[i] do
05     q ← π[q]
06   if P[q] = T[i] then q ← q + 1
07   if q = m then return i − m + 1
08 return −1
```

Compute-Prefix-Table is essentially the same KMP search algorithm performed on P itself.
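A runnable sketch of KMP-Search in Python, with the prefix-table computation inlined (names are my own):

```python
def kmp_search(text, pattern):
    """KMP search; returns the smallest matching shift, or -1."""
    n, m = len(text), len(pattern)
    # build the prefix table pi[0..m] (same fall-back logic as the search itself)
    pi = [0] * (m + 1)
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q + 1] = k
    # scan the text from left to right
    q = 0                            # number of pattern characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]                # reuse the longest matched prefix
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            return i - m + 1
    return -1
```

Each text character is examined once going forward; the fall-backs via `pi` are paid for by earlier increments of `q`, which gives the O(n + m) bound.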

## Analysis of KMP

- Worst-case running time: O(n + m). Main algorithm: O(n); Compute-Prefix-Table: O(m).
- Space usage: O(m).

## Reverse Naïve Algorithm

Why not search from the end of P? (Boyer and Moore)

```
Reverse-Naive-Search(T, P)
01 for s ← 0 to n − m
02   j ← m − 1  // start from the end
03   // check if T[s..s+m−1] = P[0..m−1]
04   while j ≥ 0 and T[s+j] = P[j] do
05     j ← j − 1
06   if j < 0 return s
07 return −1
```

The running time is exactly the same as that of the naïve algorithm.

## Occurrence Heuristic

Boyer and Moore added two heuristics to the reverse naïve algorithm to get an O(n + m) algorithm, but it is complex. Horspool suggested using just the modified occurrence heuristic:

After a mismatch, align T[s + m−1] with the rightmost occurrence of that letter in the pattern P[0..m−2]. Examples:
- T = "detective date" and P = "date"
- T = "tea kettle" and P = "kettle"

## Shift Table

In preprocessing, compute the shift table of size |Σ|:

shift[w] = m − 1 − max{i ≤ m − 2 | P[i] = w}, if w occurs in P[0..m−2]
shift[w] = m, otherwise

Example: P = kettle:
shift[e] = 4, shift[l] = 1, shift[t] = 2, shift[k] = 5, shift[any other letter] = 6

Example: P = pappar. What is the shift table?

## Boyer-Moore-Horspool Algorithm

```
BMH-Search(T, P)
01 // compute the shift table for P
02 for c ← 0 to |Σ| − 1
03   shift[c] ← m  // default values
04 for k ← 0 to m − 2
05   shift[P[k]] ← m − 1 − k
06 // search
07 s ← 0
08 while s ≤ n − m do
09   j ← m − 1  // start from the end
10   // check if T[s..s+m−1] = P[0..m−1]
11   while j ≥ 0 and T[s+j] = P[j] do
12     j ← j − 1
13   if j < 0 return s
14   s ← s + shift[T[s + m−1]]  // shift by the last letter of the window
15 return −1
```

## BMH Analysis

Worst-case running time:
- Preprocessing: O(|Σ| + m)
- Searching: O(nm). What input gives this bound?
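BMH-Search can be sketched in Python as follows (the function name is mine; a dictionary stands in for the |Σ|-sized shift table, so absent characters get the default shift of m):

```python
def bmh_search(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, shift by the window's last character."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    shift = {}                        # missing entries mean the default shift m
    for k in range(m - 1):            # rightmost occurrence in P[0..m-2] wins
        shift[pattern[k]] = m - 1 - k
    s = 0
    while s <= n - m:
        j = m - 1                     # start comparing from the end
        while j >= 0 and text[s + j] == pattern[j]:
            j -= 1
        if j < 0:
            return s                  # whole window matched
        s += shift.get(text[s + m - 1], m)
    return -1
```

For example, `bmh_search("tea kettle", "kettle")` returns 4; the dictionary is a design shortcut that trades the O(|Σ|) initialization of the array version for per-lookup hashing.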

- Space: O(|Σ|), independent of m
- Total: O(|Σ| + nm)

On real-world data sets it is very fast.

## Comparison

Let's compare the algorithms. Criteria:
- Worst-case running time
  - Preprocessing
  - Searching
- Expected running time
- Space used
- Implementation complexity