# Advanced Algorithm Design and Analysis (Lecture 1)

Advanced Algorithm Design and Analysis (Lecture 3) SW5 fall 2004 Simonas altenis E1-215b [email protected] Text-search Algorithms Goals of the lecture: Naive text-search algorithm and its analysis; Rabin-Karp algorithm and its analysis; Knuth-Morris-Pratt algorithm ideas; Boyer-Moore-Horspool algorithm and its analysis. Comparison of the advantages and disadvantages of the different text-search algorithms. AALG, lecture 3, Simonas altenis, 2004 2 Text-Search Problem

Input: Text T = at the thought of n = length(T) = 17 Pattern P = the m = length(P) = 3 Output: Shift s the smallest integer (0 s n m) such that T[s .. s+m1] = P[0 .. m1]. Returns 1, if no such s exists 0123 n-1 at the thought of s=3 the 012 AALG, lecture 3, Simonas altenis, 2004

3 Nave Text Search Idea: Brute force Check all values of s from 0 to n m Naive-Search(T,P) 01 for s 0 to n m 02 j 0 03 // check if T[s..s+m1] = P[0..m1] 04 while T[s+j] = P[j] do 05 j j + 1 06 if j = m return s 07 return 1 Let T = at the thought of and P = though What is the number of character comparisons? AALG, lecture 3, Simonas altenis, 2004

4 Analysis of Nave Text Search Worst-case: Best-case: n-m Outer loop: n m Inner loop: m Total (nm)m = O(nm) What is the input the gives this worst-case behaviuor? When? Completely random text and pattern: O(nm) AALG, lecture 3, Simonas altenis, 2004

5 Fingerprint idea Assume: We can compute a fingerprint f(P) of P in O(m) time. If f(P)f(T[s .. s+m1]), then P T[s .. s+m1] We can compare fingerprints in O(1) We can compute f = f(T[s+1.. s+m]) from f(T[s .. s+m1]), in O(1) f f AALG, lecture 3, Simonas altenis, 2004 6 Algorithm with Fingerprints Let the alphabet ={0,1,2,3,4,5,6,7,8,9} Let fingerprint to be just a decimal number, i.e., f(1045) = 1*103 + 0*102 + 4*101 + 5 = 1045

Fingerprint-Search(T,P) 01 02 03 04 05 06 fp compute f(P) f compute f(T[0..m1]) for s 0 to n m do if fp = f return s f (f T[s]*10m-1)*10 + T[s+m] return 1 T[s] new f f T[s+m] Running time 2O(m) + O(nm) = O(n)! Where is the catch? AALG, lecture 3, Simonas altenis, 2004

7 Using a Hash Function Problem: we can not assume we can do arithmetics with m-digits-long numbers in O(1) time Solution: Use a hash function h = f mod q For example, if q = 7, h(52) = 52 mod 7 = 3 h(S1) h(S2) S1 S2 But h(S1) = h(S2) does not imply S1=S2! For example, if q = 7, h(73) = 3, but 73 52 Basic mod q arithmetics:

(a+b) mod q = (a mod q + b mod q) mod q (a*b) mod q = (a mod q)*(b mod q) mod q AALG, lecture 3, Simonas altenis, 2004 8 Preprocessing and Stepping Preprocessing: fp = P[m-1] + 10*(P[m-2] + 10*(P[m-3]+ + 10*(P[1] + 10*P[0]))) mod q In the same way compute ft from T[0..m-1] Example: P = 2531, q = 7, what is fp? Stepping: ft = (ft T[s]*10m-1 mod q)*10 + T[s+m]) mod q 10m-1 mod q can be computed once in the preprocessing Example: Let T[] = 5319, q = 7, what is the

corresponding ft? T[s] new ft ft AALG, lecture 3, Simonas altenis, 2004 T[s+m] 9 Rabin-Karp Algorithm Rabin-Karp-Search(T,P) 01 q a prime larger than m 02 c 10m-1 mod q // run a loop multiplying by 10 mod q 03 fp 0; ft 0 04 for i 0 to m-1 // preprocessing 05 fp (10*fp + P[i]) mod q 06 ft (10*ft + T[i]) mod q 07 for s 0 to n m // matching 08 if fp = ft then // run a loop to compare strings 09 if P[0..m-1] = T[s..s+m-1] return s 10 ft ((ft T[s]*c)*10 + T[s+m]) mod q 11 return 1 How many character comparisons are done if

T = 2531978 and P = 1978? AALG, lecture 3, Simonas altenis, 2004 10 Analysis If q is a prime, the hash function distributes m-digit strings evenly among the q values Expected running time (if q > m): Thus, only every q-th value of shift s will result in matching fingerprints (which will require comparing stings with O(m) comparisons) Preprocessing: O(m) Outer loop: O(n-m) All inner loops: n m Total time: O(n-m) q m O n m

Worst-case running time: O(nm) AALG, lecture 3, Simonas altenis, 2004 11 Rabin-Karp in Practice If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm). Choosing prime q > m can be done with randomized algorithms in O(m), or q can be fixed to be the largest prime so that 10*q fits in a computer word. Rabin-Karp is simple and can be easily extended to two-dimensional pattern matching. AALG, lecture 3, Simonas altenis, 2004 12 Searching in n comparisons

The goal: each character of the text is compared only once! Problem with the nave algorithm: Forgets what was learned from a partial match! Examples: T = Tweedledee and Tweedledum and P = Tweedledum T = pappar and P = pappappappar AALG, lecture 3, Simonas altenis, 2004 13 General situation q State of the algorithm: Checking shift s, q characters of P are

matched, we see a non-matching character in T. Need to find: Largest prefix P- such that it is a suffix of P[0..q-1] P: T: T[s] q P: P[0..q-1]: q New q = max{k q | P[0..k1] = P[qk+1..q1]} AALG, lecture 3, Simonas altenis, 2004 14 Finite automaton search

Algorithm: Preprocess: For each q (0 q m-1) and each precompute a new value of q. Lets call it (q,) Fills a table of a size m|| Run through the text Whenever a mismatch is found (P[q] T[s+q]): Set s = s + q - (q,) + 1 and q = (q,) Analysis: Matching phase in O(n) Too much memory: O(m||), two much preprocessing: at best O(m||). AALG, lecture 3, Simonas altenis, 2004 15 Prefix function

Idea: forget unmatched character ()! State of the algorithm: Checking shift s, q characters of P are matched, we see a non-matching character in T. Need to find: q P: T: T[s] q compare this again P: T[s..s+q]:

Largest prefix P- such that q it is a suffix of P[0..q-1] New q = [q] = max{k < q | P[0..k1] = P[qk..q1]} AALG, lecture 3, Simonas altenis, 2004 16 Prefix table We can pre-compute a prefix table of size m to store values of [q] (0 q m) P p a p p a r

q 0 1 2 3 4 5 6 [q] 0 0 0 1 1 2 0

Compute a prefix table for: P = dadadu AALG, lecture 3, Simonas altenis, 2004 17 Knuth-Morris-Pratt Algorithm KMP-Search(T,P) 01 02 03 04 05 06 07 08 Compute-Prefix-Table(P) q 0 // number of characters matched for i 0 to n-1 // scan the text from left to right while q > 0 and P[q] T[i] do q [q] if P[q] = T[i] then q q + 1 if q = m then return i m + 1 return 1 Compute-Prefix-Table is the essentially the same KMP search algorithm performed on P.

AALG, lecture 3, Simonas altenis, 2004 18 Analysis of KMP Worst-case running time: O(n+m) Main algorithm: O(n) Compute-Prefix-Table: O(m) Space usage: O(m) AALG, lecture 3, Simonas altenis, 2004 19 Reverse nave algorithm Why not search from the end of P? Boyer and Moore Reverse-Naive-Search(T,P) 01 for s 0 to n m

02 j m 1 // start from the end 03 // check if T[s..s+m1] = P[0..m1] 04 while T[s+j] = P[j] do 05 j j - 1 06 if j < 0 return s 07 return 1 Running time is exactly the same as of the nave algorithm AALG, lecture 3, Simonas altenis, 2004 20 Occurrence heuristic Boyer and Moore added two heuristics to reverse nave, to get an O(n+m) algorithm, but its complex Horspool suggested just to use the modified occurrence heuristic:

After a mismatch, align T[s + m1] with the rightmost occurrence of that letter in the pattern P[0..m2] Examples: T= detective date and P= date T= tea kettle and P= kettle AALG, lecture 3, Simonas altenis, 2004 21 Shift table In preprocessing, compute the shift table of the size ||. m 1 max{i m 1| P[i ] w} if w is in P[0..m 2] shift[ w] otherwise m Example: P = kettle

shift[e] =4, shift[l] =1, shift[t] =2, shift[t] =5 shift[any other letter] = 6 Example: P = pappar What is the shift table? AALG, lecture 3, Simonas altenis, 2004 22 Boyer-Moore-Horspool Alg. BMH-Search(T,P) 01 01 02 03 04 05 06 07 08 09 10 11 12 13 14 // compute the shift table for P for c 0 to ||- 1

shift[c] = m // default values for k 0 to m - 2 shift[P[k]] = m 1 - k // search s 0 while s n m do j m 1 // start from the end // check if T[s..s+m1] = P[0..m1] while T[s+j] = P[j] do j j - 1 if j < 0 return s s s + shift[T[s + m1]] // shift by last letter return 1 AALG, lecture 3, Simonas altenis, 2004 23 BMH Analysis Worst-case running time Preprocessing: O(||+m) Searching: O(nm) What input gives this bound?

Space: O(||) Total: O(nm) Independent of m On real-world data sets very fast AALG, lecture 3, Simonas altenis, 2004 24 Comparison Lets compare the algorithms. Criteria: Worst-case running time Preprocessing Searching

Expected running time Space used Implementation complexity AALG, lecture 3, Simonas altenis, 2004 25

## Recently Viewed Presentations

• Johns Hopkins University Research. By: Kate Towsen and Lauren McIntosh. ... the first to use rubber gloves during surgery. ... Neurology at Johns Hopkins has a long-standing history with its initial founders counting Doctors Frank Ford and David Clark. The...
• European Periods of Floral Design Renaissance Period Baroque Period Flemish Period French Styles English Georgian Period Victorian Period Arrangements were large, tall, pyramidal and symmetrically balanced Arrangement was twice the height of container Flowers were loose, airy and uncrowded Symmetrical...
• Pendant la période de guerre on ne réussissait à avoir du café voilà pourquoi il était remplacé par l'orge ou par les pois chiches brûlés et moulus, telle poudre était ensuite bouillie dans un petit pot avec de l'eau et...
• Apnea del Sueño y test de sueño Dr. Pedro Serrano MD, PhD, FESC 2008 Apnea del sueño Hoja de información para pacientes. Posibles consecuencias de la apnea del sueño: Hipertensión arterial, infarto agudo de miocardio.
• "Por los pecados de Manasés" "Because of the sins of Manasseh" Otro ejemplo: Saulo de Tarso. Sus segui-dores entre los judíos no sólo seguían persiguiendo a los cristianos, sino también a Pablo mismo. Another example: Saul of Tarsus. His followers...
• arch 1290 architectural caddarch 1200 architectural drawing ii. professor paul king. it's time for research! 2. team 5
• Tools for Telling Stories - News Content Types. The "News Block" Content Type. Can be used on landing and inner pages (but better on inner) Displays news content on your site (make sure to use this on a section at...
• 3rd IMPACT Workshop UCL, Louvain-la-Neuve, Belgium November 6 - 7, 2003 PARMA UNIVERSITY SIMULATIONS OF THE ISOLATED BUILDING TEST CASE F. AURELI, A. MARANZONI & P. MIGNOSA