String Matching Problem:
Suppose a Text is an array T[1..n] of length n and pattern is also an array P[1..m] (m <= n). We say pattern P occurs with shift s in Text T if 0 <= s <= n-m and T[s+1..s+m] = P[1..m] i.e. T[s + j] = P[j] for all 1 <= j <=m.
If P occurs with shift s in T the s is a valid shift otherwise it is an invalid shift.The string matching problem is the problem of finding all valid shifts with which a given Pattern P occurs in a given Text T.
Prefix : W is a prefix of string x if x = wy for some string y.
Suffix : W is suffix of string x if x = yw for some string y.
xy – string concatenation | x | - length of string x
Knuth-Morris-Pratt Algorithm:
Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with elements of T that have previously been involved in comparison with some elements of the pattern P to be matched. i.e., backtracking on the string T never occurs.
Components of KMP Algorithm:
Prefix Function Π: Prefix function Π for a pattern encapsulate knowledge about how the pattern matches against shifts of itself.
Knowing these q text characters allows us to determine immediately that certain shifts are invalid since the first character a would be aligned with a text character that is known to match with second pattern character b. So next shift should be s = s + 2.
Given P[1..q] matches T[s+1..s+q] what is the least shift s1 > s such that
P[1..k] = T[s1+1 . . s1+k] where s1 + k = s + q
Prefix function Π [q] = max ( k : k < q and Pk is suffix of Pq ) where Pk is P[1..k] i.e. " Π[q] is the length of longest prefix of P that is a proper suffix Pq ."
My implementation of Prefix function is as follows:
int* Prefix(char *pattern)
{
int len = strlen(pattern);
int *pre = new int[len];
pre[0] = 0;
int k = 0;
for(int i=1; i<len; ++i)
{
while(k>0 && pattern[k] != pattern[i])
k = pre[k];
if(pattern[k] == pattern[i])
++k;
pre[i] = k;
}
return pre;
}
KMP Matcher: With text T, pattern P and prefix function Π as inputs, it finds all the occurrence of P in T and returns the number of shifts of P after which occurrence is found.
Implementation of KMP Match function is as follows:
bool Match(char *text, char *pattern, vector<int>& shifts)
{
int textLength = strlen(text);
int patrnLength = strlen(pattern);
int* prefixArray = Prefix(pattern); // Π function
bool found = false;
int q = 0;
for(int i=0; i<textLength; ++i)
{
while( q>0 && pattern[q] != text[i])
q = prefixArray[q-1]; //Next character does not macth
if(pattern[q] == text[i])
++q; //Next character Matched
if(q == patrnLength) //All matched
{
shifts.push_back(i-patrnLength);
found = true;
q = prefixArray[q-1]; //Look for next match
}
}
return found;
}
No comments:
Post a Comment