String Matching

An important sub algorithm used in the sequence predictor is the string matching.

The problem consists to search a string s1 inside a string s2 where length of s2 > length s1 .

The best theoretical complexity I know is O(NLogN) . To obtain this complexity is simple I need only to construct the tree for the first sequence and search the second in the first tree .

Ok. This is the theoretical worst case analysis.

From a practical point of view if the second sequence is relatively small is possible to approximate it to a small constant value K and the complexity is more near to O(N) with the simple algorithm with 2 nested cycles , one for each sequence .

The minimum complexity expected is to examine each value of each string  like “for each subs1 in s1  check if subs1 =s2” .

Intuitively seem impossible to have a complexity that check less than one time each element. This is possible in this theoretical way . Construct a set of the elemts of s2 and then check only one subelement of s1 every sizeof(s2) step because for the exact match all the elements of sus1 must be inside the set of elements of s2 if you get only one elements outside the set you can jump for a sizeof(s2) . This trick let to divide the the number of complexity N by a value that depends on the difference of the size of s1 and s2 .

In the theoretical domain to implement a set for the string s2 I need a tree with a construction complexity O(NLogN)  and a complexity of O(LogN) for each check if an element is inside the set .

In a practical point of view the structure of Pc let to use often a useful trick.

The random access memory let to access a Nth cell in O(1) this in many case let to enlarge logarithm base and in this case to verify if an element is inside the set need only O(1) and also the construction of the set need only O(N) steps.

To construct a structure of a set of this type is simple you need only a vector where each element of the set is the index of the vector and the elements of the vector are boolean .

This is not a worst case complexity , this is a best case complexity but in practical point of view using this improvement I get a program about 5 times faster!

To get this improvement it is important to use a little bit statistical stuff.

It is possible to identify 2 different situations : a worst case situation (state1) where elements of s1 are inside s2 and a best case situation (state2) where elements of s1 are not inside of s2 . In the first case we lose all the advantage because we need to compare the substring of s1 with s2 and in the second case we can skip to next group.

To obtain an improvement I need P(state1)

In my case I have

Len(s1)=64 Kbytes

Len(s2)=10 bytes

To Match at bit level the best solution is to use 8 different shifted s1 strings so the set of s2 become

Len(Set(s2))=10*8=80 bytes

using elements of 2 bytes for the Set(s2) need a vector of 65536 bools with a single check probability of 40/65536 .

The bad news is that for each the whole s2 the probability to find one match is 1 – ((1 – 40/65536)^(65536/8))=0.993272 .

This statistical result defeat all the improvement because one worst case check has a very high time consumption .

Increasing the size of the elements of s2 I get better result .

using elements of 3 bytes for the Set(s2) need a vector of 16Mbytes with a single check probability of (80/3)/(2^24)=1.58946*10^-6 and a full sequence check of 1 – ((1 – (80/3)/(2^24))^(65536/8))=0.0129364 .

Is also possible to get bigger difference but increasing the size of the elements of Set(s2) the memory requirements increase exponentially and with 4 bytes elements 4Gbytes is a too much expensive memory resource.

Now I am using a 3 bytes solution with a relatively low memory consumption and a speed up factor of 5 times.

Follow the c++ source code file used to define the Set used for the string matching improvement.

//FILE CmpLookUp.h
//------------------------------------------------------------------------------
//  DENIS   05/2009
//  LookUp Table Management to Speed Up String Matching
//------------------------------------------------------------------------------

/*
Class Objective   ->  SpeedUp average time consumption

TH=2 bytes
Table size = 2^16 =65536 bools

SourceData = 10 Bytes
SourceData One block Shifted        = 11 Bytes
SourceData One block CheckedBytes   = 9  Bytes

Number of SourceData Words Lower Bounds =9/2 =4

Number Of Blocks of ShiftedData     = 8

NumBer of Total Source Words = 8*4  = 32

Match Probability   = 32/2^16   = 2^5 * 2^(-16) = 2^(-11)

65536*8*10  and 65536/4

IMPROVEMENTS

-Instead of using bool use an bit ID , a set where each element in the set
identify the Add group so for each Add it is necessary to associate an ID.
This change let to identify for example wich bit shift level is found .

*/
#ifndef CMPLOOKUP_H
#define CMPLOOKUP_H

//Usage
//TCmpLookUp
//Init(step)
//Add 1
//Add ..
//Add 8

//Check
//Check ...

template<class TH_>
class TCmpLookUp
{
public:

typedef TH_     TH;

TCmpLookUp()
{
WordStep=0;
TableSize=1<<(8*sizeof(TH));
Table=new bool[TableSize];
Reset();
}

~TCmpLookUp()
{
delete [] Table;
}

void Reset()
{
int i;
for (i=0;i
Table[i]=false;
}

void Init(int WordStep_)
{
WordStep=WordStep_;
Reset();
}

//add the set vector to the table
//use this function for each shift value
//attention skip the masked bytes! ( all the possibility cover too mush case and vanish the table with too frequent matchs)
void Add(TH *begin,TH *end)
{
while (begin<end)
{
Table[*begin]=true;
begin++;
}
}

//return the first matching step
//if not found return pass-the-end pointer
TH *Check(TH *begin,TH *end)
{
#ifdef _DEBUG
if (WordStep<=0)
return NULL;
#endif
while (begin<end&&(!Table[*begin]))
{
begin+=WordStep;
}
return begin;
}

private:
bool *Table;
int TableSize;
int WordStep;

};
//------------------------------------------------------------------------------

template<class TH_>
class TCmpLookUpBB
{
public:

typedef unsigned char      TByte;
typedef TH_     TH;         //TH here is an upperbound type

TCmpLookUpBB(int NumBytes_)
{
WordStep=0;
SizeOfTH=NumBytes_;
TableSize=1<<(8*NumBytes_);
UTM=TableSize-1;            //All LSB at 1
Table=new bool[TableSize];
Reset();
}

~TCmpLookUpBB()
{
delete [] Table;
}

void Reset()
{
int i;
for (i=0;i<TableSize;i++)
Table[i]=false;
}

void Init(int WordStep_)
{
WordStep=WordStep_*SizeOfTH;
Reset();
}

//add the set vector to the table
//use this function for each shift value
//attention skip the masked bytes! ( all the possibility cover too mush case and vanish the table with too frequent matchs)
void Add(TByte *begin,TByte *end)
{
while (begin<end)
{
Table[(*((TH*)begin))&UTM]=true;
begin+=SizeOfTH;
}
}

//return the first matching step
//if not found return pass-the-end pointer
TByte *Check(TByte *begin,TByte *end)
{
#ifdef _DEBUG
if (WordStep<=0)
return NULL;
#endif
while (begin<end&&(!Table[(*((TH*)begin))&UTM ]))
{
begin+=WordStep;
}
return begin;
}

private:
bool *Table;
int TableSize;
int WordStep;   //NumberOfBytes
int SizeOfTH;
int UTM;        //UpperTypeMask
};

#endif
Advertisements

One thought on “String Matching

  1. Pingback: Improvements with new hardware « Breakingkeyboards's Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s