I have made many improvements since the first implementation of the exact string matching algorithm, especially its parallelization in the CUDA implementation, but now it is time for another round of speed-ups.
In the best case the program runs the search in O(M/N) (where M is the length of the text and N the length of the pattern), so the idea is to maximize the frequency of this best case.
To perform the search I build a boolean lookup table: using each word as an index, I set the entry to true when that word occurs inside the string to search (the pattern); then I scan the first string (the text), looking up its words in the table.
So to maximize the speed of the algorithm I need to maximize the word size. Currently I use two different word sizes, 3 bytes and 4 bytes: the first for very short strings and the second in all other cases.
The lower limit for using 4-byte words was a 10-byte string, because the first and the last byte are subject to masking due to the 8-bit shifts needed to handle every bit alignment inside the bytes, and I needed 2 words to cover the possible word misalignments.
This was not the best limit, because I can enlarge the step used while searching the words in the first string. The max step in bytes is:
So the limit is now 6 bytes: starting from a string size of 6 bytes I can use a 4-byte word map.
Implementing this on CUDA there is a problem when the search step is smaller than the word size, i.e. less than 4 bytes: there are conflicts in memory access (multiple threads read the same memory) and this slows down the computation.
To avoid this problem I use more than one parallel job: every job searches with a step of one word, and each job has a different starting point in the search, a phase that depends on the job index.
After the implementation I ran some tests, and with 7-byte strings the program now runs 2x faster!
P.S. It is incredible: every time I put my hands in the code I get a program that is 2x faster or better … is there an end to this process?