Exact string matching

A wordsearch

Image via Wikipedia

I made a lot of improvements since the first implementation of the exact string matching algorithm especially for its parallellization with the cuda implementation but now is time to make some speed up again.

In the best case the program run the search in O(M/N) so the idea is to maximize the frequency of this best case .

To make the search I build a look up boolean table with for each word as index I set true when the word is inside the string to search then I search the words through the first string.

So to maximize the speed of the algorithm I need to maximize the size of the word . Now I use 2 different word size , 3 bytes and 4 bytes . The first for very short strings and the second in all the other cases.

The limits in the usage of 4 bytes word was 10 bytes because the first byte and the last are subject to mask due to the 8 bits shift to manage all the bits possibilities inside the bytes and I need 2 word to for the word misalignment possibility.

This was not the best limit because I can enlarge the step in the research of the words in the first string. The max step in bytes is :

(INT(|S|/|W|)-1)*|W|+( |S| mod |W| )+1

So the limit now is 6 bytes for each string starting from a size of 6 bytes I can use a 4 bytes word map .
Implementing this on cuda there is a problem when the step in the research is less than the word size , is less than 4 bytes because there are conflicts in the memory access ( more threads access the same memory for read ) and this slow down the computation.
To avoid this problem I use more than one parallel job every job with a research step of a word and each job a different starting point in the research , a phase that depend on ( |S| mod |W| )+1

After the implementation I made some test and with 7 bytes strings the program now run 2x time faster!

p.s. it is incredible every time I put the hands in the code I get a 2x faster program or better … is there an end to this process?

A first Test on Optix

Created by Jason Hise with Maya and Macromedia...

Image via Wikipedia

Probably nowadays we are ready to widely use realtime rendering applications instead of old rasterization tricks.

A very interesting tool to investigate in this possibility is Optix by nvidia that give a great flexibility level and also give a standard let you to go in the correct direction.

One first interesting example can be the tutorial of the sdk and I could not resist to modify the realtime ray tracing to a progressive improvement raytracing .

Why only one ray for a pixel ?

I want one ray for one pixel in the first frame and for every next frame ( without a changing point of view of the camera )  I want to improve the rendered image .

So I build a progressive ray tracing where I fill a pixel changing the point inside the pixel for the ray filling the square of the pixel with a fractal routine .

In a next post I will post the source code.

This is the image with one ray for pixel

And this is the image after a few frames

and this is a link to the binary.

145 millions rows

Cellular automaton

Image via Wikipedia

I am again experimenting with rule 30 I reach 145 *10^6  rows of evolution .

Evolving 2*10^16 cells!

So for everyone wants to know what happen after 145 millions I save some images in BMP format with 8192×2048 pixel for each image.

This is the link to the gallery here

I am sorry there is not a title for the images and they are not correctly sorted but the name of the images is in the format “1_POS.bmp” where POS is the byte number from the left to the right.

Experimenting compression

Informationstheorie - Maximale Entropie; infor...

Image via Wikipedia

Times ago I made some test on the compression using my inference engine and now I recollect the statistical data.

First of all my objective is not to develop a compressor because this need an estimation of “compression ratio”-”files types”-”compression time” , it is not to develop a compressor for “random data” because random data is not compressible by definition , it is not to compress exclusively one specific file because is always possible to compress one specific file to 0 and it is not about perpetual compression.

My objective is to evaluate how much good my inference engine is.

To do this a well and perhaps the best one empirical test is to compress pseudo-random files .

So I  tried to compress the famous million random digit file from the Mark Nelson blog .

I tried to split the file in blocks of 5 bytes each one and then I run the inference engine trying to compress each block in a 36 bits ( in other words the sequence predictor must predict correctly 4 bits in a sequence of 40 bits ).
Sometimes the compressor give me the compressed sequence and sometimes it give me no result for simplicity I define “1″ when the compressor find the sequence and “0″ when the compressor is not able to find the sequence.
If the compressor and the blocks are “statistically independent”  ( and I hope it is not the case )  the frequency of the compressed sequences must be
2^{36} / 2^{40} =2^{-4}=1/16 .

I run the compressor and my frequency is 1/12!

It seem a great result!  The problem is how to code the data without losing everything?
The idea is to write the compressed sequences where the compressor find a 36 bits sequence and the original 40 bits when the compressor is not able to find the sequence. So the problem is how to identify compressed sequences from original sequences.

The million random digit file has a size of 415241 bytes so I use  83048 blocks of 5 bytes and I can code about 6920 blocks with 36 bits instead of 40 bits so I can use only less than 6920*4=27680 bits to identify the position of 6920 compressed blocks and I want to compress at least 1 byte so my limit is 27672 bits . I can not use more than 27672 bits.
With this value I can not add simply one flag bit in the front of each block because this require 83048 bits.
A first idea is to write a header part where I write one bit for each block where “1″ mean compressed block and “0″ not compressed block and then compress the sequence of 83048 bits but I need a compression ratio of 27672/83048≈1/3.
By the entropy I have -((1/12)*Log[2, (1/12)] + (11/12)*Log[2, (11/12)]) =0.41 so maximizing the entropy I can not reach enough compression .

A way to reach max entropy is using an arithmetic coding or also a binomial enumeration for example with the combinadic program.

Using Mathematica I answer to some questions like which frequency let me coding the compression?

N[FindInstance[-((x)*Log[2,(x)]+(1-x)*Log[2,(1-x)])==(83048*x*4)/83048,x]]

N[FindInstance[-((x)*Log[2,(x)]+(1-x)*Log[2,(1-x)])==(x*4),x]]

0.156417354964272015502158337639=1/6.66667

How much can I compress with a frequency of 1/6 ?

p=1/6

N[(83048*p*4-Log[2,Binomial[83048,83048*p]])/8]=173 bytes

And with 1/7 ?

p=1/7

N[(83048*p*4-Log[2,Binomial[83048,83048*p]])/8]=-209 bytes

Concluding this is only one problem to deal with before to claim that it is possible to compress the million digit , another is to be sure to get a 1/12 frequency because I test only 1/100 blocks and it is really too small .

However it seem incredible to me that I need a frequency of 1/6 to be able to code the compression! I need to be able to predict correctly 4 bits in a sequence of 36 bits against a probability of 1/16, it seem a deficiency in the coding phase but the entropy tell me the contrary it is not possible to coding out this dependence and also doing this more practically with the binomial enumeration I can gain bits only with a frequency of 1/6!

This does not mean that having a frequency of 1/12 I can not code a compression because if I have for example all the “1″ bits together it is possible to compress the bits and I can make a total compression but this mean I need a not uniform distribution of compressible blocks but my experiment is on the compression of a pseudo-random bits so I suppose this is not the case and if the distribution let me to make compression  it is not interesting for me and probably I need to use another file as test.

So my final question is : is it really true that is not possible to code efficiently this statistical dependence?

Improvements with the new fermi hardware

How can I improve the CEV predictor with the new NVIDIA 480 ?

One possibility is to increase the size of the table I use in the compare session where now in the 8800 GT I use a table of 24 bit index of boolean with a matching probability of about 1 – ((1 – (80/3)/(2^(8*3)))^((2^16)/8)) (see String Matching ) .

With the new card I can use a 2^32 booleans table using 2^(32-3) =536870912 bytes of memory ( 3 bits to manage 1 byte )  available on this card so the matching probability become 1 – ((1 – (80/4)/(2^(8*4)))^((2^16)/8)) and to obtain the same matching probability I can increase the size of the block from 2^16 to 2^25 so I can exam 2^(25-16) =512 times data for each cycle .

This does not means that I can improve by a factor of 512 but that I can reduce slow operations by a factor of 512 .

My New PC

It is time to change my 5 years old PC.

This is my new computational machine !

  • case  Cooler Master HAF-X
  • motherboard Asus with usb 3.0 and sata 3
  • cpu Intel i7 930
  • ram 12GB Kingston 1333
  • videocard NVIDIA GeForce 480
  • harddisk Kingston SSD 128GB

Self Improvements ( the program that change itself ! )

A changing program is an ill definition!

A program never change.

I found always over the years people speaking about a “changing program”  something like a magic thing with the ability to improve itself .

How can a program change ? You can define how a program change using another program so we are speaking about U(yx) where y is a language and what really change is the program x the program in the language y is there something more than a universality ? No .

Every universal program has a parameter and if the parameter change its value we are watching an “automatic changing program”. The only thing you gain with an “automatic changing program” is to save the space of the code of the program that is a constant C so also space complexity never change .

So when someone is speaking about a changing program simply does not understand how the program change , does not understand which is the real program.

When you see a changing program remember you are watching the wrong program! Your are watching a state variation.

Universality and Entropy

One problem of the universality is the possibility  to have universal programs not able to output some bit strings.

For some universal programs is necessary a post processor , a conversion to let the program able to output some type of bit strings.

An example can be the first “universal” rule of cellular automata . But in general is possible to construct a universal program that work with numbers “01″ and “00″ instead of “0″ and “1″ we can literally replace the first values with the second and with this exchange we don’t  lose the universality .

The problem is that in an inverse research a universal program working with “01″ and “oo” is not good as the “0″ , “1″ because we can have binary string like “1111″ that are not representable by the first program .

For this problem the entropy can help because a way to choose a better universal program is to watch at its entropy and with a program with high entropy is possible to output more bit strings combinations,  so the entropy can be used as a good parameter for a classification of programs.

Project Road Map

1) Implementing the engine on CUDA / THRUST  ( 100 time faster ?!)

2) 4GBytes LookUp Tables for CPU Engine ( for a total of ~8GBytes used )

2) Implementing an external new User Interface ( perhaps using Optix )

3) Implementing the engine using OpenMP for Multi Core CPU

4) Converting in native 64 Bits