Experimenting with compression


Some time ago I ran some compression tests using my inference engine, and now I have collected the statistical data.

First of all, my objective is not to develop a compressor, because that would require estimating the trade-offs between compression ratio, file types and compression time. It is not to compress “random data”, because random data is incompressible by definition. It is not to compress exclusively one specific file, because it is always possible to compress one specific file down to 0. And it is not about perpetual compression.

My objective is to evaluate how good my inference engine is.

To do this, a good empirical test, and perhaps the best one, is to compress pseudo-random files.

So I tried to compress the famous million random digit file from Mark Nelson’s blog.

I split the file into blocks of 5 bytes each, and then I ran the inference engine, trying to compress each block into 36 bits (in other words, the sequence predictor must correctly predict 4 bits of the 40-bit sequence).
Sometimes the compressor gives me the compressed sequence and sometimes it gives me no result. For simplicity I write “1” when the compressor finds the sequence and “0” when it does not.
If the compressor and the blocks are “statistically independent” (and I hope that is not the case), the frequency of compressed sequences must be
$2^{36} / 2^{40} = 2^{-4} = 1/16$.

I ran the compressor and my frequency is 1/12!

It seems a great result! The problem is: how do I encode the data without losing everything?
The idea is to write the compressed 36-bit sequence where the compressor finds one, and the original 40 bits where the compressor fails. So the problem is how to distinguish compressed sequences from original ones.

The million random digit file has a size of 415241 bytes, so I use 83048 blocks of 5 bytes, and I can encode about 6920 blocks with 36 bits instead of 40 bits. That saves 6920*4 = 27680 bits, and since I want to compress the file by at least 1 byte, my limit is 27680 − 8 = 27672 bits: I cannot spend more than 27672 bits to identify the positions of the 6920 compressed blocks.
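The bit budget above can be checked with a few lines of Python (my own sketch; the file size, block size, 1/12 frequency and 1-byte target are taken from the text):

```python
# Bit budget for flagging which 5-byte blocks were compressed.
FILE_SIZE = 415241            # bytes in the million random digit file
BLOCK = 5                     # bytes per block

blocks = FILE_SIZE // BLOCK   # 83048 full blocks
compressed = blocks // 12     # ~6920 blocks hit at frequency 1/12
saved_bits = compressed * 4   # each hit saves 40 - 36 = 4 bits
budget = saved_bits - 8       # leave at least 1 byte of net gain

print(blocks, compressed, saved_bits, budget)
```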
With this budget I cannot simply put one flag bit in front of each block, because that would require 83048 bits.
A first idea is to write a header with one bit per block, where “1” means a compressed block and “0” a non-compressed block, and then compress that sequence of 83048 bits; but I would need a compression ratio of 27672/83048 ≈ 1/3.
By the entropy I have -((1/12)*Log[2, 1/12] + (11/12)*Log[2, 11/12]) ≈ 0.41 bits per flag, so even a coder that reaches the entropy limit cannot compress the header enough (0.41 > 1/3).
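A quick check of this bound in Python (the binary entropy function is standard; the 1/12 frequency and the 27672-bit budget come from the text):

```python
from math import log2

def H(p):
    # binary entropy: bits needed per flag, at best
    return -(p * log2(p) + (1 - p) * log2(1 - p))

per_flag_budget = 27672 / 83048   # ~0.333 bits available per block flag
print(H(1 / 12), per_flag_budget) # ~0.414 > ~0.333: the header doesn't fit
```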

A way to reach the entropy limit is to use arithmetic coding, or a binomial enumeration, for example with the combinadic program.
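As a sketch of the binomial-enumeration idea (my own minimal Python version, not the combinadic program mentioned above): a flag sequence with k ones among n blocks is mapped to an integer in [0, C(n,k)), so it costs only about log2 C(n,k) bits plus the value of k.

```python
from math import comb

def rank(positions):
    """Combinadic rank of a sorted tuple of '1' positions (colex order)."""
    return sum(comb(c, i + 1) for i, c in enumerate(positions))

def unrank(r, k):
    """Inverse: recover the k positions from the rank r."""
    positions = []
    for i in range(k, 0, -1):
        c = i - 1
        while comb(c + 1, i) <= r:   # largest c with comb(c, i) <= r
            c += 1
        positions.append(c)
        r -= comb(c, i)
    return positions[::-1]

# Round trip on a small example: blocks 2, 5 and 11 are the compressed ones
pos = (2, 5, 11)
print(rank(pos), unrank(rank(pos), 3))
```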

Using Mathematica I answered some questions, such as: which frequency would let me encode the compression profitably?

N[FindInstance[-((x)*Log[2,(x)]+(1-x)*Log[2,(1-x)])==(83048*x*4)/83048,x]]

N[FindInstance[-((x)*Log[2,(x)]+(1-x)*Log[2,(1-x)])==(x*4),x]]

0.156417354964272015502158337639 ≈ 1/6.39
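The same root can be found without Mathematica; here is a simple bisection in Python for the equation H(x) = 4x from the FindInstance call above (the root is unique on this interval because H'(x) < 4 there):

```python
from math import log2

def H(p):
    # binary entropy in bits
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Solve H(x) = 4*x for the nontrivial root between 1/16 and 1/2
lo, hi = 1 / 16, 1 / 2
for _ in range(60):
    mid = (lo + hi) / 2
    if H(mid) - 4 * mid > 0:
        lo = mid
    else:
        hi = mid

print(lo)   # ~0.156417, i.e. about 1/6.39
```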

How much can I compress with a frequency of 1/6?

p=1/6

N[(83048*p*4-Log[2,Binomial[83048,83048*p]])/8]=173 bytes

And with 1/7?

p=1/7

N[(83048*p*4-Log[2,Binomial[83048,83048*p]])/8]=-209 bytes
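Both figures can be reproduced in Python; I use lgamma for the generalized binomial coefficient, which is my assumption about how Mathematica evaluates Binomial at the non-integer argument 83048*p:

```python
from math import lgamma, log

N = 83048   # 5-byte blocks in the file

def log2_binom(n, k):
    # log2 of the generalized binomial coefficient via the Gamma function
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def net_bytes(p):
    # bits saved by the hits minus bits to enumerate their positions
    return (N * p * 4 - log2_binom(N, N * p)) / 8

print(net_bytes(1 / 6))   # positive: a net gain of roughly 173 bytes
print(net_bytes(1 / 7))   # negative: a net loss of roughly 209 bytes
```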

To conclude: this is only one of the problems to deal with before claiming that it is possible to compress the million digit file. Another is to be sure of the 1/12 frequency, because I tested only 1/100 of the blocks, and that sample is really too small.

However, it seems incredible to me that I need a frequency of 1/6 to be able to encode the compression! I can correctly predict 4 bits of a 40-bit sequence with frequency 1/12, against an expected 1/16; it looks like a deficiency in the coding phase, but the entropy tells me the contrary: it is not possible to code away this dependence. And doing it more practically, with the binomial enumeration, I gain bits only with a frequency of about 1/6!

This does not mean that with a frequency of 1/12 I can never encode a compression: if, for example, all the “1” bits were clustered together, it would be possible to compress the flag bits and obtain a net compression. But that requires a non-uniform distribution of compressible blocks, and since my experiment is about compressing pseudo-random bits I suppose this is not the case; and if the distribution did let me compress, it would not be interesting for me, and I would probably need to use another test file.
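A quick sketch of why clustering changes everything (my own arithmetic; the 6920 and 83048 figures are from the text): uniformly scattered flags need the full binomial enumeration, while one contiguous run of flags only needs its start position.

```python
from math import log2, lgamma, log

N, K = 83048, 6920   # blocks, compressible blocks

# Uniformly scattered flags: cost of the full binomial enumeration
scattered = (lgamma(N + 1) - lgamma(K + 1) - lgamma(N - K + 1)) / log(2)

# All flags in one contiguous run of known length: start position suffices
clustered = log2(N)

print(scattered)   # ~34,000 bits: more than the 27680 bits saved
print(clustered)   # ~16 bits: almost free
```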

So my final question is: is it really true that it is not possible to encode this statistical dependence efficiently?