# Deep Learning: The Limits

Nowadays there is a new name going around in the AI field: “deep learning”, the new revolutionary algorithm to solve everything… I am bored of reading deep, deeep, deeeep everywhere. Why deep? It is large, not deep! It is only a new name for old stuff running on more powerful hardware. Let me explain where the troubles are. There are 2 main problems, or limits, in what deep learning can achieve:

1. The fitness of the objective
2. The limited computing power of the agent (is it only a tree? Is it universal?)

(1) The learning phase needs a function that says how good the behavior of the agent is and, as the NFL (No Free Lunch) theorems say, it is not possible to solve the problem for every fitness; it is possible only for special fitness functions (if the fitness does not increase gradually, I think the trainer is in trouble).

(2) As every programmer knows, there are very different programs and very different problems, and it is difficult to have one program as a solution that works well for different problems. An agent is a program, and if it is a NN with many layers it is something like a tree: only one step forward from a table. A tree-program is very limited. It is difficult to run a SAT solver implemented as a tree… (so don’t try to teach a deep learning system how to solve SAT). Perhaps the “deep learning” uses an RNN; OK, here we have a graph, another step forward, but again, with which complexity? You cannot solve problems of high complexity with low-complexity programs… If you want, you can do the opposite: for example, it is possible to implement a sorting solver using a SAT solver. But solving a higher-complexity problem with a lower-complexity program is really a bad idea, and in an older post I explain why.

# Lines of Code

How is it possible to measure how much work there is in a program? How much work is needed to build a program?

It is clear: the solution is its Kolmogorov complexity. This value tells you exactly how much work there is in the program. There is only one problem: it is not computable.

The interesting thing is that, from a strictly pragmatic point of view, the size of the program (the lines of source code) is a good approximation of the K.C., so the simple idea of looking at the size of the program is not such a bad idea. The problem arises if the programmer knows about this measure: in that case the program will be written to be as long as possible, and the size is no longer a good approximation.

The conclusion is that the size of a program is a good measure as long as the program is not written to be long.
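The effect of padding can be seen with a rough experiment. The sketch below is only an illustration: it uses the compressed size given by Python's `zlib` as a computable stand-in for the K.C. (an upper bound, not the real value), and both the program text and the padding are invented for the example.

```python
import zlib

def line_count(source: str) -> int:
    """Naive size measure: number of lines of source code."""
    return len(source.splitlines())

def compressed_size(source: str) -> int:
    """Computable upper-bound proxy for the K.C.: size after zlib compression."""
    return len(zlib.compress(source.encode("utf-8"), 9))

# A small honest program (hypothetical example text).
honest = (
    "def gcd(a, b):\n"
    "    while b:\n"
    "        a, b = b, a % b\n"
    "    return a\n"
)

# The same program written "to be long": 1000 useless lines appended.
padded = honest + "x = 0  # padding\n" * 1000

for name, src in (("honest", honest), ("padded", padded)):
    print(name, line_count(src), "lines,", compressed_size(src), "compressed bytes")
```

The line count grows by a factor of 251 while the compressed size grows far less, which is the sense in which length stops being a good approximation once the program is written to be long.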

p.s. Despite the conclusion, I find short programs so beautiful.

# Parallelizing again

This is the latest release I compiled of the Cellular Automata 1D evolver, running on GPU with CUDA.

It is more than 1 year old, optimized for a GPU with about 500 cores. The new NVIDIA GPUs have “only” about 2500, “only” a factor of 5, but for the same price it is possible to buy 4000~5000 cores from AMD GPUs like the 7xxx and soon the 8xxx series. OpenCL is also advancing, supported by AMD, NVIDIA and Intel. I am not sure it is possible to reach the same computational power with OpenCL, and I am sure there are differences in how the brands implement the language, so it is not so easy to write OpenCL programs that work on different GPUs of the same brand and across brands. But OpenCL lets me work with different hardware solutions, and perhaps I can reach a 10x factor using OpenCL (my main doubt is whether it is possible to implement the synchronization tricks I use in CUDA, tricks that gain me a 5x speedup!).

The first AMD card I bought is the Gigabyte 7970.

With its 2048 overclockable cores it can reach about 6~7 times the computational work of my good GeForce 480 (a very good card: it worked night and day for years without hesitation).

OK, 6x faster is not enough for me, not enough to reimplement everything, so my plan is to buy another one, a 7970 or an 8970 when available, and to work with a minimum of 2048+2560 (or ~2300) and a speedup of about 10x. This multi-GPU configuration gives me the opportunity to implement another level of parallelization.

The current implementation of the evolver works at the multicore level, where the memory of the GPU is shared (there are different levels of shared memory) among all the cores. This feature lets each thread compute a value by reading the results of other threads (every cell is the result of the previous 3 cells). Managing the computational resources without shared memory lets the system expand to many levels of parallelization.

The above image shows different triangles, each one representing a computational job, where the information shared by the job is the perimeter of the triangle. To be computed, the red triangles need the base provided by the yellow triangle, and the computation proceeds by reducing its size, so no other information is required. The base of the red triangles can be computed by a yellow triangle, which needs the information of the 2 sides computed by 2 red triangles, so we have a dependence where each red triangle needs one yellow triangle, which needs 2 red triangles.
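The shrinking computation of one triangle can be sketched in a few lines. This is a minimal illustration, not the CUDA evolver: rule 110 and the ring used as a reference are arbitrary choices, and the function names are invented here. It only covers the triangle that shrinks from its base, and checks that such a job needs no information beyond the base.

```python
def step(cells, rule=110):
    """One step of an elementary 1D cellular automaton on an open segment.

    Each new cell depends on the 3 cells above it, so a row of n cells
    yields n - 2 cells: the two border cells cannot be computed locally.
    """
    return [
        (rule >> (cells[i - 1] << 2 | cells[i] << 1 | cells[i + 1])) & 1
        for i in range(1, len(cells) - 1)
    ]

def triangle(base, rule=110):
    """Compute every row that depends only on `base`, down to the apex."""
    rows = [list(base)]
    while len(rows[-1]) >= 3:
        rows.append(step(rows[-1], rule))
    return rows

def full_evolution(row, steps, rule=110):
    """Reference: evolve a whole ring (periodic boundary) for `steps` steps."""
    n = len(row)
    for _ in range(steps):
        row = [
            (rule >> (row[(i - 1) % n] << 2 | row[i] << 1 | row[(i + 1) % n])) & 1
            for i in range(n)
        ]
    return row

ring = [1 if i % 5 in (0, 3) else 0 for i in range(32)]
base = ring[8:17]              # 9 cells owned by one job
rows = triangle(base)
print([len(r) for r in rows])  # [9, 7, 5, 3, 1]
```

Row `k` of the triangle matches cells `8 + k` to `16 - k` of the full evolution after `k` steps, so the job never has to wait for its neighbours until the apex is reached.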

The size of the triangles will depend on the power of the computational units and the power of the transmission channels. It is also possible to recursively split a triangle into sub-triangles, which can be useful if there are different levels of computational units (multi-GPU, multi-PC, computing grid).

Given a triangle with a base of size B, its perimeter is B*2, and this is the size of the communication in/out for this triangle. The number of cells computed in the triangle is about (B/2)^2.

So the ratio between computation and communication is (B/2)^2 / (B*2) = B/8. If C cells can be computed in the time needed to communicate 1 cell, the base of the triangle should be B=8*C. With this size there is no idle time due to synchronization.
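The sizing rule can be checked numerically. A minimal sketch, assuming each row of the triangle is 2 cells shorter than the previous one (every cell depends on 3 cells) and that the base itself is received rather than computed; the function names and the ratio parameter `c` are mine.

```python
def cells_in_triangle(base):
    """Count the cells computed from a base of `base` cells (base excluded).

    Each row is 2 cells shorter than the one above it.
    """
    return sum(range(base - 2, 0, -2))

def communication(base):
    """Cells exchanged by a job: the base plus the two sides, about 2 * base."""
    return 2 * base

def base_for_ratio(c):
    """Base size B at which computation keeps pace with communication.

    `c` is an assumed hardware ratio: cells computed in the time
    needed to transfer one cell.
    """
    return 8 * c

for b in (16, 64, 256):
    exact = cells_in_triangle(b)
    print(b, exact, (b // 2) ** 2, exact / communication(b), b / 8)
```

The exact count is slightly below (B/2)^2, so B=8*C is an approximation that gets better as the triangle grows.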

# Latest Kurzweil book: How to Create a Mind

I read the latest book by Ray Kurzweil. It is a good book that explains in a simple way the ideas, the work and the progress of Kurzweil in the effort to build a human, to build a Strong Artificial Intelligence.

I agree with most of what Kurzweil says, especially the main idea of the mind as pattern recognition, the PRTM (pattern recognition theory of mind), because a pattern recognizer, a classifier, is a problem solver at the “top level”: if you have an engine for a general classifier you can solve every problem.

Kurzweil implements the classifier using an HHMM (hierarchical hidden Markov model).

That is a graph of states linked by different probability values. There is a big problem in this model: how do you define the topology of the graph? How many states do you need? How many levels do you need for the hierarchical structure? I think this model is not flexible enough to let every algorithm emerge.

Kurzweil is also aware of these limitations, and he adds a genetic/evolutionary algorithm to the system to solve this flexibility problem.

Before continuing, it is better to clarify a concept. Why do we need flexibility? How much flexibility do we need?

Why not use a simple table to solve every problem? We can build a table for a problem where every possible input has a corresponding output, so we have the correct answer for every instance of the problem. It is very simple.

The problem is not only a matter of space; the problem is that we don’t know the answer for every instance of the problem. Learning a table requires training on every input/output pair of the system: in some sense we have to spend too much in the learning (programming) phase to construct a solver implemented by a table. And a tree? A tree structure requires less training data to learn, but again it cannot be small enough. As an example, you can try to implement the sorting problem using a table, a tree and a graph with training data.
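To get a feeling for the numbers, here is a small sketch of the table solver for the sorting problem. It is only an illustration: `sorted` plays the role of the teacher that supplies the answer for each input, and the function names are invented for the example.

```python
from itertools import product

def build_sorting_table(values, length):
    """Solver as a table: one stored answer for every possible input."""
    return {inp: tuple(sorted(inp)) for inp in product(values, repeat=length)}

def table_size(n_values, length):
    """Number of entries (training examples) such a table needs."""
    return n_values ** length

table = build_sorting_table(range(4), 3)
print(len(table))          # 64 entries for 3 items over 4 values
print(table[(2, 0, 3)])    # (0, 2, 3)
print(table_size(256, 8))  # entries needed to sort just 8 bytes
```

Every entry has to be provided during training, and the count grows exponentially with the length of the input, while a sorting program stays the same size for every length.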

On the opposite side of flexibility there is the ILS (inverse Levin search), which gives the best solution, the best program, for the given training data. So why not use an ILS? The problem with this solver is that it is too flexible: it makes no assumptions, so the search space becomes too wide. In general, if you know some constraints, some assumptions, some characteristics of the problem, restricting the search space can be very useful (using a tree instead of a graph in a learning system can be useful if we know there cannot be connections from the bottom nodes to the top, etc…).

So how flexible/generic must a solver be? I prefer to look at the problem to understand whether there are constraints, some inferences we can make, to restrict the genericity of the solver. A generic classifier has no restrictions, but for sure a mind is not based on a generic classifier; it is only a restricted classifier.

Kurzweil seems to prefer to look at the solver (the brain) to derive these restrictions, implementing a system similar to the structure of the brain (as many other projects are doing) but without using a neural network (!?) because it is not flexible enough…

And worse, he inserts a GA because the HHMM is, again, not flexible enough. I think it would be better to define the required level of flexibility first and then implement a solver with the correct level of flexibility.

In the book Kurzweil describes his speech recognizer, which seems to work very well but is based on very strong restrictions, and it can be very dangerous to make such restrictions without strong evidence. The system uses a classifier of points in 16 dimensions where each cluster is defined as a circle in the graph that includes all the points of the class. There are also a lot of restrictions on the size of the data etc… but I don’t see big problems with those.

The problem is the assumption that the 16 dimensions are related like spatial dimensions! Why? And why use a circle to identify a cluster? Why not a 16-dimensional plane or a curve of N dimensions? An answer to these questions may come from acoustic physics, and probably there is one, given the good performance of the system. But my doubt is not there; my doubt is how such a system works for a completely different set of problems. I am sure that the pattern recognizers of the brain are very similar, but I think the restrictions made in the speech recognizer are too strong. I think it is very unlikely that a system with those restrictions can work in a totally different domain.
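To make the doubt concrete, here is a minimal sketch of a cluster defined as a “circle” in 16 dimensions. It is not Kurzweil’s code: reading the circle as a hypersphere around the mean of the class, with Euclidean distance, is my assumption, and the data is invented.

```python
import math

def make_cluster(points):
    """Describe a class by a center and the radius that encloses all its points."""
    dims = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def contains(cluster, point):
    """A point belongs to the class if it falls inside the sphere."""
    center, radius = cluster
    return math.dist(center, point) <= radius

# Two invented training points: every coordinate of the class lies in [0, 1].
cluster = make_cluster([[0.0] * 16, [1.0] * 16])

# A point that is far out of range on one axis only.
odd_point = [0.5] * 15 + [2.4]
print(contains(cluster, odd_point))
```

The sphere accepts `odd_point` although no training example ever had a value above 1 on that axis: the distance mixes all 16 dimensions as if they were interchangeable spatial coordinates, which is exactly the assumption in question.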

Another question that comes to me: if a system with those restrictions is enough for good performance, why not use an SVM (support vector machine)? It seems perfect for that problem!

Now about the GA. Why do GAs work in nature? Or better, why do evolutionary algorithms work in nature? As shown by the NFL theorems (1, 2) there are some problems. A good answer can be found in Investigations by Stuart Kauffman, where it is explained simply that in nature the fitness evolves like the object of the evolution; in other words, the problem changes. This is not the case if we fix an arbitrary problem (but we cannot do this for large problems). I am not saying that a GA cannot be a good solver, only that it can be a bad solver (genetic algorithms open deep questions… for example, are all existing problems natural? Trained to be solved by an evolutionary algorithm? Do the laws of nature constrain an evolutionary environment?).

Before using an evolutionary algorithm I always ask why the algorithm should be a good solution: on which criteria can I claim that the evolutionary algorithm can work? I look at the fitness of the problem, and if it is possible to represent the fitness such that “good solutions” are “close” enough together, it can be a good choice.
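A toy comparison shows how much the shape of the fitness matters. This is only a sketch under my own assumptions: a (1+1) evolutionary search on 40 bits, a gradual fitness that counts the correct bits, and a “needle” fitness that rewards only the exact solution. The target and both fitness functions are invented for the example.

```python
import random

def onemax(bits):
    """Gradual fitness: every correct bit is rewarded."""
    return sum(bits)

def needle(bits):
    """Flat fitness: only the exact solution is rewarded."""
    return 1 if all(bits) else 0

def evolve(fitness, n_bits=40, steps=5000, seed=0):
    """(1+1) evolutionary search: flip one random bit, keep it if not worse."""
    rng = random.Random(seed)
    current = [0] * n_bits
    for _ in range(steps):
        candidate = list(current)
        candidate[rng.randrange(n_bits)] ^= 1
        if fitness(candidate) >= fitness(current):
            current = candidate
    return current

print(sum(evolve(onemax)), "of 40 bits correct with the gradual fitness")
print(sum(evolve(needle)), "of 40 bits correct with the needle fitness")
```

The search procedure is identical in both runs. With the gradual fitness good solutions are close together and the search climbs to the target; with the needle fitness it is a blind random walk over 2^40 candidates.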

I cannot find an answer to these questions in the book. I don’t know if there are answers, and probably the objective of the book is to be “non-technical” enough, which rules out these explanations.

OK, I cannot close the post without reporting these 2 issues.

## Self-reference

Very often Kurzweil speaks about “self-…” as a powerful feature. How powerful can self-reference be? On page 188 he explicitly says “…as well as for self-modifying code ( if the program store is writable ) , which enables a powerful form of recursion.” Self-modifying code… and if the program is not writable, what can we not do? What is impossible to do if we have a universal language to write a program in read-only memory? Nothing!

To explain better, think about how self-modifying code changes itself: to change itself it must follow a program (otherwise we have an oracle). But we can make self-modifying code that also modifies that program (the way it modifies the code), and again we can build a self-self-self-modifying one. Where is the end of these self-references? The end is universality! You can never do better than a universal program. The only thing you can save with self-modifying code is the constant C (the space of the code).

For every self-…-self-rewriting program there exists a non-rewriting program that uses (in the worst case) a constant C more space.
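The argument can be illustrated with a toy machine. In the sketch below the instruction set (`inc`, `set`, `jmp`, `halt`) is invented for the example: the “self-modifying” program is only data, and the interpreter that runs it is a fixed program that never changes. The interpreter is the constant extra space.

```python
def run(program, max_steps=1000):
    """Fixed interpreter: it never changes, the program is just data in `store`."""
    store = list(program)   # writable copy; the original stays read-only
    acc, pc = 0, 0
    for _ in range(max_steps):
        op = store[pc]
        if op[0] == "halt":
            return acc
        if op[0] == "inc":          # add a constant to the accumulator
            acc += op[1]
        elif op[0] == "set":        # overwrite instruction op[1] with op[2]
            store[op[1]] = op[2]
        elif op[0] == "jmp":        # jump to instruction op[1]
            pc = op[1]
            continue
        pc += 1
    raise RuntimeError("step limit reached")

# A program that rewrites its own instructions 0 and 2 while running.
program = (
    ("inc", 1),
    ("set", 0, ("inc", 10)),
    ("set", 2, ("halt",)),
    ("jmp", 0),
)
print(run(program))
```

On the second pass instruction 0 has become `inc 10` and instruction 2 has become `halt`, yet nothing in `run` or in the original `program` tuple was ever written to.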

## The cellular automata

In chapter 9 Kurzweil speaks about cellular automata, describing how class 4 behaves, and here there seems to be a misunderstanding of Wolfram’s theory and of what happens in the evolution of a cellular automaton.

It is true that, given an arbitrary cell of a cellular automaton, it is possible that we cannot know the value of that cell after N steps without executing the entire evolution for N steps (page 238), but it is also possible that we can say exactly what the value is without executing N steps. The point is that in some cases it is possible to state the value of a cell without executing N steps, which is different from saying that we never know the value without executing the cellular automaton.

This is absolutely not in contrast with scientific laws, because the point Kurzweil misses is that we can also find theorems with proofs in a cellular automaton system, and these theorems can state that a cell must have a specific value after N steps or that a group of cells must have a particular value after M steps. We can also find a group of theorems equivalent to the Newtonian laws. There is no contradiction. We cannot predict the value of a cell in the same way that we cannot predict the position of a subatomic particle of a satellite orbiting the Earth, even if we can predict the position of the satellite. The point is that there are proofs asserting that there are cells whose value we cannot predict without executing the entire evolution.
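A simple case of such a theorem is easy to show. The example is my choice, not one from the book: rule 90 is an additive rule (not a class 4 one), and starting from a single 1 the value of any cell after N steps is a binomial coefficient modulo 2, so it can be stated without running the evolution.

```python
def rule90_simulate(steps):
    """Run rule 90 (new cell = left XOR right) from a single 1 for `steps` steps."""
    width = 2 * steps + 3            # wide enough that the borders stay 0
    row = [0] * width
    row[width // 2] = 1
    for _ in range(steps):
        row = [0] + [row[i - 1] ^ row[i + 1] for i in range(1, width - 1)] + [0]
    return row

def rule90_predict(steps, offset):
    """Value of the cell `offset` places from the seed after `steps` steps.

    Uses the known closed form: the cell is C(steps, k) mod 2 with
    k = (steps + offset) / 2, and C(n, k) is odd exactly when k & (n - k) == 0.
    """
    if abs(offset) > steps or (steps + offset) % 2:
        return 0
    k = (steps + offset) // 2
    return 1 if (k & (steps - k)) == 0 else 0

# The prediction costs a few integer operations even for a huge N.
print(rule90_predict(10 ** 9, 0))
```

For small N the prediction can be compared cell by cell with the simulation; for large N only the prediction is practical.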

# The Measure of Intelligence

The measure of intelligence of an object X is exactly the Kolmogorov complexity of X .

This comes from the simple observation that difficult problems have high KC and for every high KC there are difficult problems.

The interesting thing is to compare this with a utility function on solving problems. There are a lot of definitions of intelligence based on a measure of utility. The idea is to ask: how useful is it to solve a problem X? The answer comes from the Universal Distribution.
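For reference, the Universal Distribution is usually written as follows. The notation is the standard one and is not taken from the original post: $U$ is a universal prefix machine, $p$ ranges over the programs that output $x$, $|p|$ is the length of $p$ in bits and $K(x)$ is the Kolmogorov complexity of $x$.

$$m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}, \qquad m(x) = 2^{-K(x) + O(1)}$$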

From here it is simple to understand that simple problems have high probability, and so a solution for these problems is more important. The ability to solve a simple problem is more useful than the ability to solve a difficult one!

This is a utility measure, but we cannot accept that solving a simple problem is more intelligent than solving a difficult one, so we must split the 2 definitions of Utility and Intelligence.

Anyway, there is this incredible fact that the Universal Distribution tells us: intelligence is not so useful!

Despite this deduction, my attraction and my research are reserved for intelligence.

# Proving Darwin

Gregory Chaitin is one of my favorite authors, perhaps the best one; I have read all his books and all his papers. He is the man who discovered Ω, who discovered algorithmic randomness, and since I don’t believe in the existence of stochastic processes, I think the correct definition of random is algorithmic randomness.

The objective of Chaitin is to prove that a random walk in software space can increase complexity. In trying to do this, he reaches a very interesting result.

I disagree with Chaitin on the starting point, not on the conclusion.

To have software that evolves with infinitely increasing complexity we need the existence of:

1. a random stochastic source
2. an oracle to solve the halting problem

Point 1 is the main problem: I think the universe is deterministic. It is very difficult to construct a stochastic universe that appears so deterministic. Why should we introduce a stochastic source where everything can be explained by deterministic systems? Without a stochastic source we cannot define an “evolving program”: the program will change following deterministic rules, so the program never changes; what really changes is its state. A program never changes, by definition.

Without a stochastic source we cannot have a random walk in software space. What we can do is use an algorithmically random string S as a finite source of random data, but in this case there is a limit to the increase in complexity that the evolving software can reach.

If we do have a stochastic random source, the injection of complexity from the random source into the evolving programs is not the most interesting part. We should move our attention to the stochastic random source with its infinite complexity: this strange object becomes much more interesting than the evolving programs.

Following Occam’s razor, isn’t it simpler to believe in a deterministic universe that follows deterministic rules, where the execution of a program of finite algorithmic complexity expresses evolutionary characteristics, without strange things like oracles and stochastic sources?

Why does the universe exhibit evolutionary behaviour? Can this evolutionary scenario be only a point of view on the behaviour of a deterministic program?

What Chaitin develops is an interesting tool to use with approximations: our poor knowledge of the deterministic behaviour of the evolutionary walk in software space can be approximated with a random walk, and likewise the oracle is an approximation, because I really don’t think this is the reality.

When Chaitin moved to the field of biology I had a lot of doubts about the relevance of the results that could be reached, but I am amazed by these results, and I trust that Chaitin’s developments in this field will give me a lot of surprises.

I hope to have given the reader enough curiosity to read the new Proving Darwin book.

# Exact string matching

I have made a lot of improvements since the first implementation of the exact string matching algorithm, especially for its parallelization with the CUDA implementation, but now it is time to speed it up again.

In the best case the program runs the search in O(M/N), so the idea is to maximize the frequency of this best case.

To make the search I build a boolean lookup table indexed by word: I set an entry to true when the word occurs inside the string to search for, then I scan the first string looking for those words.

So to maximize the speed of the algorithm I need to maximize the size of the word. Now I use 2 different word sizes, 3 bytes and 4 bytes: the first for very short strings and the second in all the other cases.

The limit for using the 4-byte word was 10 bytes, because the first byte and the last are subject to masking due to the 8 bit shifts needed to manage all the bit alignments inside the bytes, and I need 2 words for the possible word misalignment.

This was not the best limit, because I can enlarge the step used when searching for the words in the first string. The max step in bytes is:

$(\lfloor |S|/|W| \rfloor - 1) \cdot |W| + (|S| \bmod |W|) + 1$

So the limit now is 6 bytes: for each string starting from a size of 6 bytes I can use a 4-byte word map.

Implementing this on CUDA there is a problem when the search step is less than the word size (less than 4 bytes), because there are conflicts in memory access (several threads access the same memory for reading) and this slows down the computation.

To avoid this problem I use more than one parallel job, each job with a search step of one word and each with a different starting point in the search, a phase that depends on $(|S| \bmod |W|) + 1$.
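The idea of the word map and of the max step can be sketched sequentially. This is a simplified illustration, not the CUDA implementation: it works on byte-aligned strings only (no bit shifts or masks), it keeps the offsets of each word instead of a plain boolean so that a hit can be verified directly, and the names `find_all` and `word_map` are mine. Note that the max step formula reduces to |S| - |W| + 1, because $\lfloor |S|/|W| \rfloor \cdot |W| + (|S| \bmod |W|) = |S|$.

```python
def find_all(text: bytes, pattern: bytes, word_size: int = 4):
    """Byte-aligned exact matching that samples one word every `step` bytes."""
    n = len(pattern)
    if n < word_size:
        raise ValueError("pattern must be at least one word long")

    # Word map: every word of the pattern, with the offsets where it occurs.
    # (The post uses a boolean table; offsets are kept here to verify faster.)
    word_map = {}
    for off in range(n - word_size + 1):
        word_map.setdefault(pattern[off:off + word_size], []).append(off)

    # Max step: any occurrence fully contains exactly one sampled word.
    step = (n // word_size - 1) * word_size + n % word_size + 1
    assert step == n - word_size + 1

    hits = []
    for q in range(0, len(text) - word_size + 1, step):
        for off in word_map.get(text[q:q + word_size], ()):
            start = q - off
            if start >= 0 and text[start:start + n] == pattern:
                hits.append(start)
    return sorted(hits)

print(find_all(b"abracadabra abracadabra", b"abracad"))
```

Most sampled words are not in the map, so most of the text is skipped without any comparison, and the longer the pattern the larger the step.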

After the implementation I ran some tests, and with 7-byte strings the program now runs 2x faster!

p.s. It is incredible: every time I put my hands on the code I get a 2x faster program or better… is there an end to this process?