Evolution in a nutshell

an alternative outline on evoution

and some consequences concerning valuations

by

Gregor Kjellström

 

            References          

5 On the efficiency of random search

We will now look at another example where information is gained in a somewhat different way. We have a process that generates a number of individual trial points according to some normal p. d. f. Some of these points are assumed to belong to some region of acceptability (A, see definition 5.3) while the others do not. This process must not necessarily make any reductions, but in a large number of trials the average probability of success may be equal to P.

If the mutation rate is very low then P will become close to its maximum possible value q (for instance if A is covered with a probability carpet having probability = q) and the corresponding process is inefficient because it is unable to move. A very high mutation rate gives a P close to zero and a chaotic process that is inefficient because lots of resources are wasted on individuals that can never be selected. Between these extremes we expect to find a most economic compromise making the process most efficient. See figure 5.1.

A reasonable assumption is that if the process can gain more information by a smaller number of trials (less time or work), then the process should be more efficient. In fact, we may define a measure of efficiency based on the concept of information as: 

Definition 5.1. The efficiency E of a random search process is defined as information divided by the time (or work, i. e. proportional to the inverse of P) needed to get the information, i. e. E should be proportional to the function E = - P log(P/q) where q £ 1 is the maximum attainable value of P.

It may be that we still think that information and efficiency are rather fuzzy concepts.  Let us therefore look at some examples and a certain theorem of efficiency.

In an example due to Kjellström (1969) the speed of a random walk in a simplex region was proportional to the formula given by definition 5.1.

                                      

Figure 5.1. Most efficient compromise when P = q/e = 0.36q.

In a similar example due to Rechenberg (1973), a random walk is pushed thru a corridor as in figure 5.2 below. In this case the region of acceptability is defined as a (n-1)-dimensional interval in the parameters x2, x3, ..., xn but a x1-value below the last accepted value, will never be accepted. Since q equals 1/2 in this case, the maximum speed towards higher x1-values is reached for P = 0.18, in agreement with the findings of Rechenberg.

Figure 5.2. A random walk pushed in the direction x1.

In the next example we want to see under what circumstances a variance in a normal distribution may be adapted most efficiently.

Let A be defined by

     s(x) = exp(Q);                 Q = å xi2/2;  i = 1, 2, ¼ n.

a probability function that represents a very uncertain environment to any kind of stochastic adaptive process and let N(x) be a perfectly adapted normal p. d. f., i. e.

     N(x) = C exp( Q/s2 );                  C = (2p)-n/2sn.

Then, the p. d. f. of the pass samples is also normal, and P may easily be found. We have

     P = C ò exp(Q + Qs2) dx = (r/s)n,

where we have introduced the standard deviation of the pass samples,

     r = s/sqrt(1 + s2).

Since the process is in a state of equilibrium, it may be seen as something which strives to increase the volume of the ellipsoid of concentration of N by factor (s/r)n = P in every iteration, while the function s(x) acts to decrease it by the same factor. But for example, if the restriction due to s(x) is suddenly removed in some specific direction, then the volume will increase by the factor s/r in every iteration. If we prefer to use an additive measure instead of this factor, the volume will be increased by log(s/r) = - log(P)/n in a logaritmic scale. Because this displacement of volume in logaritmic scale will be obtained at the expense of time or work proportional to 1/P, we may say that the efficiency of the process is proportional to  – P log(P). Again, efficiency may be seen as information divided by the time (work) needed to get the information and the result will be independent of n in this case.

More generally, the theorem 5.1 below shows that the same definition of efficiency is valid for a large class of measures fulfilling the postulates below. These postulates seem to limit the applicability of the theorem, but in combination with the normal adaptation the postulates may be approximately fulfilled also when the processes are not statistically independent in different parameters. An additional necessary condition is that the derivative of E with respect to P is limited negative at P = q £ 1.

Postulate 5.1. P is the average hitting probability on A (definition 5.3). The time (or work) needed to test a point x for membership in A is equal for all points. 

postulate 5.2. The process is statistically independent and separable in the n different parameters and equally efficient in all parameters, i. e. P = pn, where p is the average hitting probability as far as one parameter is concerned. 

postulate 5.3. The efficiency E is a continuous function of P in the interval 0 £ P £ q £ 1, (where q > 0 is the maximum attainable P-value) and is positive in the P-interval, except at the end points where E = 0. The derivative [dE(P)/dP]P=q is limited negative, i. e. -¥ < [dE(P)/dP]P=q < 0.            

Theorem 5.1. All measures of efficiency, that satisfy the postulates above, are asymptotically proportional to – P log(P/q) when n ® ¥. The maximum value of E is obtained with P = q/e

Proof: Let us first have q = 1, and let

     E1 = - P å gi [log(P)]i,    (sum extended over i = 1, 2, ..., ¥)

be the efficiency of the process in the parameter x1. Provided that g1 is > 0, then [dE1(P)/dP]P=1 < 0. Furthermore E1 = 0 when P = 0 or P = 1, which means that postulate 5.3 is satisfied. Therefore E1 may approximate any measure of efficiency, to any degree of precision, without violating the postulates.

     Suppose that a new parameter xn+1 is added without any changes to the process characteristics in the existing n parameters. In principle (due to the postulates 5.1 and 5.2) the same points as before may be found in A, as far as the existing n parameters are concerned. Evidently, these points give us a certain amount of information about A, even though the concept of information should not yet have been defined. But P will decrease by factor p. Thus, because more time (work) is needed to get the same amount of information, the efficiency will be decreased; by factor p in this case. This leads to the recursion:

     En+1(pP) = pEn(P)          or       En(P) = P1/nEn-1(P1-1/n).

Starting with E1(P), the efficiency may now be calculated for any n. we have:

     E2(P) = - P1/2P1/2 å gi logi(P1/2) = - P å gi [log(P)/2]i;

     E3(P) = - P1/3P2/3 å gi logi(P2/3) = - P å gi [log(P)/3]i;

and finally

     En(P) = - P å gi[log(P)/n]i;

where all sums are extended over i = 1, 2, ..., ¥. And when n ® ¥ we get

     nEn(P) ® - g1 P log(P).

In the case when q is < 1, which means that A is covered by some probability carpet with probability = q, we will have a shift in P-scale from P to P/q and the time (work) will be increased by factor 1/q to give the same result as before. Thus we get:

     nEn(P) ® - g1 P log(P/q)

and the theorem is proved.

A possible interpretation of this is that efficiency, in a situation of very high complexity with a large n, equals the information log(P/q) divided by the time (work) = 1/P needed to get the information. This is another indication that information may be seen as the logarithm of a probability.

Theorem 5.2. The average information of a normal p. d. f. is

     H = log{ (2pe)n/2 s1s2 ... sn ) }.           

Proof:

Assume first that we have one normally distributed parameter x with m = 0 and standard deviation = s. Thus

     N(x) = C exp( -x2/2s2 );      C = (2p)-1/2/s.

The average information of N(x) is

     H = - ò N(x) log[ N(x) ] dx   (interval of integration is -¥ < x < ¥).

     = - ò C exp( -x2/2s2 ) log[ C exp( -x2/2s2 ) ] dx

     = log[ (2p)1/2s ] ò C exp( -x2/2s2 ) dx  +

                      +  ò (x2/2s2) C exp(  -x2/2s2 ) dx

     = log[ (2p)1/2 s  ] + 1/2

     = log[ (2pe)1/2 s ].

Now, if we have n statistically independent parameters with standard deviations s1, s2, ..., sn, then – because of the additive property of average information – we have

     H = log{ (2pe)n/2 s1s2 ... sn ) }

Since orthogonal rotations of the coordinate system can not have any effect on the information, it follows that this expression is valid also when the parameters depend on each other provided that the standard deviations are measured along the main axes of the distribution.

To summarize, average information, disorder and biological diversity are in principle equivalent entities defined by

      H = - å pi log( pi );   å pi = 1 for i = 1, 2, ..., r; in the discrete case, and as

     H = - ò f(x) log[ f(x) ] dx;   in the continuous case.