MEP description

Home
    News
Description
Papers
Source code
MEPX Software
    Videos
    Screenshots
    User manual
Links
Contact

Abstract

Multi Expression Programming (MEP) is a method for automatic generation of computer programs. Particularly, MEP can be used for generating mathematical expressions for data analysis (regression, classification and time series). MEP differentiates from other GP techniques by encoding multiple solutions in the same chromosome.

MEP strengths

MEP has several strong advantages compared to other techniques:

  • It encodes multiple solution in the same chromosome. This means that the search space is better explored. Evaluating such chromosome is done (in most of the cases) in the same complexity as in the case of a single solution per chromosome.
  • Evaluates the computer programs instruction by instruction and stores the partial results for all training data. In this way MEP has a huge speed (less if, switch instructions) without needing to work at low level (with processors registers) like Linear Genetic Programming does.

Representation of solutions

MEP representation is not specific. Linear and tree-based representations have been tested (see the list of papers). Here we exemplify with a linear representation.
  • Each MEP chromosome has a constant number of genes. This number defines the length of the chromosome.
  • Each gene contains an instruction (also name function) or variable (also named terminal).
  • A gene that encodes a function includes pointers towards the function arguments. Function arguments always have indices of lower values than the position of the function itself in the chromosome. This representation ensures that no cycle arises while the chromosome is decoded (phenotypically transcripted). According to the proposed representation scheme, the first symbol of the chromosome must be a terminal symbol. In this way, only syntactically correct programs (MEP individuals) are obtained.

Example

Consider a representation where the numbers on the left positions stand for gene labels. Labels do not belong to the chromosome, as they are provided only for explanation purposes.

For this example we use the set of functions:

F = {+, *},

and the set of terminals:

T = {a, b, c, d}.

An example of chromosome using the sets F and T is given below:

  1. a
  2. b
  3. + 1, 2
  4. c
  5. d
  6. + 4, 5
  7. * 3, 5

The maximum number of symbols in MEP chromosome is given by the formula:

Number_of_Symbols = (n + 1) * (Number_of_Genes - 1) + 1,

where n is the number of arguments of the function with the greatest number of arguments.

The maximum number of effective symbols is achieved when each gene (excepting the first one) encodes a function symbol with the highest number of arguments. The minimum number of effective symbols is equal to the number of genes and it is achieved when all genes encode terminal symbols only.

Decoding MEP chromosome and the fitness assignment process

Now we are ready to describe how MEP individuals are translated into computer programs. This translation represents the phenotypic transcription of the MEP chromosomes.

Phenotypic translation is obtained by parsing the chromosome top-down. A terminal symbol specifies a simple expression. A function symbol specifies a complex expression obtained by connecting the operands specified by the argument positions with the current function symbol.

For instance, genes 1, 2, 4 and 5 in the previous example encode simple expressions formed by a single terminal symbol. These expressions are:

E1 = a,

E2 = b,

E4 = c,

E5 = d,

Gene 3 indicates the operation + on the operands located at positions 1 and 2 of the chromosome. Therefore gene 3 encodes the expression:

E3 = a + b.

Gene 6 indicates the operation + on the operands located at positions 4 and 5. Therefore gene 6 encodes the expression:

E6 = c + d.

Gene 7 indicates the operation * on the operands located at position 3 and 5. Therefore gene 7 encodes the expression:

E7 = (a + b) * d.

E7 is the last expression encoded by in chromosome.

There is neither practical nor theoretical evidence that one of these expressions is better than the others. This is why each MEP chromosome is allowed to encode a number of expressions equal to the chromosome length (number of genes). The chromosome described above encodes the following expressions:

E1 = a,

E2 = b,

E3 = a + b,

E4 = c,

E5 = d,

E6 = c + d,

E7 = (a + b) * d.

The value of these expressions may be computed by reading the chromosome top down. Partial results are computed by dynamic programming and are stored in a conventional manner.

Due to its multi expression representation, each MEP chromosome may be viewed as a forest of trees rather than as a single tree, which is the case of Genetic Programming.

As MEP chromosome encodes more than one problem solution, it is interesting to see how the fitness is assigned.

The chromosome fitness is usually defined as the fitness of the best expression encoded by that chromosome.

For instance, if we want to solve symbolic regression problems, the fitness of each sub-expression Ei may be computed using the formula:

fitness(Ei) = sum(|obtainedk,i - targetk,i|), k=1,... ,n

where obtainedk,i is the result obtained by the expression Ei for the fitness case k and targetk is the targeted result for the fitness case k. In this case the fitness needs to be minimized.

The fitness of an individual is set to be equal to the lowest fitness of the expressions encoded in the chromosome:

fitness(C) = min fitness(Ei).

When we have to deal with other problems, we compute the fitness of each sub-expression encoded in the MEP chromosome. Thus, the fitness of the entire individual is supplied by the fitness of the best expression encoded in that chromosome.

Why encoding multiple solutions within a chromosome?

When you compute the value of an expression encoded as a GP tree you have the compute the value of all subtrees. This means that all GP subtrees can be viewed as a potential solution of the problem being solved. Most of the GP techniques considers only the one tree while ignoring all the other subtrees. However, the value/fitness for all subtrees is computed by GP. The biggest difference between MEP and other GP techniques is that MEP outputs the best subtree encoded in a chromosome. Note that the complexity (roughly speaking - the running time) is the same for MEP and other GP techniques encoding 1 solution/chromosome.

The second reason for the this question is motivated by the No Free Lunch Theorems for Search. There is neither practical nor theoretical evidence that one of the solutions encoded in a chromosome is better than the others. More than that, Wolpert and McReady proved that we cannot use the search algorithm's behavior so far for a particular test function to predict its future behavior on that function.

How to efficiently encode multiple solutions within a chromosome is sometimes difficult. There is no general prescription on how to encode multiple solutions in the same chromosome. In most cases this involves creativity and imagination. However, some general suggestions can be given.

In order to obtain some benefits from encoding more than one solution in a chromosome we have to spend a similar effort (computer time, memory etc) as in the case of encoding a single solution in a chromosome. For instance, if we have 10 solutions encoded in chromosome and the time needed to extract and decode these solutions is 10 times larger than the time needed to extract and process one solution we got nothing. In this case we can not talk about an useful encoding.

We have to be careful when we want to encode a multiple solutions in a variable length chromosome (for instance in a standard GP chromosome), because this kind of chromosome will tend to increase its size in order to encode more and more solutions. And this could lead to bloat.

Usually encoding multiple solutions in a chromosome might require storing the partial results. Sometimes this can be achieved by using the Dynamic Programming technique. This kind of model is employed by the Multi Expression Programming.

Genetic operators

Crossover

One cutting point or uniform crossover have been tested with similar results. The cut points are chosen between instructions not inside them.

One-point recombination operator in MEP representation is similar to the corresponding binary representation operator. One crossover point is randomly chosen and the parent chromosomes exchange the sequences at the right side of the crossover point.

Mutation

Each symbol (terminal, function of function pointer) in the chromosome may be the target of the mutation operator. Some symbols in the chromosome are changed by mutation. To preserve the consistency of the chromosome, its first gene must encode a terminal symbol.

We may say that the crossover operator occurs between genes and the mutation operator occurs inside genes.

If the current gene encodes a terminal symbol, it may be changed into another terminal symbol or into a function symbol. In the later case, the positions indicating the function arguments are randomly generated. If the current gene encodes a function, the gene may be mutated into a terminal symbol or into another function (function symbol and pointers towards ar- guments).

Main algorithm

The MEP algorithm starts by creating a random population of individuals.

The following steps are repeated until a given number of generations is reached:

  • Two parents are selected using a standard selection procedure.
  • The parents are recombined in order to obtain two offspring.
  • The offspring are mutated.
  • Each offspring O replaces the worst individual W in the current population if O is better than W.

Read more

Read more about MEP and MEP for symbolic regression problems here.

Read more about MEP for classification problems here.