Day 18: Creating a self-learning AI

Pierre G.
Jun 24, 2021

Read the story from the start

This blog is officially online

Wait a moment, wasn't that the case already? Well, yes and no: obviously it was already online, but I hadn't sent the link to anyone except a handful of people, so the possibility of having other people read it was still mostly theoretical.

Yesterday I sent the link to some friends and family members. I also made it easier to read by adding links between the different sections, and added tags to the posts so that, in theory, people can find them through search engines or Medium suggestions.

This is obviously not going to attract massive traffic overnight, but still, it is one extra step towards having people (possibly even strangers) read my thoughts online. This is both exciting and terrifying! (so the good kind of terrifying :) )

On to the update:

Looking deeper into the loss function

After spending a lot of time improving my data (i.e. the quality of my self-play games) and getting some first results, it was time to look into the training phase. To do that, I focused on a single training iteration of my model.

The main questions I was trying to answer were:

  • Does the trained model fit my data correctly? (i.e. does it reproduce the target outputs we give it during training?)
  • Does the trained model overfit? (i.e. does it specialize so much on the training data that it can't generalize well to new data?)

I started with the first question. A good way to see if the model fits the training data is to look at the loss function (as a reminder, the loss function is the mathematical function that evaluates the gap between the model's actual output and the target output given by the training labels).

The loss function we use is taken from the AlphaGo Zero paper and looks like this:
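
l = (z − v)² − πᵀ log p + c‖θ‖²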

Where:

  • z is the target value (the end result of the game)
  • v is the value returned by the model
  • 𝜋 is the target policy (the policy returned by the MCTS)
  • p is the policy returned by the model
  • θ represents the model's weights, and c is a small constant weighting the regularisation term

The first term is a loss on the value (a mean-squared error). The second term is a loss on the policy (a cross-entropy loss). The third term adds a penalty for having big weights in the model (a regularisation trick used to limit overfitting), and we can ignore it for now. So the total loss is the sum of the value loss and the policy loss.
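
To make this concrete, here is a minimal sketch in plain NumPy of how the combined loss can be computed for a single training example (ignoring the regularisation term, as above). The function name and the small epsilon are illustrative choices, not necessarily what my actual training code looks like.

```python
import numpy as np

def combined_loss(z, v, pi, p, eps=1e-8):
    """Sketch of the AlphaGo Zero-style loss for one training example.

    z  : target value, the end result of the game (e.g. -1, 0 or +1)
    v  : value predicted by the model
    pi : target policy from the MCTS (a probability distribution over moves)
    p  : policy predicted by the model (a probability distribution over moves)
    """
    value_loss = (z - v) ** 2                    # mean-squared error on the value
    policy_loss = -np.sum(pi * np.log(p + eps))  # cross-entropy on the policy
    return value_loss + policy_loss
```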

After running a series of tests, I noticed that both of these losses would hit a lower bound (0.64 for the policy loss and 0.78 for the value loss). The lower the loss, the more closely the model fits the training data, so I wanted to understand what caused this lower bound.

I thought the lower bound on the value loss could be explained by inconsistencies in the data: because we use the end result of the game as the target for the value, if the same game state appeared in both a winning game and a losing game, it would come up twice in the training data with different labels. Such inconsistencies explain why the model can't fit the training data 100% perfectly. This wouldn't explain the bound on the policy loss though, as the policy targets are supposed to be consistent.
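
As a quick sanity check of that intuition, here is a tiny made-up example: if the same position shows up once with z = +1 and once with z = −1, the average value loss on those two samples can never drop below 1, which is reached when the model predicts v = 0.

```python
import numpy as np

# Two copies of the same position with contradictory outcomes (made-up data).
z = np.array([+1.0, -1.0])

for v in (0.0, 0.5, -0.5):
    print(v, np.mean((z - v) ** 2))  # minimum is 1.0, reached at v = 0
```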

The first thing I tried was removing one of the two losses from the equation, to see whether the model would do better on an individual loss when it only had to optimise that one. I got the same lower bounds as before.

I then hand-crafted a training dataset by running the MCTS on all possible Tic-Tac-Toe positions, and used it to train the model (instead of the self-play game dataset). Again, I got the same lower bound.
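
For context, Tic-Tac-Toe is small enough that the positions can be enumerated by brute force. The sketch below shows one way to do it (with the board encoded as a list of 9 cells holding +1, −1 or 0); it illustrates the idea rather than reproducing the exact code I used.

```python
def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def reachable_positions():
    """Depth-first enumeration of every position reachable from the empty board."""
    seen = set()

    def expand(board, player):
        key = tuple(board)
        if key in seen:
            return
        seen.add(key)
        if winner(board) != 0 or 0 not in board:  # game over: win or full board
            return
        for i, cell in enumerate(board):
            if cell == 0:
                child = list(board)
                child[i] = player
                expand(child, -player)

    expand([0] * 9, +1)
    return seen

print(len(reachable_positions()))  # number of reachable positions (should be 5478)
```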

Since this lower bound was the same even on very different datasets, I started to suspect it was a mathematical limitation of the loss function itself. For the policy, we use a cross-entropy loss (good explanations are available online). It turns out that this function does have a lower bound: the entropy of the target probability distribution, which in our case is not 0.

Realising that, I calculated what this theoretical lower bound would be (i.e. what the loss would be in the perfect scenario where the model's output matches the target exactly). I got a value of 0.55 for the self-play dataset and 0.4 for my hand-crafted dataset, both still lower than the bound we observed during training (0.64).
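
To illustrate that calculation, here is a small sketch using made-up target policies rather than my real data: the cross-entropy never goes below the entropy of the target, and the best achievable policy loss on a whole dataset is simply the average entropy of its MCTS targets.

```python
import numpy as np

def cross_entropy(pi, p, eps=1e-8):
    return -np.sum(pi * np.log(p + eps))

def entropy(pi, eps=1e-8):
    return -np.sum(pi * np.log(pi + eps))

# Made-up MCTS target over three legal moves (not from my actual dataset).
pi = np.array([0.7, 0.2, 0.1])

print(entropy(pi))                                   # ~0.80: the floor for this target
print(cross_entropy(pi, pi))                         # same value, reached when p == pi
print(cross_entropy(pi, np.array([0.4, 0.4, 0.2])))  # higher, as expected

# The theoretical floor for a whole dataset is the mean entropy of its targets.
dataset_targets = [pi, np.array([0.25, 0.25, 0.25, 0.25])]  # placeholder data
print(np.mean([entropy(t) for t in dataset_targets]))
```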

I want to keep investigating what is causing this lower bound on the loss. Today I plan to start testing more complex models, to see whether the simplicity of the model is the limiting factor (contrary to the assumption I had made before).

Read the next Post
