<p>mauicv’s blog: an informal blog about things I’m interested in, currently mostly maths and reinforcement learning. Stuff I’ve learnt, or failed to. I make no apologies for spelling.</p>
<p><strong>DDPG bugs</strong> (2021-02-09)</p>
<p><sup><strong>note</strong>: <em>Relevant code for this post is <a href="https://github.com/mauicv/BipedalWalker-v2-ddpg">here</a></em></sup></p>
<hr />
<p><br /></p>
<p>In a previous <a href="/reinforcement-learning/2020/12/22/deep-deterministic-policy-gradients.html">post</a> I gave a rough explanation of DDPG theory. Here, for posterity, I list the stupid mistakes that confounded me for far longer than they should have while I was implementing the DDPG algorithm:</p>
<h3 id="accidentally-bounding-the-critic-️">Accidentally bounding the critic 🤦♂️</h3>
<p>When creating the critic network I copied and pasted the actor network. In doing so I accidentally forgot to remove the <code class="language-plaintext highlighter-rouge">tanh</code> activation that the final layer of the actor uses. This meant the critic could at most predict a total reward of between <code class="language-plaintext highlighter-rouge">-1</code> and <code class="language-plaintext highlighter-rouge">1</code> for the entire episode, given any state and action pair! The reward for the bipedal walker environment is much greater than 1, so if the critic only ever returns at most 1 then its ability to guide the actor is severely stunted.</p>
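<p>A minimal numpy sketch of the bug (the <code class="language-plaintext highlighter-rouge">dense</code> helper here is a stand-in for a Keras <code class="language-plaintext highlighter-rouge">Dense</code> layer, not the repo’s actual code):</p>

```python
import numpy as np

def dense(x, w, b, activation=None):
    """A single fully connected layer; a stand-in for a Keras Dense layer."""
    out = x @ w + b
    return np.tanh(out) if activation == "tanh" else out

rng = np.random.default_rng(0)
state_action = rng.normal(size=(1, 8))          # concatenated (state, action)
w, b = rng.normal(size=(8, 1)), np.zeros(1)

# The bug: a critic head copied from the actor keeps the tanh activation.
q_bounded = dense(state_action, w, b, activation="tanh")
# The fix: the critic's final layer must be linear to represent any return.
q_linear = dense(state_action, w, b)

assert abs(q_bounded[0, 0]) < 1.0   # the buggy critic can never predict beyond (-1, 1)
```

<p>However large the true return, <code class="language-plaintext highlighter-rouge">q_bounded</code> is stuck inside the unit interval, which is exactly why the actor got no useful signal.</p>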
<h3 id="mismatched-actor-and-target-actor-">Mismatched actor and target actor 🤦</h3>
<p>In order to debug the issues with DDPG applied to the bipedal walker environment I also implemented the pendulum environment, as it’s a simpler environment in which it’s easier to spot errors. For whatever reason the ranges of values the actions can take differ between the two environments. When I went back to the bipedal environment I kept the high action bound from the pendulum environment in the target actor by mistake. This value was used to scale the actor outputs, which are between -1 and 1, to the range of admissible values. This meant the target actor was providing values double those of the actor.</p>
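<p>The scaling in question looks something like this (the bound values and the <code class="language-plaintext highlighter-rouge">scale_action</code> helper are illustrative assumptions, not the repo’s code; check <code class="language-plaintext highlighter-rouge">env.action_space.high</code> for the real values):</p>

```python
import numpy as np

# Illustrative bounds; check env.action_space.high for the real values.
PENDULUM_HIGH = 2.0   # e.g. the pendulum torque bound
BIPEDAL_HIGH = 1.0    # BipedalWalker actions live in [-1, 1]

def scale_action(raw_action, high):
    """Scale a tanh output in [-1, 1] to the environment's action range."""
    return raw_action * high

raw = np.array([0.5, -1.0, 0.25, 1.0])
# Using the pendulum bound in the bipedal environment doubles every action:
assert np.allclose(scale_action(raw, PENDULUM_HIGH),
                   2 * scale_action(raw, BIPEDAL_HIGH))
```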
<h3 id="returning-correct-values-from-memory-buffer-️">Returning correct values from memory buffer 🤦♂️</h3>
<p>This one was the realisation that tipped the algorithm over the edge from not working to finally working. It was also the stupidest thing. On each iteration of the environment you need to store the state, the next state, the action that moved the state to the next state and finally the reward obtained. When you run the learning step you take a sample of these recorded values and then update the critic and actor networks. If you accidentally return the state twice instead of the state and the next state then the critic will never learn anything, and nothing the actor does makes a difference to the environment. I spent way too long trying to figure out why nothing was being learnt, only to discover this was the issue!</p>
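<p>A minimal sketch of what the buffer should do (not the repo’s exact implementation):</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal sketch of a circular replay buffer."""

    def __init__(self, maxlen=100_000):
        self.buffer = deque(maxlen=maxlen)

    def store(self, state, next_state, action, reward):
        self.buffer.append((state, next_state, action, reward))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, next_states, actions, rewards = zip(*batch)
        # The bug amounted to returning `states` twice here instead of
        # (states, next_states): the critic then never sees a transition.
        return states, next_states, actions, rewards

buf = ReplayBuffer()
for t in range(10):
    buf.store(state=t, next_state=t + 1, action=0.0, reward=1.0)
states, next_states, actions, rewards = buf.sample(4)
assert all(ns == s + 1 for s, ns in zip(states, next_states))
```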
<h2 id="lesson">Lesson:</h2>
<p>The purpose of listing these errors is to illustrate that none of them threw an exception or gave results that one could easily follow to the source of the problem. This is the main difference between RL/ML and something like web development. In web development mistakes are kind of obvious, in that it works or it doesn’t. If it doesn’t then either an error gets thrown or the app breaks in some reasonably obvious way. In reinforcement learning, if something doesn’t work then it’ll fail, but it’ll fail in a manner indistinguishable from the way it would fail for most other errors.</p>
<p>I found I had a bias towards assuming the mistakes I’d made were more likely to result from misunderstandings of the algorithm theory than from my programming. I think this is because the algorithm is inherently opaque. It’s hard to say, after a small number of training steps, whether the actor subsequently performs better than it did before. On top of this, a lot of the logic is hidden behind the <a href="https://www.tensorflow.org/">tensorflow</a> and <a href="https://keras.io/">keras</a> APIs. Hence I spent a lot of time hypothesising what types of issue might result in the algorithm failing the way it was, rather than just looking for the typical set of errors all programmers make. These are the kinds of things I would have found in a second if this were a web app or something. I guess the lesson here is: write tests for everything!</p>
<hr />
<p><br /></p>
<h3 id="another-error-">Another Error 🤦:</h3>
<p><strong>TLDR</strong>: check numpy operation output and input array shapes are correct!</p>
<p>This error didn’t actually prevent the algorithm from working but I thought I’d mention it for those who are interested…</p>
<p>Numpy allows certain operations between arrays of different sizes. In the case of two arrays of the same shape it just multiplies the elements pairwise. If the shapes don’t match then, if possible, it will “copy” a vector lying along one dimension along the missing dimension/s until the two arrays are the same shape. Once this is the case it can then proceed with the multiplication operation. “copy” is in quotation marks because it doesn’t really copy the array out, as that would waste memory.</p>
<p>So as an example:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="o">>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]])</span>
<span class="o">>></span> <span class="n">a</span><span class="o">*</span><span class="n">b</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">],</span>
<span class="p">[</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">18</span><span class="p">],</span>
<span class="p">[</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">27</span><span class="p">]])</span>
</code></pre></div></div>
<p>Here the <code class="language-plaintext highlighter-rouge">[1, 2, 3]</code> vector is broadcast out to become <code class="language-plaintext highlighter-rouge">[[1, 2, 3],[1, 2, 3],[1, 2, 3]]</code> and then each element is multiplied with its corresponding one in the other array to get the result. Sometimes the array shapes don’t satisfy the rules required to do the copying operation, but you can add dimensions in order to ensure they do. For instance if you have an array <code class="language-plaintext highlighter-rouge">a</code> of length <code class="language-plaintext highlighter-rouge">3</code> and an array <code class="language-plaintext highlighter-rouge">b</code> of length <code class="language-plaintext highlighter-rouge">4</code> then these can’t be broadcast together, but if you add a dimension to each, alternating so that they go from being <code class="language-plaintext highlighter-rouge">(3),(4)</code> \(\rightarrow\) <code class="language-plaintext highlighter-rouge">(3, 1), (1, 4)</code>, then they can:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">a</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span><span class="o">*</span><span class="n">b</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">array</span><span class="p">([[</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="p">[</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span>
<span class="p">[</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">12</span><span class="p">]])</span>
</code></pre></div></div>
<p>The process of “copying” along the axis is called <a href="https://numpy.org/doc/stable/user/basics.broadcasting.html">broadcasting</a>, and it can be a little hard to think about, so when it doesn’t throw errors there’s a tendency to assume it must be correct.</p>
<p>Consider:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rewards</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">64</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">dones</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">64</span><span class="p">))</span>
</code></pre></div></div>
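<p>A couple of asserts on the shapes here would have caught what follows immediately:</p>

```python
import numpy as np

rewards = np.ones((64, 1))   # Keras-style output: the first dim indexes the batch
dones = np.ones((64,))

assert (rewards * dones).shape == (64, 64)       # silent broadcast blow-up
assert (dones * rewards[:, 0]).shape == (64,)    # the intended elementwise product
```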
<p>then what shape is <code class="language-plaintext highlighter-rouge">rewards * dones</code>? The answer is <code class="language-plaintext highlighter-rouge">(64, 64)</code> and not, as I had been expecting, <code class="language-plaintext highlighter-rouge">64</code>. In fairness <code class="language-plaintext highlighter-rouge">rewards</code> was actually the predicted rewards and was returned from a Keras model, and I’d forgotten that these return an array whose first dimension indexes the batch. The solution was just to use <code class="language-plaintext highlighter-rouge">dones * rewards[:, 0]</code>, which effectively pops off the last dimension so that both shapes are just <code class="language-plaintext highlighter-rouge">64</code>. This wasn’t actually a massive issue and was absorbed later by <a href="https://www.tensorflow.org/api_docs/python/tf/math/reduce_mean"><code class="language-plaintext highlighter-rouge">reduce_mean</code></a>, which stopped me noticing the shape was larger than I thought it was. At worst it was just more computation than needed!</p>
<p><br /></p>
<p><strong>Reinforcement learning, Deep Deterministic Policy Gradient</strong> (2020-12-22)</p>
<p><sup><strong>note</strong>: <em>Relevant code for this post is <a href="https://github.com/mauicv/BipedalWalker-v2-ddpg">here</a></em></sup></p>
<hr />
<h2 id="continuous-control">Continuous Control</h2>
<p>It’s been a while since I last posted. I’ve had a lot of work on recently, but I’ve started a new project that’s been the motivation for learning a lot of this reinforcement learning material. Namely, I want to build robots and train their control systems using reinforcement learning algorithms. Thus far the algorithms I’ve discussed here have been for discrete action spaces. That doesn’t work for continuous control systems of the type I’m hoping to create. This post will be a rundown of deep deterministic policy gradients (DDPG), a reinforcement learning algorithm suited to continuous control tasks. This means things like motors that can apply a torque over a continuous range of values rather than ones that are either on or off. In this post I’m going to talk about theory, but mostly only to explain the parts that took me a while to understand.</p>
<hr />
<h2 id="why-ddpg">Why DDPG?</h2>
<p>So I think it is actually possible to use <a href="/reinforcement-learning/2020/05/24/policy-gradient-methods.html">REINFORCE</a> to do continuous control. What you’d do is have a policy (a function that, given a state, suggests an action to take) that generates a continuous probability distribution. You could do this by having the model output a mean and a variance and then sampling an action from the distribution those describe. You’d have to use importance sampling to ensure the actions the policy is more likely to suggest aren’t overweighted in the reinforcement learning updates. Then at the end, instead of using that probability distribution to sample actions, you’d just take the mean itself as the action and ignore the variance. This is very similar to the discrete action case, but instead of a probability vector you’re having to get the model to describe a continuous distribution.</p>
<p>I tried this and struggled with it. I’m not sure why exactly but in the end I read about DDPG and the consensus seemed to be that it was best for continuous control tasks. DDPG differs from REINFORCE in a number of ways. It’s similar however in that it still uses a policy. The policy is the core object we’re interested in here. Initially it’s just going to suggest random actions but by the end of the training it should have learned to solve the environment in order to maximise rewards for the agent. The differences are as follows.</p>
<ul>
<li>Firstly DDPG requires using a critic. A critic is a function that takes a state and action and returns the expected discounted reward the agent will receive over the whole orbit if they take that action and from then on take the actions dictated by the policy. So the critic at any state is a function of just the possible actions at that state. Everything else is fixed, including all the future actions as given by the policy. What this means is that the critic should give an estimate of the outcome for each possible action given a state. To find the best action you’d just choose the action that maximises the critic. The critic aims to estimate \(C(s_i, a_i) = \mathbb{E}_{p}\big(\sum_{j \ge 0} \gamma^j r_{i+j} \mid s_i, a_i\big)\).
\(\gamma\) here is the discount factor. It weights rewards in the immediate future more heavily than rewards in the distant future, the idea being that a reward received soon after an action is more likely to be a result of that action than a reward received long after. The important thing to note here is that if you know the values \((s_i,a_i)\), \((s_{i+1},a_{i+1})\) and \(r_i\), which are a pair of consecutive state-action pairs in an orbit and the true reward for transitioning from state \(s_i\) to \(s_{i+1}\), then the critic’s estimate should satisfy:</li>
</ul>
\[C(s_i, a_i) = r_i + \gamma*C(s_{i+1}, a_{i+1})\]
<ul>
<li>
<p>Secondly, instead of recording a set of memories for just one episode and then updating your policy on the basis of just those, you instead have a memory buffer which stores the relevant states, actions and rewards over many episodes. This ends up being a <a href="https://en.wikipedia.org/wiki/Circular_buffer">circular buffer</a> where the entries look like: <code class="language-plaintext highlighter-rouge">(state, next_state, action, reward)</code>. Whenever you want to train the policy and critic you’re going to take a random batch of samples from this buffer as the training data.</p>
</li>
<li>
<p>Finally, DDPG uses target models, which are basically copies of the actor (policy) and critic that are updated more slowly than the actual actor and critic. I’m going to ignore these for now and mention them at the end, but basically they deal with the fact that the actual actor and critic can change quite a lot on each training step, which leads to instability in the learning.</p>
</li>
</ul>
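<p>For a perfect critic, one that returns the true discounted sum, the Bellman identity above holds on any reward sequence, which makes a nice numerical sanity check. A toy sketch, with the discount value assumed:</p>

```python
import numpy as np

gamma = 0.99  # discount factor (value assumed for illustration)

def discounted_return(rewards, gamma):
    """The true value sum_j gamma^j * r_j, i.e. what a perfect critic reports."""
    return sum(gamma ** j * r for j, r in enumerate(rewards))

rewards = [1.0, 0.5, 2.0, 0.0]
# Bellman consistency: C(s_0, a_0) = r_0 + gamma * C(s_1, a_1)
lhs = discounted_return(rewards, gamma)
rhs = rewards[0] + gamma * discounted_return(rewards[1:], gamma)
assert np.isclose(lhs, rhs)
```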
<p>When training DDPG we use the policy model, plus some noise for exploration, to sample from the environment the values <code class="language-plaintext highlighter-rouge">(state, next_state, action, reward)</code> that we then place in the replay buffer. You might ask why we need the policy model at all, seeing as we can select the action to take in any state by finding the action that maximises the critic. The reason this typically isn’t feasible is the size of the action space. A discrete action space usually has a small set of possible actions: think 2 engines for the Lunar Lander, each on or off, making 4 possible combinations of values in the action space to maximise the critic over. In that case it’s feasible. Whereas in a 2-dimensional continuous action space, to maximise correctly you’d have to subdivide each dimension into small bits and then consider all the combinations, which ends up being \(n^2\) where \(n\) is the number of subdivisions you make. With \(m\) dimensions it’s then \(n^m\). This quickly becomes prohibitive.</p>
<p>So maximising over the critic takes too long, and instead what you do is use this policy model, which tries to maximise the critic as it changes over the training period by climbing it using gradient ascent. This basically means that instead of starting from scratch every time you search for the maximising value, you instead use the policy to track the correct value over time. The issue with this approach is that the policy may get stuck in a local maximum.</p>
<hr />
<h2 id="training">Training:</h2>
<p>Training with DDPG uses two rounds: the first updates the critic using temporal differences and the second updates the policy by gradient ascent. Bear in mind that at each training step the set of values we have is sampled from the replay buffer and takes the form <code class="language-plaintext highlighter-rouge">(state, next_state, action, reward)</code>.</p>
<h3 id="critic">Critic:</h3>
<p>To train the critic we just take the initial prediction of the critic for a selected <code class="language-plaintext highlighter-rouge">(state, action)</code> pair. Call this value \(c(s_{i}, a_{i})\). We then get the <code class="language-plaintext highlighter-rouge">next_action</code> value by plugging the <code class="language-plaintext highlighter-rouge">next_state</code> into the policy, and then compute the next critic value \(c(s_{i+1},a_{i+1})\), where \(a_{i+1} = P(s_{i+1})\) and \(s_{i+1}\) is the <code class="language-plaintext highlighter-rouge">next_state</code>. Now we have the critic value for <code class="language-plaintext highlighter-rouge">(state, action)</code> and the critic value for <code class="language-plaintext highlighter-rouge">(next_state, next_action)</code>. If \(c(s_{i}, a_{i})\) is accurate then it should equal \(\gamma * c(s_{i+1},a_{i+1})\) plus the reward obtained at that state. The full equation:</p>
\[c(s_{i}, a_{i}) = r_{i} + \gamma * c(s_{i+1},a_{i+1})\]
<p>If the critic is untrained then it won’t equal the above but we now have a target to train the critic towards. Namely the difference in the above:</p>
\[t_i = r_{i} + \gamma * c(s_{i+1},a_{i+1}) - c(s_{i}, a_{i})\]
<p>To update the critic, on each training run you’re going to minimise the square of the above using gradient descent over the batch of samples you’ve taken from the replay buffer.</p>
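<p>As a sketch of the target computation over one sampled batch (numpy only, with toy numbers; the <code class="language-plaintext highlighter-rouge">dones</code> mask that zeroes the bootstrap term at terminal states is my addition and isn’t discussed above):</p>

```python
import numpy as np

gamma = 0.99  # assumed discount factor

def critic_targets(rewards, next_q, dones, gamma=gamma):
    """TD targets r + gamma * c(s', P(s')); the (1 - dones) mask zeroes the
    bootstrap term at terminal states (an addition not covered in the text)."""
    return rewards + gamma * next_q * (1.0 - dones)

rewards = np.array([1.0, 0.0, -1.0])
next_q = np.array([10.0, 5.0, 0.0])   # critic's estimates at (next_state, next_action)
dones = np.array([0.0, 0.0, 1.0])

targets = critic_targets(rewards, next_q, dones)
assert np.allclose(targets, [10.9, 4.95, -1.0])
# The critic is then fit by gradient descent on mean((targets - c(s, a))**2).
```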
<h3 id="actor">Actor:</h3>
<p>Training the actor/policy is simply a case of asking it to climb the critic function. So if you take the gradient of \(C(s_i, P_{\omega}(s_i))\) with respect to the policy parameters \(\omega\), you can compute how best to change \(P\) in order to increase the value of \(C\). Again you do this over the batch of samples taken from the replay buffer.</p>
<h3 id="target-actor-and-critic">Target Actor and critic</h3>
<p>So the above is a slight simplification in that it doesn’t talk about the target models we use to add stability. Basically when you compute</p>
\[t_i = r_{i} + \gamma * c(s_{i+1},a_{i+1}) - c(s_{i}, a_{i})\]
<p>instead of computing \(c(s_{i+1},a_{i+1}) = c(s_{i+1},p(s_{i+1}))\) we use:</p>
\[c_{targ}(s_{i+1},a_{i+1}) = c_{targ}(s_{i+1},p_{targ}(s_{i+1}))\]
<p>where \(c_{targ}\) and \(p_{targ}\) are the target critic and target actor and are just copies of the critic and actor that are updated much slower. By this I mean whenever you update the actor and critic, you then update the target actor and target critic like so:</p>
\[\omega_{a_{targ}} \leftarrow \omega_{a}\tau + \omega_{a_{targ}}(1 - \tau)\]
\[\omega_{c_{targ}} \leftarrow \omega_{c}\tau + \omega_{c_{targ}}(1 - \tau)\]
<p>where \(\omega_{c_{targ}}\) and \(\omega_{a_{targ}}\) denote the model parameters for the target critic and target actor, and \(\tau\) is some small value, usually around \(0.05\).</p>
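<p>The soft update itself is a one-liner per weight array. A numpy sketch:</p>

```python
import numpy as np

TAU = 0.05  # the value used above

def soft_update(target_weights, online_weights, tau=TAU):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * tw
            for tw, w in zip(target_weights, online_weights)]

online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = soft_update(target, online)
assert np.allclose(target[0], 0.05)   # target crept 5% of the way to online
```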
<hr />
<h3 id="full-algorithm">Full algorithm:</h3>
<p>The full algorithm, taken from the <a href="https://arxiv.org/pdf/1509.02971.pdf">original paper</a> by Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver and Daan Wierstra, researched at Google DeepMind, is as follows:</p>
<p><img src="/assets/deep-deterministic-policy-gradients/ddpg-algo.png" alt="ddpg-algo" /></p>
<hr />
<h2 id="outcomes">Outcomes</h2>
<p>It’s kind of hard to believe the above works. So here is proof it does:</p>
<p><img src="/assets/deep-deterministic-policy-gradients/ending.gif" alt="ddpg-bipedal-walker" /></p>
<p><br /></p>
<p><strong>Machine Learning, Computer Generated Faces</strong> (2020-07-21)</p>
<hr />
<h2 id="motivation">Motivation</h2>
<p>I started my computer-generated pictures adventure by trying to get the computer to generate images of doodles I’d drawn. I have this giant doodle I’d been noodling away on for a while, and by taking a set of pictures of it and then subsampling bits of it I hoped that this would be enough to generate new doodles in the same style. I got a little lost. Remember, I’m new to all this stuff, and the main thing I’m coming up against is not really knowing what to expect from the process of computer learning: how long before you give up on a model learning something, how a model can fail, what it looks like for a model to fail, and so on. I intend to come back to the doodle example, but in order to better understand what was going on I decided to opt for a dataset with better benchmarks, namely faces. The sole aim here is to end up with a computer-generated picture of a face that is at least vaguely believably human.</p>
<p><strong>Note</strong>: I trained everything below using this <a href="https://github.com/muxspace/facial_expressions">dataset</a>.</p>
<hr />
<h2 id="naive-approach">Naive Approach</h2>
<p>So the first question I had was: why can you not just train something like a reversed categorisation model? So instead of taking images and returning a category, you take a category and return an image. The principle should be the same: you just swap the inputs and outputs, and instead of computing the categorical cross entropy between the predicted category probabilities and the real category you use a loss function that computes the difference between two images. I tried this out and got this slightly disturbing pair.</p>
<p><img src="/assets/generating-faces/happy-and-sad.png" alt="happy-and-sad" /></p>
<p>The main thing to note here is that it’s obviously generating an image that’s in some sense an average of all smiley faces or non-smiley faces, rather than a specific instance of a smiley face. This makes sense because there is only one input, either 0 or 1, and it’s expected to map each of these values to an image that best represents a wide range of pictures of faces. If it were to create a face that is very face-like then it might be close to some faces within that set, but it would necessarily also be far away from others. So instead of an instance of an identifiably human face you get an eerie blurred mask that represents lots of faces all at once.</p>
<p>So I kind of expected this to happen. The issue is that the inputs don’t give any room for manoeuvre. You’re asking these two values to describe the entirety of the dataset, so of course the output will be a blended version of all the images. The natural solution to this is to make the space that represents the faces larger. There are some obvious difficulties to navigate here. Suppose you had a better labelling system. Instead of 0 or 1, let’s make it a continuum. In the above example 0 was not-happy and 1 was happy, so with a continuum we could represent a sliding scale of happiness where some faces are happier than others. Then we can add other dimensions: instead of just a happiness dimension we can have a hair dimension and a nose-size dimension and so on… If you’re willing to spend the time going through the dataset and labelling each image by where you think it falls within this set of feature dimensions you’ve defined, then maybe you’ll get more human-like faces out the other end. I’ve obviously not tried this because labelling large datasets to this degree of complexity is going to be very time intensive.</p>
<hr />
<h2 id="generative-adversarial-networks">Generative Adversarial Networks</h2>
<p>So one way to solve the above problem is to ask the model to extract enough features from the dataset that make a face a face, and then combine those somehow to generate a face. You don’t stipulate the set of features or require they be exhaustive; you just ask that it collect enough of them and combine them so as to be accurate. This means the model may end up ignoring parts of the dataset and just focus on faces that have short hair, for instance. By doing this we remove the constraint the naive approach suffers from, namely that it minimises loss between the output and the whole dataset. Instead we just ask that the model create something that passably belongs to the dataset.</p>
<p>I’m going to focus on these types of model, known as Generative Adversarial Networks (GANs). With a GAN you create two networks, one that generates images and one that tries to distinguish generated from real images, and you set them in competition. So you ask the generator network to create a batch of fake images and make their labels 0, as in False or Fake. You also sample a batch of real images with labels 1, as in True or Real. You train the discriminator against this labelled data, and in turn you train the generator by defining its loss to depend on the number of generated images the discriminator correctly detected. So the generator is trying to get as many fake images past the discriminator as possible.</p>
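<p>The batch construction described above can be sketched as follows, with the generator replaced by a stand-in linear map (everything here is illustrative, not the actual training code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, latent_dim, img_dim = 32, 64, 28 * 28

# Stand-in generator: a single linear map (a real one is a deep network).
g_weights = rng.normal(size=(latent_dim, img_dim)) * 0.01

def generator(z):
    return z @ g_weights

real_images = rng.random(size=(batch_size, img_dim))
z = rng.normal(size=(batch_size, latent_dim))   # latent noise input
fake_images = generator(z)

# One discriminator batch: real images labelled 1, generated images labelled 0.
images = np.concatenate([real_images, fake_images])
labels = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])

assert images.shape == (2 * batch_size, img_dim)
# The generator is then trained with the labels flipped: it is rewarded
# for every fake the discriminator scores as real.
```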
<p>For the generator we give as input a random point in some latent space. The idea here is that you’re telling it to map these points to faces, and by giving it room to manoeuvre it can generate as much variation in faces as it can pack into the size of latent space you give it. The eventual mapping is going to be randomly assigned, so one area may end up encoding faces with glasses, another without, and another beards or smiles and so on… We don’t require that the generator try to reproduce the dataset in its entirety; we just want at least one instance of a face, and this means that the generator may just decide to use the entire space to create a single instance of a face. This is known as mode collapse, and it’s an issue generally with GANs, but in my case if we get something human-looking I’m happy.</p>
<p>Here are the best results I got with this approach:</p>
<p><img src="/assets/generating-faces/simple-gan.png" alt="simple-gan-results" /></p>
<p>Yeah, so not great! I tried this approach for a while before I got quite annoyed by the inconsistency of what it seemed to be producing. Not really knowing what to expect from this learning process, I gave up on this model and tried something I read about <a href="https://arxiv.org/abs/1903.06048">here</a>.</p>
<hr />
<h2 id="multiple-scale-gradient-gans">Multiple Scale Gradient GANs</h2>
<p>This approach maps the different resolutions of the data between the relevant layers of each of the generator and discriminator networks. This means the generator produces images at multiple different resolutions and the discriminator looks at each of these different-resolution images and tries to figure out how real or fake each one is. I wouldn’t say it’s super clear to me what this is doing, except that intuitively it makes the network prefer learning lower resolution features before building higher resolution ones on top. By building up a hierarchy of learnt features like this you aid stability in learning.</p>
<p>Anyway, using this approach I started getting stuff that was actually passable.</p>
<p><img src="/assets/generating-faces/bald-man-with-glasses.png" alt="bald-man-with-glasses" />
<img src="/assets/generating-faces/happy-chappy.png" alt="happy-chappy" />
<img src="/assets/generating-faces/camera-flash.png" alt="camera-flash" />
<img src="/assets/generating-faces/big-head.png" alt="big-head" /></p>
<p><img src="/assets/generating-faces/tiled-faces-msg-gan.png" alt="msg-gan-faces" />
<img src="/assets/generating-faces/tiled-faces-msg-gan-2.png" alt="msg-gan-faces" /></p>
<p>So this is a significant improvement on the initial attempt. I was pleased with these results, but do note that they’re nowhere near close to <a href="https://thispersondoesnotexist.com/">what’s possible</a>! Training took a long time, which I think is typical, but also my laptop is slooooooooowwwww.</p>
<hr />
<h2 id="how-it-learns">How it Learns</h2>
<p>This section is speculation, but I think what ends up going on in the learning process must be that the discriminator network picks up low resolution features on which to focus to detect face-like qualities. This very quickly means that it can differentiate the random noise initially outputted by the generator from the real faces in the dataset. In competition with this, the generator has to work to match the set of features the discriminator is looking for. Once it’s done so, the discriminator has to find a new feature to use in order to distinguish between the generated and real data. I think this must continue as a progression in which the pair learn to detect and create successive features. From watching the sequence of images emitted during training, it would definitely seem like this process happens in order of feature resolution. So for instance the first thing the discriminator learns is that there is a grey blob in the middle of the screen, then it starts to see darker blobs where the eyes and mouth would be, and so on until it’s furnishing the finer details such as painting in the whites of the eyes. Because of this you’d expect the generator and discriminator loss functions to oscillate in competition with each other. So when a new feature is discovered by the discriminator it should outperform the generator, and when the generator learns to match the feature set the discriminator has derived it should push the discriminator’s loss up. This seems to be what happens:</p>
<p><img src="/assets/generating-faces/gans-losses.png" alt="gans-losses" /></p>
<p>The above also illustrates the major frustration with these networks: there is no strong measure of how much has been learnt, because each of the above loss functions exists relative to the other. Hence the only thing you can really do is display a picture of a face at each stage of learning and decide whether or not it is more or less facey than previous outputs. This is compounded by the fact that the network is learning over the input space, so some areas, by virtue of being less visited, will be less trained, and so a single generated image doesn’t capture everything the network has learnt.</p>
<p>It also seems that because the generator is learning to fool the discriminator, what it learns is very dependent on what the discriminator is choosing to focus on. It’s not clear to me that the discriminator doesn’t unlearn features if they no longer provide a good indication of real or fake. For instance, if the discriminator learns to detect noses as indicative of real images, and in turn the generator learns to perfectly create noses to fool the discriminator, then when the discriminator moves on to some other feature I don’t think there’s any reason to assume it preserves what it’s learnt about noses. It may sacrifice the nose knowledge in pursuit of the next feature. Clearly it must capture everything on average, otherwise this method wouldn’t work, but surveying the images the generator produces over the course of training, they sort of oscillate in and out of levels of accuracy in different features, and sometimes it’s as if the generator and discriminator are wandering over the dataset rather than focusing on a particular member of it. This all makes it hard to know if it’s learning.</p>
<p>So yeah, that basically concludes my experience thus far using GANs. One final thing, however, that I was not entirely prepared for: as the generator gets better and better at producing pictures of faces it also spends more time wandering around in the uncanny valley. Here are some of the horrors summoned in the process of getting the above:</p>
<p><img src="/assets/generating-faces/screaming-ghost.png" alt="screaming-ghost" />
<img src="/assets/generating-faces/erm-msg-gan.png" alt="demon" width="100" />
<img src="/assets/generating-faces/angry-dude-msg-gan.png" alt="angry-dude" />
<img src="/assets/generating-faces/demon-msg--gan.png" alt="demon" />
<img src="/assets/generating-faces/ghostly-chap.png" alt="demon" width="100" />
<img src="/assets/generating-faces/black-eye.png" alt="demon" width="100" /></p>
<p><img src="/assets/generating-faces/skeletal.png" alt="skeletal" />
<img src="/assets/generating-faces/uncanny-valley.png" alt="uncanny-valley" />
<img src="/assets/generating-faces/weird-eyes-1.png" alt="weird-eyes" />
<img src="/assets/generating-faces/locals.png" alt="locals" /></p>
<hr />
<h2 id="next-steps">Next steps</h2>
<p>Next I want to see how the discriminator has learnt features differently to how typical categorization models do. You can get an idea of the features a network has learnt by giving it an image and asking it to change the image to maximise activation of certain layers. I tried this with a model that I trained to discriminate happy and not-happy faces as a very naive attempt at generative modelling and was disappointed by the results. I figured the set of features that these networks might be learning doesn’t have to be the set of features you’d expect. Certainly it’s going to ignore large amounts of the data if it’s peripheral to the task at hand. So it’s not going to learn anything about a person’s hair because that’s common across photos of happy and not-happy people. Anyway it’ll be interesting to do the same thing with the discriminator trained above as it should have extracted a diverse range of features.</p>
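<p>The “change the image to maximise activation” idea above is just gradient ascent on the input pixels. A minimal sketch with numpy, using a single linear filter as a stand-in for a network layer (a real version would backpropagate through the trained discriminator instead):</p>

```python
import numpy as np

def activation(img, w):
    # Stand-in for the activation of a chosen layer: a linear filter
    # response. A real network layer would be differentiated with autodiff.
    return float(np.sum(w * img))

def maximise_activation(img, w, steps=100, lr=0.1):
    img = img.copy()
    for _ in range(steps):
        grad = w                                  # d(activation)/d(img) for the linear case
        img = np.clip(img + lr * grad, 0.0, 1.0)  # ascend, keep pixels valid
    return img

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))   # the "feature" the layer responds to
start = rng.uniform(size=(8, 8))
out = maximise_activation(start, w)
# Pixels with positive weight get pushed towards 1, negative towards 0,
# so the image comes to resemble the feature the layer detects.
```

<p>Running the same loop against a layer of the trained discriminator should paint something like the features it uses to tell real faces from fakes.</p>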
<p>Now that I know what to expect from this process I also want to see how the above crosses over to the doodle example I was initially exploring. I don’t have super high expectations, simply because the data set in that case is pretty small, but we’ll see what happens.</p>Trying with differing degrees of success to train the computer to generate pictures of faces. Some look kind of human, some look like the regulars at a bar I used to work.Machine Learning, Recovering Obscured Handwritten Numbers2020-07-14T23:00:00+00:002020-07-14T23:00:00+00:00https://mauicv.com/machine-learning/2020/07/14/unobscuring-handwritten-numbers<hr />
<p>So this was mostly just a test to see how well we can train a model to take obscured handwriting and output a cleaned version of the image. I was just starting out at this stage coming to understand the kinds of models that machine learning allows you to build. I wanted to see if this would be as easy as I thought it would be.</p>
<p>I’m using the MNIST handwritten digits data set. For each image I’m going to build a pipeline that spits out as input for the model an image that’s been partially obscured by blobs of different sizes. The label that we want the model to be matching against is then just the unblemished article.</p>
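<p>The obscuring side of that pipeline can be sketched like so (a numpy sketch with my own naming, not the exact code used): draw a few random filled circles over each image and keep the clean image as the label.</p>

```python
import numpy as np

def obscure(image, n_blobs=3, max_radius=5, rng=None):
    """Paint random filled circles over a greyscale image with values in [0, 1]."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    out = image.copy()
    ys, xs = np.ogrid[:h, :w]
    for _ in range(n_blobs):
        cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
        r = int(rng.integers(1, max_radius + 1))
        mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= r ** 2
        out[mask] = 1.0  # white blob covering part of the digit
    return out

clean = np.zeros((28, 28))  # stand-in for one MNIST image
blobbed = obscure(clean, rng=np.random.default_rng(0))
# Training pair for the model: input=blobbed, label=clean
```
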
<p>This is some typical output on an unseen test data set. The first column is the handwritten digits with randomly positioned spheres on top. The second is the output from the neural network after training, and the third is the unblemished digits.</p>
<p><img src="/assets/unobscuring-handwritten-letters/blobbed-numbers.png" alt="demon" height="220" width="40" />
<img src="/assets/unobscuring-handwritten-letters/unblobbed-numbers.png" alt="demon" height="220" width="40" />
<img src="/assets/unobscuring-handwritten-letters/clean-numbers.png" alt="demon" height="220" width="40" /></p>
<p>So this worked pretty well and was super fast. This whole experience was pretty encouraging and also I’m afraid completely unrepresentative of everything that followed.</p>Training a model to recover obscured handwritingMetropolis Hastings Algorithm, Numerical Evidence2020-05-25T23:00:00+00:002020-05-25T23:00:00+00:00https://mauicv.com/network-theory/2020/05/25/metropolis-walks-on-graphs<p><sup><strong>note</strong>: <em>A <a href="https://github.com/mauicv/graph-notebooks">jupyter notebook</a> version of this post is available here.</em></sup></p>
<hr />
<h2 id="metropolis-hastings-random-walk-on-graphs">Metropolis Hastings Random Walk on Graphs</h2>
<p>We’re going to build a slightly simpler notion of graph here than in part 1. Firstly it’s going to be bidirectional, so a connection between nodes a and b corresponds to a connection between b and a. Instead of building a graph out of text I’m going to build one by taking a collection of nodes and then connecting each pair with some probability. We’re also going to assume the weights on a single node are equally distributed, so for any random walker moving on this graph, if it’s in a particular state and that state is connected to \(n\) other states, then the probability that the walker moves to each of those states is \(1/n\).</p>
<p>The following wall of code (sorry) is basically the same as in part 1, except the <code class="language-plaintext highlighter-rouge">TransitionMatrix</code> matrix multiplication method now has an extra case included for a new class I’ve defined called <code class="language-plaintext highlighter-rouge">MetropolisWalker</code>. This way the matrix is applied differently to Metropolis Walkers than to normal walkers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="c1"># For running random walks on the graph
</span><span class="k">class</span> <span class="nc">Walker</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">=</span> <span class="n">state</span>
<span class="c1"># For running Metropolis Hastings algorithm on the graph
</span><span class="k">class</span> <span class="nc">MetropolisWalker</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">=</span> <span class="n">state</span>
<span class="c1"># Abstraction of single state such as in the Walker class to a distribution of states.
</span><span class="k">class</span> <span class="nc">State</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">states</span> <span class="o">=</span> <span class="p">{</span><span class="n">state</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">}</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">people</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
<span class="n">density</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">person</span><span class="p">]</span> <span class="k">for</span> <span class="n">person</span> <span class="ow">in</span> <span class="n">people</span><span class="p">]</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">people</span><span class="p">))</span>
<span class="c1"># density.sort(reverse = True)
</span> <span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">y_pos</span><span class="p">,</span> <span class="n">density</span><span class="p">,</span> <span class="n">align</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'density'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Node Visiting Rates'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="o">@</span><span class="nb">classmethod</span>
<span class="k">def</span> <span class="nf">from_orbit</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">orbit</span><span class="p">):</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="n">orbit</span><span class="p">)</span>
<span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">:</span>
<span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">point</span><span class="p">]</span> <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">point</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">orbit</span><span class="p">)</span>
<span class="k">return</span> <span class="n">instance</span>
<span class="o">@</span><span class="nb">classmethod</span>
<span class="k">def</span> <span class="nf">from_uniform</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
<span class="k">return</span> <span class="n">instance</span>
<span class="k">def</span> <span class="nf">__sub__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="n">difference</span> <span class="o">=</span> <span class="n">State</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">difference</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">difference</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">-</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="k">return</span> <span class="n">difference</span>
<span class="k">def</span> <span class="nf">dist</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="n">diff</span> <span class="o">=</span> <span class="n">other</span> <span class="o">-</span> <span class="bp">self</span>
<span class="k">return</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">([</span><span class="n">v</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">diff</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">()]))</span>
<span class="c1"># Going to map between states with this class
</span><span class="k">class</span> <span class="nc">TransitionMatrix</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">p</span> <span class="o">=</span> <span class="p">{</span><span class="n">state</span><span class="p">:</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">}</span>
<span class="o">@</span><span class="nb">classmethod</span>
<span class="k">def</span> <span class="nf">uniform_bidirected</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">states</span><span class="p">,</span> <span class="n">alpha</span><span class="p">):</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">instance</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">num_connections</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">instance</span><span class="p">.</span><span class="n">p</span><span class="p">))</span>
<span class="n">connections</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">instance</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">keys</span><span class="p">()),</span> <span class="n">num_connections</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">connections</span><span class="p">:</span>
<span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">></span> <span class="n">alpha</span><span class="p">:</span>
<span class="n">instance</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">key</span><span class="p">][</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">instance</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">c</span><span class="p">][</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">instance</span><span class="p">.</span><span class="n">_normalize</span><span class="p">()</span>
<span class="k">return</span> <span class="n">instance</span>
<span class="k">def</span> <span class="nf">_normalize</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">row_total</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">count</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">val</span><span class="p">.</span><span class="n">items</span><span class="p">()])</span>
<span class="k">for</span> <span class="n">target</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">val</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">key</span><span class="p">][</span><span class="n">target</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">key</span><span class="p">][</span><span class="n">target</span><span class="p">]</span><span class="o">/</span><span class="n">row_total</span>
<span class="k">def</span> <span class="nf">__matmul__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="s">"""If applying the transition matrix class to a Walker class then select the next state at random.
If applying to a State class we generate a new distribution. """</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">State</span><span class="p">):</span>
<span class="n">new_state</span> <span class="o">=</span> <span class="n">State</span><span class="p">([</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">()])</span>
<span class="k">for</span> <span class="n">s_1</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">sum_p_s2_s1</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">s_2</span><span class="p">,</span> <span class="n">P</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">sum_p_s2_s1</span> <span class="o">=</span> <span class="n">sum_p_s2_s1</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">s_2</span><span class="p">]</span><span class="o">*</span><span class="n">P</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">s_1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">new_state</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">s_1</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum_p_s2_s1</span>
<span class="k">return</span> <span class="n">new_state</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">Walker</span><span class="p">):</span>
<span class="n">ps</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">]</span>
<span class="n">choices</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">keys</span><span class="p">()]</span>
<span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">values</span><span class="p">()]</span>
<span class="n">choice</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="n">choices</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="n">weights</span><span class="p">).</span><span class="n">pop</span><span class="p">()</span>
<span class="k">return</span> <span class="n">Walker</span><span class="p">(</span><span class="n">choice</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">MetropolisWalker</span><span class="p">):</span>
<span class="n">ps</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">]</span>
<span class="n">choices</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">keys</span><span class="p">()]</span>
<span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">values</span><span class="p">()]</span>
<span class="n">choice</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="n">choices</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="n">weights</span><span class="p">).</span><span class="n">pop</span><span class="p">()</span>
<span class="n">p_c_s</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">choice</span><span class="p">][</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">]</span>
<span class="n">p_s_c</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">][</span><span class="n">choice</span><span class="p">]</span>
<span class="n">p_a</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
<span class="k">if</span> <span class="n">p_a</span> <span class="o"><</span> <span class="nb">min</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p_c_s</span><span class="o">/</span><span class="n">p_s_c</span><span class="p">):</span>
<span class="k">return</span> <span class="n">MetropolisWalker</span><span class="p">(</span><span class="n">choice</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">MetropolisWalker</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="the-metropolis-hastings-algorithm">The Metropolis Hastings Algorithm</h2>
<hr />
<p>The Metropolis Hastings algorithm is a random walk on the graph that aims to obtain a representative sample of the graph’s nodes independent of their connectivity. In a typical random walk a walker on a node <code class="language-plaintext highlighter-rouge">a</code> goes to a node <code class="language-plaintext highlighter-rouge">b</code> with probability \(1/n_{a}\) where \(n_{a}\) is the number of nodes to which <code class="language-plaintext highlighter-rouge">a</code> is connected. In the Metropolis Hastings random walk the probabilities have been adjusted. It works like so:</p>
<ol>
<li>On node <code class="language-plaintext highlighter-rouge">a</code> pick a connected node <code class="language-plaintext highlighter-rouge">b</code> at random.</li>
<li>Let \(p_{a,b}\) be the probability of going from <code class="language-plaintext highlighter-rouge">a</code> to <code class="language-plaintext highlighter-rouge">b</code>. This will be \(1/n_{a}\). Let \(p_{b,a}\) be the probability of going from <code class="language-plaintext highlighter-rouge">b</code> to <code class="language-plaintext highlighter-rouge">a</code>, or \(1/n_{b}\).</li>
<li>Let \(p\) be the minimum of \(1\) and \(\frac{p_{b,a}}{p_{a,b}}\)</li>
<li>With probability \(p\) go from <code class="language-plaintext highlighter-rouge">a</code> to <code class="language-plaintext highlighter-rouge">b</code>, otherwise stay at <code class="language-plaintext highlighter-rouge">a</code></li>
<li>Repeat</li>
</ol>
<p>The trick here happens at point 3. If \(p_{b,a}\) is greater than \(p_{a,b}\), that is if <code class="language-plaintext highlighter-rouge">b</code> is less connected than <code class="language-plaintext highlighter-rouge">a</code>, then you go to <code class="language-plaintext highlighter-rouge">b</code> with probability 1, while a move towards a better connected node is only accepted with probability \(n_{a}/n_{b}\). By making this adjustment you reverse the natural tendency for random walks in graphs to visit highly connected nodes more than less connected nodes.</p>
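<p>As a small numerical check of the acceptance rule (this mirrors the <code class="language-plaintext highlighter-rouge">MetropolisWalker</code> branch of the transition matrix above):</p>

```python
def acceptance(n_a, n_b):
    # Probability of accepting a proposed move from node a (degree n_a)
    # to node b (degree n_b): min(1, p_ba / p_ab) = min(1, n_a / n_b)
    p_ab = 1 / n_a
    p_ba = 1 / n_b
    return min(1, p_ba / p_ab)

# A move towards a less connected node is always accepted...
assert acceptance(4, 2) == 1
# ...while a move towards a better connected node is damped:
assert acceptance(2, 4) == 0.5
```
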
<p>It’s slightly hard to believe this works. Let’s look first at the numerical simulations and see what happens:</p>
<h2 id="numerical-simulations">Numerical Simulations:</h2>
<hr />
<p>Below we model a network made up of 500 nodes that represent people. The graph displays the number of social links per person.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>
<span class="n">fake</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="n">states</span> <span class="o">=</span> <span class="p">[</span><span class="n">fake</span><span class="p">.</span><span class="n">name</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">500</span><span class="p">)]</span>
<span class="n">T</span> <span class="o">=</span> <span class="n">TransitionMatrix</span><span class="p">.</span><span class="n">uniform_bidirected</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">)</span>
<span class="n">people</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
<span class="n">num_friends</span> <span class="o">=</span> <span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">person</span><span class="p">])</span> <span class="k">for</span> <span class="n">person</span> <span class="ow">in</span> <span class="n">people</span><span class="p">]</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">people</span><span class="p">))</span>
<span class="n">num_friends</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">reverse</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">y_pos</span><span class="p">,</span> <span class="n">num_friends</span><span class="p">,</span> <span class="n">align</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'density'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Number Connection Per Person'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/metropolis-walks-on-graphs/metropolis-walks-on-graphs_4_0.png" alt="png" /></p>
<p>Let’s see what happens now if we run a random walk orbit on the network. The graph displayed shows the rates at which the orbit visits each node. As you can see it’s not uniform and some nodes are visited far more than others.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">uniform</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_uniform</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">person</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">orbit</span> <span class="o">=</span> <span class="p">[</span><span class="n">Walker</span><span class="p">(</span><span class="n">person</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3000000</span><span class="p">):</span>
    <span class="n">orbit</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">orbit_dist</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Node Visiting Rates: '</span><span class="p">,</span> <span class="n">orbit_dist</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
<span class="n">orbit_dist</span><span class="p">.</span><span class="n">draw</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Node Visiting Rates: 0.013756605858492924
</code></pre></div></div>
<p><img src="/assets/metropolis-walks-on-graphs/metropolis-walks-on-graphs_6_1.png" alt="png" /></p>
<p>Now, instead of a normal random walk, let’s try a Metropolis Hastings random walk and see what we get:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">person</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">orbit</span> <span class="o">=</span> <span class="p">[</span><span class="n">MetropolisWalker</span><span class="p">(</span><span class="n">person</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3000000</span><span class="p">):</span>
    <span class="n">orbit</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">orbit_dist</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'distance from uniform: '</span><span class="p">,</span> <span class="n">orbit_dist</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
<span class="n">orbit_dist</span><span class="p">.</span><span class="n">draw</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>distance from uniform: 0.0007634611742639561
</code></pre></div></div>
<p><img src="/assets/metropolis-walks-on-graphs/metropolis-walks-on-graphs_8_1.png" alt="png" /></p>
<p>So as you can see, the Metropolis Hastings random walk looks like it’s converging to the uniform distribution. This means that the orbit is visiting the nodes representatively instead of being biased towards nodes with greater numbers of connections. We can compare the orbits more explicitly.</p>
<p><strong>Warning</strong>: If you’re running the following, it will take a long time and is not the best way of doing this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">person</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">uniform</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_uniform</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="n">orbit_1</span> <span class="o">=</span> <span class="p">[</span><span class="n">MetropolisWalker</span><span class="p">(</span><span class="n">person</span><span class="p">)]</span>
<span class="n">orbit_2</span> <span class="o">=</span> <span class="p">[</span><span class="n">Walker</span><span class="p">(</span><span class="n">person</span><span class="p">)]</span>
<span class="n">differences_1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">differences_2</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">30000</span><span class="p">):</span>
    <span class="n">orbit_1</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit_1</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">orbit_2</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit_2</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">orbit_dist_1</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit_1</span><span class="p">])</span>
    <span class="n">orbit_dist_2</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit_2</span><span class="p">])</span>
    <span class="n">differences_1</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">orbit_dist_1</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
    <span class="n">differences_2</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">orbit_dist_2</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">differences_1</span><span class="p">[</span><span class="mi">1000</span><span class="p">:])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">differences_2</span><span class="p">[</span><span class="mi">1000</span><span class="p">:])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Distance between orbit distributions'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'difference'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Metropolis walker: {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">differences_1</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Normal walker: {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">differences_2</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div></div>
<p><img src="/assets/metropolis-walks-on-graphs/metropolis-walks-on-graphs_10_0.png" alt="png" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metropolis walker: 0.0073974708134618376
Normal walker: 0.014881319869752028
</code></pre></div></div>
<p>The blue line above shows the convergence of the metropolis walker towards the uniform distribution while the orange is the normal random walker. You can clearly see that the orange line is bounded away from zero whereas the metropolis walker is convergent.</p>Computing numerical evidence of the Metropolis Hastings AlgorithmReinforcement learning, Policy Gradient Methods2020-05-24T23:00:00+00:002020-05-24T23:00:00+00:00https://mauicv.com/reinforcement-learning/2020/05/24/policy-gradient-methods<p><sup><strong>note</strong>: <em>Relevant code for this post is <a href="https://github.com/mauicv/openai-gym-solns">here</a></em></sup></p>
<hr />
<h2 id="policy-gradient-methods">Policy Gradient Methods</h2>
<p>So the reinforcement learning problem is a state space, and actor and a policy. The actor transitions between states in the state space on the basis of the actions it’s taking at those states. The actions it chooses to take are selected by the policy. The policy is just some function that takes the state as input and spits out advice on the best action to take. The policy is usually modelled by a neural network which means we can lean on machine learning algorithms to improve it’s performance over time. The class of methods I’m going to introduce here are Policy Gradient Methods. In particular I’m implementing REINFORCE.</p>
<p><sup><strong>Note</strong>: <em>Here I assume you know rudimentary details of how neural networks and machine learning work. Like what they look like, that they’re parametrised by lots of weights and biases. How you can compute the derivative w.r.t. these parameters using back-propagation. How the derivative of a loss function w.r.t. these parameters tells you what direction to change the parameters in order to improve the network’s performance… These kinds of things.</em></sup></p>
<p>So one potential solution to the above problem is to take a policy function that takes as input the state the actor is in and then gives as output a distribution that tells us which actions are most likely to result in fulfilling the goal. Initially this function is just a guess: it gives random probabilities for the best action. The goal of training is to encourage the policy function to get better and better at suggesting the best action to take at each state. So suppose you’re solving the <a href="https://gym.openai.com/envs/LunarLander-v2/">lunar lander</a> environment and you record the actions that this function dictates at each state in the actor’s (the spaceship’s) trajectory. At the end you look through each of the actions and each of the states and ask which actions resulted in positive outcomes and which resulted in negative outcomes. You then want to encourage those that resulted in success and discourage those that didn’t.</p>
<p>At each state we’ll give the policy the location of the shuttle and it’ll pass that data through each of its layers and output a vector of values. We’re going to assume the set of actions are discrete, so engine on or off, rather than continuous, engine fire at 60% or maybe 65% or 65.562%. This means that the vector of values above corresponds to the probability of which engine to fire. So given a state we compute the action probabilities and then sample from these probabilities the action we take. Because of this, during training we don’t always do the same thing: we do some things much more often, but occasionally we try stuff that the policy is mostly advising against. We then use back-propagation to obtain the parameter change in the policy weights and biases that’s going to result in the network suggesting actions that do well more often than actions that do badly.</p>
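<p>The sample-from-the-output step can be sketched as follows. This is a minimal illustration under assumptions: a made-up single-layer “policy” (a random weight matrix) stands in for a real trained network, with 8 state variables and 4 discrete actions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the policy network: one linear layer plus softmax.
W = rng.normal(size=(4, 8))  # 4 discrete actions, 8 state variables

def action_probabilities(state):
    logits = W @ state
    exps = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exps / exps.sum()

state = rng.normal(size=8)
probs = action_probabilities(state)       # a probability vector over the actions
action = rng.choice(len(probs), p=probs)  # sample, rather than take the argmax
```

Sampling rather than taking the argmax is what gives the exploratory behaviour described above: unlikely actions still get tried occasionally.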
<p>Denote a policy by \(\pi\) and the set of parameters underlying it, a vector of weights and biases, \(\theta\). The set of states and actions generated by taking a state, computing the policy, sampling an action, transitioning to a new state and repeating is an orbit; denote such an object \(\tau\). These look like:</p>
\[\tau_{\theta} = [(s_{0}, a_{0}), (s_{1}, a_{1}), ..., (s_{i}, a_{i}), (s_{i+1}, a_{i+1}), ....]\]
<p>Where \(a_{i}\) is an action sampled from the policy probability distribution \(\pi_{\theta}(s_{i})\). If the system is deterministic then given a state and an action we can directly compute the subsequent state. So \((s_{i}, a_{i}) \rightarrow s_{i+1}\). If the system is not deterministic then you conduct some experiment and sample the next state from the probability distribution of states given by taking action \(a_{i}\) in state \(s_{i}\).</p>
<p>At the end of the orbit the actor has either achieved its goal or failed to do so. If it has achieved it we have to go back through the set of actions it has taken and find a way of allocating rewards to each on the basis of how well it performed.</p>
<p>When we decide we’re going to encourage an action in a given state then we need first to know how the network changes with respect to its parameters, the weights and biases. This change in parameter space is given by the derivative of the model output with respect to those parameters. Knowing this we can use an update rule that looks something like:</p>
\[\theta \rightarrow \theta + \sigma A\nabla \pi_{\theta}( a_{i}\| s_{i})\]
<p>Where \(\nabla \pi_{\theta}(a_{i}\| s_{i})\) is the derivative of the policy w.r.t. \(\theta\) <strong>at the chosen action for that state</strong>. Adding it to \(\theta\) is like saying walk up or down the probability density hill so as to make that action more or less likely. A is the reward we decided to allocate that action and is positive if we think \(a_{i}\) resulted in a good outcome and negative if not. So then the above should make \(a_{i}\) more likely if it was good and less if it was bad. Finally \(\sigma\) is the learning rate.</p>
<h4 id="the-problem-with-the-above">The problem with the above</h4>
<p>We don’t always get the optimal solution. Suppose we have a system with only one state and two actions. One of those actions has a big reward and the other a little reward. Suppose the way the policy network is initialized means that it suggests the low reward over the high reward with a high probability. Ideally this shouldn’t be a problem: over training the network should end up reassigning probability towards the better action. Unfortunately, because the policy is initially incorrectly biased towards the poor reward option, we’re going to get far more samples of this action than the other, and because it has a positive, albeit lower, reward the training will end up encouraging this action more. This is simply because it gets more samples for it. A sufficiently large amount of a small thing can be more than a small number of a big thing. It’s a bit like failing to learn to do something in a new and better way because you want to prioritize doing it a worse way that you know well. This means we have to counteract this behaviour by incorporating the policy probabilities themselves into the update rule.</p>
<p>So if the policy suggests an action and it returns a positive reward then we update in favour of that action dependent on how likely the policy was to suggest that action. This way updates to low reward actions that the policy suggests a lot are balanced by the higher probabilities of the policy suggesting them. High reward actions that the policy isn’t likely to suggest are boosted to make up for the smaller likelihood of sampling them. The way we do this is just to divide by the probability of selecting an action.</p>
\[\theta \rightarrow \theta + \sigma A\frac{\nabla \pi_{\theta}(a_{i}\|s_{i})}{ \pi_{\theta}(a_{i}\|s_{i})}\]
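<p>A quick simulation of the one-state, two-action example shows why the correction matters. This is a sketch under assumed numbers (rewards 0.1 and 1.0, a softmax policy initially about 95% biased towards the poor action): the naive update drifts further towards the frequently sampled low-reward action, while the corrected update, which divides by the sampled probability, recovers the high-reward one.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.1, 1.0])   # action 1 pays ten times more...
lr, steps = 0.1, 5000

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run(corrected):
    logits = np.array([3.0, 0.0])  # ...but the policy starts ~95% sure of action 0
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choice(2, p=probs)
        grad_pi = probs[a] * (np.eye(2)[a] - probs)  # d pi(a) / d logits
        if corrected:
            grad_pi = grad_pi / probs[a]             # divide by pi(a)
        logits = logits + lr * rewards[a] * grad_pi
    return softmax(logits)

naive = run(corrected=False)      # ends up preferring the low-reward action
fixed = run(corrected=True)       # ends up preferring the high-reward action
```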
<p>There are a couple of ways to set A. It’s the amount we’re going to encourage the network to take that action next time it finds itself in the same state. In other words it’s how you evaluate the quality of the action taken with respect to the outcome received. For example a naive approach would just have it be a constant positive number if the task is completed correctly and a constant negative number if incorrectly.</p>
<p>The final issue we have is that the above function contains the derivative \(\nabla \pi_{\theta}(a_{i}\|s_{i})\) which is inconvenient if we’re using a machine learning framework like <a href="https://www.tensorflow.org/">TensorFlow</a> to implement this algorithm. This is because TensorFlow expects a loss function that returns a scalar value. To solve this we can use the following:</p>
\[\nabla \log(f) = \nabla f / f\]
<p>to get:</p>
\[\theta \rightarrow \theta + \sigma A\nabla \log(\pi_{\theta}(a_{i}\|s_{i}))\]
<p>Which would make the loss function:</p>
\[\log(\pi_{\theta}(a_{i}\|s_{i}))\]
<hr />
<h2 id="the-algorithm">The Algorithm</h2>
<p>Assuming naive constant positive or negative rewards:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Sample an initial random state
2. Initialize an empty array to store episodic memory
3. For n steps:
   - Sample an action from the policy dependent on current state,
   - Take the action and move the actor into the new environment state
   - Record the action and new state in the episodic memory
4. If the actor was successful set A = 1 if unsuccessful set A = -1
5. Update the policy
6. Repeat for as many episodes as needed
</code></pre></div></div>
<p>Alternatives of the above discount the rewards back in time from the actor achieving reward by some value \(\gamma < 1\). So if the actor is successful then the update on the action \(n\) steps before the end of the episode is weighted by \(\gamma^n\). This represents the fact that actions the actor takes just before it is successful should be rewarded more than actions taken further back in time.</p>
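<p>The discounting described above can be sketched as a small helper. This is just an illustration; the function name and the \(\gamma = 0.99\) default are my own choices, not from the post’s repository:</p>

```python
def discounted_returns(rewards, gamma=0.99):
    """Work backwards through an episode so that the return at step t is
    r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ...
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]
```

With a single terminal reward of 1, this reproduces the \(\gamma^n\) weighting: <code class="language-plaintext highlighter-rouge">discounted_returns([0, 0, 1], gamma=0.5)</code> gives <code class="language-plaintext highlighter-rouge">[0.25, 0.5, 1.0]</code>.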
<p>You can also assign rewards not just for completing the task at hand but also at intermittent stages in the process. In this case you’d record those rewards in episodic memory at the same time as the state and action that led to them. You’d then discount that reward back in time from when it was obtained.</p>Intuition on Policy Gradient MethodsMetropolis Hastings Algorithm, Random walks on Graphs2020-05-21T23:00:00+00:002020-05-21T23:00:00+00:00https://mauicv.com/network-theory/2020/05/21/random-walks-on-graphs<p><sup><strong>note</strong>: <em>A <a href="https://github.com/mauicv/graph-notebooks">jupyter notebook</a> version of this post is available here.</em></sup></p>
<hr />
<h2 id="intro">Intro</h2>
<p>So graphs are made up of nodes and edges. We’re going to build, on top of a graph, a random dynamical system known as a random walk that moves around on it. By doing so and recording each of the steps the random dynamical system takes, we will manage to extract samples of the population of nodes in the graph.</p>
<p>Typically, if you allow the random walk to move between nodes by selecting the next node uniformly from the set of connections, the collection of nodes you end up with in your orbit isn’t a uniform distribution of the network nodes. Instead you get something biased towards nodes with higher connection counts. There is a trick to modifying the nature of the random walk so that it gets a uniform distribution across all the network nodes. This is known as the Metropolis Hastings algorithm, which basically chooses the subsequent nodes in the orbit with higher probability if they’re less well connected.</p>
<p>In the case of the small graphs we’re going to look at, this sampling method isn’t really needed because you could just choose a random sample from the nodes. This type of technique is useful for sampling networks from which you cannot just draw a sample. In the case of twitter for instance you cannot select a user at random. In order to fix this you could run a Metropolis Hastings random walk on the network and after a long time you’d get the sample you want. (It would be a very long time. Twitter’s API rate limits basically make this impossible in reality, <a href="https://github.com/mauicv/Metropolis-Hastings-Random-Walk">I tried</a>)</p>
<p>Suppose you have a collection of states \(\{s_1, ..., s_n\}\); these will be the nodes on the graph. Imagine a little chap we’ll call a walker who’s sat at one of these nodes/states. We’re going to denote the probability of the walker at \(s_i\) moving to \(s_j\) as given by \(p_{i,j}\). I’m going to use node and state somewhat interchangeably. The difference is just that a node is a specific node in the network whereas a state refers to the state of the dynamical system. So the location of the walker for instance, which would be at a specific node.</p>
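<p>One way to picture this in code (a sketch with a made-up three-state graph, not the <code class="language-plaintext highlighter-rouge">TransitionMatrix</code> and <code class="language-plaintext highlighter-rouge">Walker</code> classes the post builds) is a dictionary of transition probabilities \(p_{i,j}\) whose rows sum to 1, plus a step function that samples the next state:</p>

```python
import random

# Hypothetical transition probabilities: p[i][j] is the chance of moving i -> j.
p = {
    's1': {'s2': 0.5, 's3': 0.5},
    's2': {'s1': 1.0},
    's3': {'s1': 0.5, 's2': 0.5},
}

def step(state):
    targets, weights = zip(*p[state].items())
    return random.choices(targets, weights=weights)[0]

random.seed(0)
walk = ['s1']
for _ in range(10):
    walk.append(step(walk[-1]))  # the recorded walk is an orbit
```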
<p>The states/nodes can really be anything. For instance they could encode letters, and the probabilities \(p_{a, b}\) would be the probability of finding the letters ‘a’ and ‘b’ next to each other in a sentence.</p>
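<p>For the letters example, those probabilities can be estimated by counting adjacent pairs in some text. A quick self-contained sketch (the post defines its own <code class="language-plaintext highlighter-rouge">pairwise</code> helper in the next section; it is re-declared here so this snippet runs on its own):</p>

```python
from collections import Counter, defaultdict
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def transition_probs(text):
    """Estimate p[a][b]: the probability that letter b follows letter a."""
    counts = defaultdict(Counter)
    for a, b in pairwise(text):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

p = transition_probs('hello world')
```

Here <code class="language-plaintext highlighter-rouge">p['l']</code> comes out as one third each for ‘l’, ‘o’ and ‘d’, since each of those three letters follows an ‘l’ exactly once.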
<h2 id="imports">Imports</h2>
<hr />
<p>Before we start I’m going to import a ton of stuff. The pairwise function will be useful for turning orbits into pairwise transitions between points. We can then use this to count frequencies of transitions and so approximate probabilities. I’m going to use it to train approximations of transition matrices for sentences, where a sentence is an orbit on a network with letters for nodes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">tee</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span><span class="p">;</span> <span class="n">plt</span><span class="p">.</span><span class="n">rcdefaults</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="c1"># For generating pairwise elements in a list. [1,2,3,4,5] becomes (1,2), (2,3), (3,4), (4,5)
</span><span class="k">def</span> <span class="nf">pairwise</span><span class="p">(</span><span class="n">iterable</span><span class="p">):</span>
    <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">tee</span><span class="p">(</span><span class="n">iterable</span><span class="p">)</span>
    <span class="nb">next</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">zip</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="c1"># example:
</span><span class="n">orbit</span> <span class="o">=</span> <span class="s">'hello'</span>
<span class="k">for</span> <span class="n">s_1</span><span class="p">,</span> <span class="n">s_2</span> <span class="ow">in</span> <span class="n">pairwise</span><span class="p">(</span><span class="n">orbit</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">s_1</span><span class="p">,</span> <span class="s">'->'</span><span class="p">,</span> <span class="n">s_2</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>h -> e
e -> l
l -> l
l -> o
</code></pre></div></div>
<p>I’m going to represent three objects of interest using classes.</p>
<p>The first is a walker class. We’re going to use this to generate random walks from the graph. It stores only a single state so for instance perhaps the letter ‘l’.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># For running random walks on the graph
</span><span class="k">class</span> <span class="nc">Walker</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">state</span> <span class="o">=</span> <span class="n">state</span>
</code></pre></div></div>
<p>The next is called State and represents distributions of walkers. So picture each walker as an individual in a population. Suppose for this population 50% of the walker individuals are sat on the letter ‘l’ then we represent that by setting the dictionary value for ‘l’ to 0.5. This is a vector representation of how things, in this case our walkers, are distributed across the network. Most of the methods I’ve defined on this class are for visualizing the distribution or creating specific forms of the distribution. We’ll see more later.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Abstraction of single state such as in the Walker class to a distribution of states.
</span><span class="k">class</span> <span class="nc">State</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">states</span> <span class="o">=</span> <span class="p">{</span><span class="n">state</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">}</span>
    <span class="k">def</span> <span class="nf">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">string</span> <span class="o">=</span> <span class="s">''</span>
        <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">print</span><span class="p">(</span><span class="s">'-------------------------------------------------------------------------------'</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="n">string</span> <span class="o">=</span> <span class="n">string</span> <span class="o">+</span> <span class="s">'{} : {:.5f} | '</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">count</span> <span class="o">></span> <span class="mi">5</span><span class="p">:</span>
                <span class="n">string</span> <span class="o">=</span> <span class="n">string</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>
                <span class="n">count</span> <span class="o">=</span> <span class="mf">0.</span>
        <span class="k">return</span> <span class="n">string</span>
    <span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">objects</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
        <span class="n">y_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">objects</span><span class="p">))</span>
        <span class="n">performance</span> <span class="o">=</span> <span class="p">[</span><span class="n">val</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">()]</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">y_pos</span><span class="p">,</span> <span class="n">performance</span><span class="p">,</span> <span class="n">align</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">y_pos</span><span class="p">,</span> <span class="n">objects</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'density'</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Word occurrence orbit frequency'</span><span class="p">)</span>
        <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    <span class="o">@</span><span class="nb">classmethod</span>
    <span class="k">def</span> <span class="nf">from_orbit</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">orbit</span><span class="p">):</span>
        <span class="n">instance</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="n">orbit</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">point</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">:</span>
            <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">point</span><span class="p">]</span> <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">point</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">orbit</span><span class="p">)</span>
<span class="k">return</span> <span class="n">instance</span>
<span class="o">@</span><span class="nb">classmethod</span>
<span class="k">def</span> <span class="nf">from_uniform</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
<span class="n">instance</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="n">states</span><span class="p">)</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">instance</span><span class="p">.</span><span class="n">states</span><span class="p">)</span>
<span class="k">return</span> <span class="n">instance</span>
<span class="k">def</span> <span class="nf">__sub__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="n">difference</span> <span class="o">=</span> <span class="n">State</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">difference</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">difference</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">-</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="k">return</span> <span class="n">difference</span>
<span class="k">def</span> <span class="nf">dist</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="n">diff</span> <span class="o">=</span> <span class="n">other</span> <span class="o">-</span> <span class="bp">self</span>
<span class="k">return</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">sum</span><span class="p">([</span><span class="n">v</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">diff</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">()]))</span>
</code></pre></div></div>
<p>The third class, the transition matrix class, encodes the probabilities of transitions between states. In this case we’re looking at transitions between letters. This class acts on both the Walker and State classes via the <code class="language-plaintext highlighter-rouge">__matmul__</code> method (more below). Acting on the Walker class, the matrix randomly updates the node/state the walker is at to one of the nodes it is connected to, where the likelihood of choosing a particular connected node is proportional to the transition probability \(p_{i,j}\). Acting on the State class, if we think of the State class as representing a large population of individual walkers, where a value of 0.5 associated to a node/state means that half the population of walkers is located at that node, then the transition matrix gives the new distribution of walkers after each of them has taken a random step. The density picture of a population of walkers is really just a heuristic: what this vector best represents is the probability of finding a walker at a given node. The action of a transition matrix \(M\) on this density vector \(v(t)\) at time \(t\) is just matrix multiplication:</p>
\[v_i(t+1) = M_{0,i}v_0(t) + M_{1,i}v_1(t) + ... + M_{n,i}v_n(t)\]
<p>This just tells us that the new density of walkers at state \(i\) is given by the density at each state times the transition probability from that state to state \(i\), i.e. summing the terms \(M_{j,i}v_j(t)\).</p>
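<p>As a minimal standalone sketch of this update rule (plain dicts rather than the classes in this post, with a made-up two-state matrix):</p>

```python
# Minimal sketch of the density update v(t+1) = M^T v(t) using plain dicts.
# M[a][b] is the probability of moving from state a to state b.
M = {'a': {'a': 0.5, 'b': 0.5}, 'b': {'a': 1.0}}
v = {'a': 1.0, 'b': 0.0}  # all walkers start at 'a'

def step(M, v):
    new_v = {s: 0.0 for s in v}
    for s2, density in v.items():    # density at source state s2
        for s1, p in M[s2].items():  # transition probability s2 -> s1
            new_v[s1] += density * p
    return new_v

v1 = step(M, v)   # {'a': 0.5, 'b': 0.5}
v2 = step(M, v1)  # {'a': 0.75, 'b': 0.25}
```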
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># Going to map between states with this class
</span><span class="k">class</span> <span class="nc">TransitionMatrix</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">p</span> <span class="o">=</span> <span class="p">{</span><span class="n">state</span><span class="p">:</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">states</span><span class="p">}</span>
<span class="k">if</span> <span class="n">data</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">states</span>
<span class="bp">self</span><span class="p">.</span><span class="n">_count</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">_normalize</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">_count</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">states</span><span class="p">):</span>
<span class="k">for</span> <span class="n">letter_1</span><span class="p">,</span> <span class="n">letter_2</span> <span class="ow">in</span> <span class="n">pairwise</span><span class="p">(</span><span class="n">states</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">letter_1</span><span class="p">][</span><span class="n">letter_2</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">letter_1</span><span class="p">].</span><span class="n">get</span><span class="p">(</span><span class="n">letter_2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">def</span> <span class="nf">_normalize</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">row_total</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">count</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">val</span><span class="p">.</span><span class="n">items</span><span class="p">()])</span>
<span class="k">for</span> <span class="n">target</span><span class="p">,</span> <span class="n">count</span> <span class="ow">in</span> <span class="n">val</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">key</span><span class="p">][</span><span class="n">target</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">key</span><span class="p">][</span><span class="n">target</span><span class="p">]</span><span class="o">/</span><span class="n">row_total</span>
<span class="k">def</span> <span class="nf">__matmul__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="s">"""If applying the transistion matrix class to a Walker class then select the next state at random.
If applying to a State class we generate a new distrbution. """</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">State</span><span class="p">):</span>
<span class="n">new_state</span> <span class="o">=</span> <span class="n">State</span><span class="p">([</span><span class="n">s</span> <span class="k">for</span> <span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">()])</span>
<span class="k">for</span> <span class="n">s_1</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">sum_p_s2_s1</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">s_2</span><span class="p">,</span> <span class="n">P</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">sum_p_s2_s1</span> <span class="o">=</span> <span class="n">sum_p_s2_s1</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">s_2</span><span class="p">]</span><span class="o">*</span><span class="n">P</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">s_1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">new_state</span><span class="p">.</span><span class="n">states</span><span class="p">[</span><span class="n">s_1</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum_p_s2_s1</span>
<span class="k">return</span> <span class="n">new_state</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">other</span><span class="p">,</span> <span class="n">Walker</span><span class="p">):</span>
<span class="n">ps</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">other</span><span class="p">.</span><span class="n">state</span><span class="p">]</span>
<span class="n">choices</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">keys</span><span class="p">()]</span>
<span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">ps</span><span class="p">.</span><span class="n">values</span><span class="p">()]</span>
<span class="n">choice</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="n">choices</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="n">weights</span><span class="p">).</span><span class="n">pop</span><span class="p">()</span>
<span class="k">return</span> <span class="n">Walker</span><span class="p">(</span><span class="n">choice</span><span class="p">)</span>
</code></pre></div></div>
<p>The main feature of the above class is the <code class="language-plaintext highlighter-rouge">__matmul__</code> method. This means if we have an instance <code class="language-plaintext highlighter-rouge">T</code> of a <code class="language-plaintext highlighter-rouge">TransitionMatrix</code> class and an instance <code class="language-plaintext highlighter-rouge">w</code> of a Walker class we can write:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">T</span><span class="o">@</span><span class="n">w</span>
</code></pre></div></div>
<p>Then in <code class="language-plaintext highlighter-rouge">__matmul__(self, other)</code>, <code class="language-plaintext highlighter-rouge">self</code> is the <code class="language-plaintext highlighter-rouge">TransitionMatrix</code> instance and <code class="language-plaintext highlighter-rouge">other</code> is the Walker. I’ve set this up to evolve both States and Walkers, so the method first checks which type the other object is. In our case it’s going to be a Walker. <code class="language-plaintext highlighter-rouge">self.p</code> is a dictionary that encodes the transition probabilities, and <code class="language-plaintext highlighter-rouge">self.p[other.state]</code> is the set of connected nodes and their probabilities from <code class="language-plaintext highlighter-rouge">other.state</code>, the node on which the walker is sitting.</p>
<p>So we take the probability vector <code class="language-plaintext highlighter-rouge">self.p[other.state]</code>, unpack the states and the probabilities, and then use Python’s built-in <code class="language-plaintext highlighter-rouge">random</code> module to choose one of the states at random in proportion to its weight. We then return a new Walker in that state.</p>
<p><strong>Note</strong> <em>you don’t need to use <code class="language-plaintext highlighter-rouge">@</code>; there’s nothing special about this operator, and it would probably be more readable to create a method called multiply instead.</em></p>
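<p>For example, the same dispatch could be written as an ordinary method. Here is a sketch with a hypothetical <code class="language-plaintext highlighter-rouge">multiply</code> method on a stripped-down version of the class (not the post’s actual implementation):</p>

```python
import random

class TransitionMatrix:
    def __init__(self, p):
        self.p = p  # dict-of-dicts of transition probabilities

    def multiply(self, state):
        # pick the next state in proportion to the transition weights
        ps = self.p[state]
        return random.choices(list(ps), weights=list(ps.values())).pop()

T = TransitionMatrix({'a': {'b': 1.0}, 'b': {'a': 0.5, 'b': 0.5}})
next_state = T.multiply('a')  # always 'b' here, since p('a' -> 'b') is 1
```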
<h2 id="creating-a-transistion-matrix">Creating A Transistion Matrix</h2>
<hr />
<p>So I’m going to build the transition matrix associated to the letter pairs in the introduction above. The class counts each adjacent pair of letters and records the frequency with which each pair occurs. It then normalizes these frequencies so that, for each state, the probabilities of transitioning to the other states sum to 1.</p>
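<p>The counting step relies on a <code class="language-plaintext highlighter-rouge">pairwise</code> helper that isn’t shown in the post; the standard itertools recipe, together with the count-then-normalize logic as a standalone sketch, looks something like:</p>

```python
from itertools import tee

def pairwise(iterable):
    # standard itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Count adjacent-pair frequencies, then normalize each row to sum to 1.
counts = {}
for l1, l2 in pairwise("abab"):
    counts.setdefault(l1, {})
    counts[l1][l2] = counts[l1].get(l2, 0) + 1

probs = {s: {t: c / sum(row.values()) for t, c in row.items()}
         for s, row in counts.items()}
# probs == {'a': {'b': 1.0}, 'b': {'a': 1.0}}
```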
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="n">intro</span> <span class="o">=</span> <span class="s">"""So graphs are made up of nodes and edges. We're going to build on top of a graph a random dynamical system that's going to move around on it known as a random walk. Doing so and recording each of the steps the random dynamical system takes we will manage to extract samples of the nodes in the graph. Typically if you allow the random walk to move between node by selecting the next node uniformly from the set of connections you don't get a uniform distribution of the network nodes. Instead you get something biased towards nodes with higher connection counts. There is a trick to modifying the nature of the random walk so that it gets a uniform distribution across all the network nodes this is known as the Metropolis Hastings algorithm which basically chooses subsequent nodes with higher probability if they're less connected. In the case of the small graphs we're going to look at this sampling method isn't really needed because you could just choose a random sample from the nodes. This type of technique is useful for sampling networks from which you can just draw a sample. In the case of twitter for instance you cannot select a user at random this means any sample you draw from the twitter network will be biased. In order to fix this you could run a Metropolis Hastings random walk on the network and after a long time you'd get the sample your want. (It would be a very long time. Twitters API rate limits basically make this impossible in reality) Suppose you have a collection of states ${s_1,...,s_n}$ these will be the nodes on the graph. Imagine a little chap we'll call a walker whose sat at one of these nodes/states. We're going to denote the probability of the walker at $s_i$ moving to $s_j$ as given by $p_{i,j}$. 
The states/nodes can really be anything, for instance they could encode letters and the probabilities $p_{a, b}$ would be the probability for finding the letters 'a' and 'b' next to each other in a sentence."""</span>
<span class="n">cleaned_intro</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\W+'</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">intro</span><span class="p">)</span>
<span class="n">T</span> <span class="o">=</span> <span class="n">TransitionMatrix</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
</code></pre></div></div>
<p>So now we can see which letters can go to which other letters and how likely they are to do so. We just look at <code class="language-plaintext highlighter-rouge">T.p['u']</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="s">'u'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'p': 0.05128205128205128,
'i': 0.02564102564102564,
'n': 0.15384615384615385,
'a': 0.05128205128205128,
'd': 0.07692307692307693,
'b': 0.07692307692307693,
't': 0.05128205128205128,
'g': 0.02564102564102564,
'r': 0.05128205128205128,
'e': 0.05128205128205128,
'c': 0.10256410256410256,
'l': 0.15384615384615385,
's': 0.10256410256410256,
'h': 0.02564102564102564}
</code></pre></div></div>
<p>We can verify the sum of these values is 1. The setup here is all done in the <code class="language-plaintext highlighter-rouge">_normalize</code> and <code class="language-plaintext highlighter-rouge">_count</code> methods on the <code class="language-plaintext highlighter-rouge">TransitionMatrix</code> class.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sum</span><span class="p">([</span><span class="n">v</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span><span class="n">v</span> <span class="ow">in</span> <span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="s">'u'</span><span class="p">].</span><span class="n">items</span><span class="p">()])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1.0
</code></pre></div></div>
<h2 id="sampling-orbits">Sampling orbits</h2>
<hr />
<p>First let’s draw an orbit from the state space by iteratively applying a transition matrix to a walker.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">State</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">T</span> <span class="o">=</span> <span class="n">TransitionMatrix</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">Walker</span><span class="p">(</span><span class="s">'t'</span><span class="p">)</span>
<span class="n">w</span><span class="p">.</span><span class="n">state</span>
<span class="n">orbit</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">orbit</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="k">print</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['t', 'l', 'l', 'g', 'r', 'o', 'w', 'n', 't', 'h', 'e']
</code></pre></div></div>
<p>We can compute the probability of a given orbit by multiplying the probabilities of each of the transitions. So given an orbit \(o=[s_1, s_2, ..., s_n]\) its likelihood of occurring is:</p>
\[p(o) = \prod_{i=2}^{n}p(s_i\|s_{i-1})\]
<p>where \(p(s_i\| s_{i-1}) = p_{s_{i-1},s_i}\) is the probability of \(s_{i-1}\) going to \(s_i\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p_o</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">s_1</span><span class="p">,</span> <span class="n">s_2</span> <span class="ow">in</span> <span class="n">pairwise</span><span class="p">(</span><span class="n">orbit</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">s_1</span><span class="p">.</span><span class="n">state</span><span class="p">,</span> <span class="s">' -> '</span> <span class="p">,</span><span class="n">s_2</span><span class="p">.</span><span class="n">state</span><span class="p">,</span> <span class="s">' with prob:'</span><span class="p">,</span><span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">s_1</span><span class="p">.</span><span class="n">state</span><span class="p">][</span><span class="n">s_2</span><span class="p">.</span><span class="n">state</span><span class="p">])</span>
<span class="n">p_o</span> <span class="o">=</span> <span class="n">p_o</span><span class="o">*</span><span class="n">T</span><span class="p">.</span><span class="n">p</span><span class="p">[</span><span class="n">s_1</span><span class="p">.</span><span class="n">state</span><span class="p">][</span><span class="n">s_2</span><span class="p">.</span><span class="n">state</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'--------------------------------'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'prob of orbit: '</span><span class="p">,</span> <span class="n">p_o</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t -> l with prob: 0.00684931506849315
l -> l with prob: 0.20833333333333334
l -> g with prob: 0.027777777777777776
g -> r with prob: 0.13157894736842105
r -> o with prob: 0.16
o -> w with prob: 0.028985507246376812
w -> n with prob: 0.05714285714285714
n -> t with prob: 0.11504424778761062
t -> h with prob: 0.3013698630136986
h -> e with prob: 0.5
--------------------------------
prob of orbit: 2.3960031603136377e-11
</code></pre></div></div>
<p>The result is very small, which is to be expected given that the number of possible orbits grows rapidly with orbit length.</p>
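<p>For long orbits such products underflow floating point quickly; a common trick (not used in this post) is to sum log-probabilities instead. A small sketch with made-up transition probabilities:</p>

```python
import math

# Made-up transition probabilities along an orbit.
transition_probs = [0.0068, 0.2083, 0.0278, 0.1316, 0.16]

log_p = sum(math.log(p) for p in transition_probs)  # stable even for long orbits
p_o = math.exp(log_p)  # recover the product while it still fits in a float
```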
<h2 id="distributions">Distributions</h2>
<hr />
<p>Let’s quickly rehash the State class: it represents distributions over the set of letters that make up the nodes of the network.</p>
<p>Suppose we want to draw a representative sample of this network of letters. In this case we’re drawing an analogy between Twitter and the letters in our network: a representative sample of the network just means a representative sample of the nodes, in our case letters (in Twitter’s case, users).</p>
<p>This can be represented by a density that’s uniform across all the letters. It is representative in the sense that, sampling from it, the probability of drawing a particular node is the same as that of drawing any other. These kinds of samples are important if you’re trying to talk about the average member of a population.</p>
<p>I’ve defined a class method that builds this probability vector for you (the assumption of Metropolis Hastings is that this distribution is unreachable, i.e. we cannot sample from it directly):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">State</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">uniform</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_uniform</span><span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">states</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">uniform</span><span class="p">)</span>
<span class="n">uniform</span><span class="p">.</span><span class="n">draw</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-------------------------------------------------------------------------------
S : 0.03125 | o : 0.03125 | g : 0.03125 | r : 0.03125 | a : 0.03125 | p : 0.03125 |
h : 0.03125 | s : 0.03125 | e : 0.03125 | m : 0.03125 | d : 0.03125 | u : 0.03125 |
f : 0.03125 | n : 0.03125 | W : 0.03125 | i : 0.03125 | t : 0.03125 | b : 0.03125 |
l : 0.03125 | y : 0.03125 | c : 0.03125 | v : 0.03125 | k : 0.03125 | w : 0.03125 |
D : 0.03125 | x : 0.03125 | T : 0.03125 | I : 0.03125 | q : 0.03125 | j : 0.03125 |
_ : 0.03125 | 1 : 0.03125 |
</code></pre></div></div>
<p><img src="/assets/random-walks-on-graphs/random-walks-on-graphs_22_1.png" alt="png" /></p>
<p>Let’s now see what the distribution of letters looks like when we sample the network by running a Walker on it for a long time and storing each of the nodes it visits along its path.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">State</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">T</span> <span class="o">=</span> <span class="n">TransitionMatrix</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">Walker</span><span class="p">(</span><span class="s">'t'</span><span class="p">)</span>
<span class="n">w</span><span class="p">.</span><span class="n">state</span>
<span class="n">orbit</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5000</span><span class="p">):</span>
<span class="n">orbit</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">orbit_dist</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">orbit_dist</span><span class="p">)</span>
<span class="n">orbit_dist</span><span class="p">.</span><span class="n">draw</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-------------------------------------------------------------------------------
t : 0.09498 | h : 0.04939 | e : 0.11258 | a : 0.07998 | s : 0.06319 | x : 0.00280 |
l : 0.04079 | d : 0.03679 | b : 0.01900 | y : 0.02300 | w : 0.02120 | o : 0.09098 |
g : 0.02400 | n : 0.07558 | i : 0.05759 | r : 0.04859 | k : 0.01180 | v : 0.00600 |
m : 0.02979 | p : 0.02140 | u : 0.02380 | I : 0.00460 | c : 0.03039 | _ : 0.00320 |
j : 0.00240 | T : 0.00380 | f : 0.01940 | D : 0.00040 | W : 0.00060 | S : 0.00060 |
q : 0.00140 |
</code></pre></div></div>
<p><img src="/assets/random-walks-on-graphs/random-walks-on-graphs_24_1.png" alt="png" /></p>
<p>So this clearly hasn’t converged to a representative sample of the network as it’s heavily biased towards certain nodes. To talk about convergence more clearly we need to define a metric by which to gauge how similar two distributions are.</p>
<h2 id="similarity-of-distributions">Similarity of Distributions:</h2>
<hr />
<p>This is just the Euclidean distance between the two distribution vectors: we subtract the two vectors componentwise, sum the squares of the differences and take the square root of the sum.</p>
\[dist(p_1, p_2) = \sqrt{\sum_{i=0}^{N}(p_1[i] - p_2[i])^2}\]
<p>I’ve added a <code class="language-plaintext highlighter-rouge">dist</code> method to the State class that computes this quantity. So for two equal distributions we get:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">uniform</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.0
</code></pre></div></div>
<p>and for two unequal distributions we have a value greater than zero:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">orbit_dist</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.17698646223343223
</code></pre></div></div>
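<p>For reference, here’s a minimal standalone sketch of what such a <code>dist</code> computation looks like. The actual <code>State</code> internals are an assumption on my part, so plain numpy arrays stand in for the distributions here:</p>

```python
import numpy as np

def dist(p1, p2):
    """Euclidean distance between two probability vectors: subtract
    componentwise, sum the squared differences, take the square root."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return float(np.sqrt(np.sum((p1 - p2) ** 2)))

uniform = np.full(4, 0.25)                # uniform distribution over 4 nodes
skewed = np.array([0.7, 0.1, 0.1, 0.1])   # heavily biased towards one node
```

Identical distributions sit at distance 0; the more two distributions differ, the larger the value.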
<p>This gives us a method for examining the convergence of orbits and allows us to numerically answer:</p>
<ul>
<li>Is it the case that two different orbits are converging to the same distribution?</li>
<li>And how does the orbit distribution converge or not converge w.r.t. the uniform distribution?</li>
</ul>
<p>We can test this. The following graphs show how the orbit distributions of two random walks converge to each other but neither converge to the uniform distribution:</p>
<p><strong>Note</strong>: <em>This is not a fast way of doing this, but it’ll do</em></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">T</span> <span class="o">=</span> <span class="n">TransitionMatrix</span><span class="p">(</span><span class="n">cleaned_intro</span><span class="p">)</span>
<span class="n">orbit_1</span> <span class="o">=</span> <span class="p">[</span><span class="n">Walker</span><span class="p">(</span><span class="s">'t'</span><span class="p">)]</span>
<span class="n">orbit_2</span> <span class="o">=</span> <span class="p">[</span><span class="n">Walker</span><span class="p">(</span><span class="s">'o'</span><span class="p">)]</span>
<span class="n">differences_1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">differences_2</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">differences_3</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5000</span><span class="p">):</span>
<span class="n">orbit_1</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit_1</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">orbit_2</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">T</span><span class="o">@</span><span class="n">orbit_2</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">orbit_dist_1</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit_1</span><span class="p">])</span>
<span class="n">orbit_dist_2</span> <span class="o">=</span> <span class="n">State</span><span class="p">.</span><span class="n">from_orbit</span><span class="p">([</span><span class="n">o</span><span class="p">.</span><span class="n">state</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="n">orbit_2</span><span class="p">])</span>
<span class="n">differences_1</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">orbit_dist_1</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">orbit_dist_2</span><span class="p">))</span>
<span class="n">differences_2</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">orbit_dist_1</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
<span class="n">differences_3</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">orbit_dist_2</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">uniform</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">differences_1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">differences_2</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">differences_3</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Distance between orbit distributions'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'difference'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'time'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/random-walks-on-graphs/random-walks-on-graphs_31_0.png" alt="png" /></p>
<p>Each line in the graph above denotes the distance between one pair of the three distributions (orbit 1, orbit 2 and the uniform distribution) at each step of the orbits. Note that the uniform distribution is fixed, as we don’t evolve it via the transition matrix.</p>
<p>So above we’re running two orbits from different initial points, ‘t’ and ‘o’. These orbits converge towards each other, as demonstrated by the line that slopes down towards 0. Neither orbit, however, converges to the uniform distribution, which is represented by the two lines that plateau around 0.2. In the next post I’m going to show how the Metropolis-Hastings algorithm can be used to adjust a random walk so that it converges to the uniform distribution.</p>Computing orbits of random walks on networksFirst Impressions of Reinforcement Learning2020-05-16T17:37:37+00:002020-05-16T17:37:37+00:00https://mauicv.com/reinforcement-learning/2020/05/16/reinforcement-learning-first-impressions<p><sup><strong>note</strong>: <em>Relevant code for this post is <a href="https://github.com/mauicv/openai-gym-solns">here</a></em></sup></p>
<hr />
<h2 id="first-impressions">First Impressions</h2>
<p>So I’ve recently started looking at reinforcement learning in my spare time because it just seems like a pretty powerful tool. I feel like I sort of threw myself in the deep end: I have no experience with any of the usual machine learning frameworks, and while I roughly understand the underlying theory behind it all, I was pretty rusty and reinforcement learning turned out to be a slight departure from what I already knew. This post is going to be a summary of my experience.</p>
<hr />
<h2 id="general-problem">General Problem</h2>
<p>All Reinforcement Learning takes place in some environment. An environment is really just a collection of states. Floundering around in this environment is an actor or agent who is at any point in time in a single state of the environment. The actor interacts with the environment by making actions. These actions transition the actor from the state it’s in to a new state. We want to train the actor to move to a particular state or collection of states that fulfils some goal of ours.</p>
<p>As an example in the <a href="https://gym.openai.com/envs/LunarLander-v2/">openai lunar lander gym problem</a> the environment is the moon surface, the landing site and the position of the shuttle. The actor is the spaceship and the actions it can take are the firing of each of its engines or the choice not to do anything. We want to train this actor to successfully move between states in the environment so as to land itself at the landing location.</p>
<p>There’s a couple of approaches to solving this kind of problem. The ones I was primarily interested in and spent the most time looking at were policy gradient methods. In policy gradient methods you essentially have a function that tells you how your agent is going to behave given any state in the system. Initially this function is random and useless, but by using it, keeping those suggested actions that result in good outcomes and throwing away those that result in bad ones, you slowly improve it until the set goal is achieved.</p>
<p>The normal approach here is to make the policy function a parametrised neural network that takes the state of the environment and outputs a probability distribution over actions. If you sample an action from this policy and it goes well, you compute the derivative of the log of the policy density at the sampled action and use gradient ascent to make that action more likely for that state. And vice versa: if it goes badly you do the opposite.</p>
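<p>To make that update concrete, here’s a heavily simplified sketch, and very much not the lunar-lander code: a single-state “bandit” with a softmax policy over three discrete actions, written in plain numpy, where only one action is ever rewarded. Sampled actions that pay off have their log-probability pushed up by gradient ascent on the logits:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# One-state toy problem: 3 actions, only action 2 ever pays a reward.
rewards = np.array([0.0, 0.0, 1.0])
theta = np.zeros(3)  # policy logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)   # sample from the current policy
    r = rewards[action]
    # gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += 0.1 * r * grad_log_pi    # ascent: make rewarded actions likelier
```

After enough iterations the policy concentrates nearly all of its probability mass on the rewarding action.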
<p>Usually you don’t actually know if a given action was good or bad until later. This is really the crux of the whole thing, because maybe something the agent did at a time \(t_{1}\) made a significant difference to the outcome that results at time \(t_{2}\). If the time interval \(t_{2} - t_{1}\) is big then it’s not clear that there should be any relationship between the action at \(t_{1}\) and the outcome at \(t_{2}\). There are ways around this in that you can try and shape the rewards allocated throughout the training. If you want an agent to learn to walk then maybe it’s good to reward standing as an action first. This becomes messy though because it’s hard to define what behaviours to reward in between the random behaviour and the end goal. Not just this, it’s also pretty labour intensive.</p>
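<p>One standard mechanical answer to spreading a delayed outcome back over the earlier actions is to credit each step with the discounted sum of the rewards that follow it. A minimal sketch (the choice of <code>gamma = 0.9</code> is just illustrative):</p>

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go at each step: G_t = r_t + gamma * G_{t+1}, so a
    reward at the end is partially credited to the actions before it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For a trajectory whose only reward arrives at the final step, <code>discounted_returns([0, 0, 0, 1])</code> credits the four steps with 0.9³, 0.9², 0.9 and 1 respectively, decaying the further back from the outcome you go.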
<hr />
<h2 id="main-obstacles">Main obstacles</h2>
<p>The major barriers to my personal progress in this domain have definitely been</p>
<ol>
<li>Bits and pieces of ungrepable domain specific knowledge</li>
<li>Not knowing what to expect from Reinforcement algorithm performance</li>
<li>Not knowing how to approach debugging Reinforcement algorithms</li>
</ol>
<h4 id="domain-specific-knowledge">Domain specific knowledge</h4>
<p>Both rl and software development suffer from weird bits and pieces of domain specific knowledge. This kind of thing just comes with the territory. The kind of thing I’m talking about here is stuff that’s hard to search for because you don’t know specifically what’s going wrong. An example is not knowing that you would typically normalize inputs before feeding them into a neural network. If not knowing, and thus not doing, this causes your training to fail, it’s not something you’re going to know to change because, well, you don’t know to do so, at least not until you somehow stumble across it while searching around on the internet…</p>
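<p>To make the normalization point concrete, here’s a sketch of the sort of thing I mean: a running standardiser that tracks mean and variance online using Welford’s algorithm. This is an illustration, not code from any particular framework:</p>

```python
import numpy as np

class RunningNormalizer:
    """Scale observations to roughly zero mean and unit variance using
    statistics accumulated online (Welford's algorithm)."""

    def __init__(self, size):
        self.n = 0
        self.mean = np.zeros(size)
        self.m2 = np.zeros(size)  # running sum of squared deviations

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (np.asarray(x, dtype=float) - self.mean) / std
```

You’d call <code>update</code> on each new observation as it arrives and feed the normalized version to the network.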
<h4 id="expecations-and-debugging">Expectations and Debugging</h4>
<p>In software development we get error messages when stuff’s broken. In contrast, in rl you get nothing and instead are left to form vague hypotheses about why it’s not doing what you want it to do. This is the aspect of the whole process which is perhaps the most frustrating. Instead of having a clear obstacle to navigate you have a collection of possible issues. Even worse, sometimes there isn’t even an issue and what you think is broken is actually just slow, or you’re logging the wrong thing. My feeling is that it’s the sort of thing that requires intuition born of experience, where you eventually learn to pick up the subtle indications of each different type of issue and know the types of things that might solve it. It’s almost as if through an iterative process we’re learning to keep what works and discard what doesn’t…</p>
<p>I don’t have a great deal of experience of machine learning in general but I have a feeling that rl departs from ml in the sense that it’s hard to ascertain when something is learning. Usually in ml you have a loss function which is the object you’re looking to minimize. While training you can graph this function and it’s typically close to monotonic. So while it may not be improving fast, and it may not be converging to a global optimum, you do know whether or not it is improving. In reinforcement learning you get something completely different. Sometimes you get this:</p>
<p><img src="/assets/intro-to-rl/clear-progress.png" alt="progress" height="50%" width="80%" /></p>
<p>And then sometimes you get this:</p>
<p><img src="/assets/intro-to-rl/unclear-progress.png" alt="progress?" height="50%" width="80%" /></p>
<p>The inherent variance in policy gradient methods seems to be a function of two things (or a couple of very vague hypothesis):</p>
<ul>
<li>In machine learning terms, the states are inputs and the rewards are the training data labels. The problem we have within the paradigm of rl is that it’s typically unclear how the rewards need to be allocated, so the labels are moving targets and we have to trust that in sampling the system enough they’ll converge to their true values.</li>
<li>An update in parameter space, if too big, can overshoot its mark and push the network into a suboptimal state. The actor will move along a safe orbit until the overshot update perturbs it off this orbit. If it does this at a particularly inopportune moment then I’d guess you can get some kind of divergence of trajectories, in which orbits that once went to safe areas of the state space are now redirected into low-performing areas. I think because there are delays between action and outcome, and because we discount earlier actions’ significance w.r.t. the end result, it can take a long time for the network to realize that that particular action at that particular time was a mistake. Instead the actor spends ages trying to navigate the poor environment it’s been redirected into.</li>
</ul>
<p>It was this that probably led me to bang my head against a wall the most! Oftentimes you watch the performance of the actor improve and improve until it’s doing really well, and then suddenly it plummets to perform worse than when it started. In retrospect this isn’t always a particularly bad thing, in that in being redirected there the actor will learn how to navigate said poor environment. Building in redundancy like this should result in stable solutions that hold up under perturbation.</p>
<p>These issues led me to take a kind of messy stop-and-start approach to training in some of the rl environments. I’d save the model during training when it was doing well and then revert to the save point if it flatlined. This may have been more to do with my psychology than any sensible strategy, but over time the model did improve even if that approach was disappointingly messy.</p>
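<p>That save-and-revert loop amounts to something like the following sketch. Everything here is hypothetical scaffolding: <code>train_step</code> and <code>evaluate</code> stand in for whatever training and scoring routines you have, <code>copy.deepcopy</code> of a parameter dict stands in for your framework’s model saving, and the “flatlined” threshold of half the best score is an arbitrary choice:</p>

```python
import copy

def train_with_checkpoints(model, train_step, evaluate, steps):
    """Keep a snapshot of the best-scoring model (a dict of parameters)
    seen so far, and revert to it when performance collapses below half
    the best score (assumes scores are positive)."""
    best = copy.deepcopy(model)
    best_score = evaluate(model)
    for _ in range(steps):
        train_step(model)
        score = evaluate(model)
        if score >= best_score:
            best_score, best = score, copy.deepcopy(model)
        elif score < 0.5 * best_score:
            model.clear()                       # revert to the save point
            model.update(copy.deepcopy(best))
    return best, best_score
```

The automated revert is doing what I was doing by hand: the run continues from the last good save point instead of flailing around in the collapsed regime.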
<hr />
<h2 id="main-takeaways">Main takeaways</h2>
<p>Reinforcement Learning is very interesting, both in theory and application, and it seems to be pretty powerful in the sense that I can see there being problems that it can solve that would be very hard if not impossible to code solutions for. I’ve been way more negative in the above than positive but I’m pretty excited by this stuff! It’s pretty remarkable that these methods work and you get solutions that seem somehow natural.</p>
<p>However it’s also been a little annoying too. This is partly because the initial learning curve seemed to be a lot steeper than I expected, and also because my expectations in general were way higher than they should have been. I didn’t try, but I suspect in the time it took me to obtain a solution to the lunar-lander environment through training I could have easily coded a programmatic solution, and with way less of me hitting my head off a wall. <em>I might be very wrong about this though</em>. I think this is the major con: whenever you set out to solve a problem using rl you’re kind of making a bet that it’s going to be easy enough to do so. There are a lot of unknowns and they don’t seem to be predictable in any particular way. Maybe it just happens to be the case that, given the way the reward environment has to be shaped, you’re very unlikely to ever have the model get to the point where it’s making progress. It’s also just hard to ascertain progress. My feeling is that if you can write a programmatic solution to the problem then you should probably always prioritise doing so. RL becomes super interesting when we find ourselves in the domain of problems that cannot be solved by hand.</p>
<p>The most enjoyable aspect of the whole thing so far has definitely been watching the way that once the lunar lander has hovered down to the final 5 pixels it then drops onto the moon’s surface in a weirdly human way. As if there’s actually some guy inside who just flips the engine-off switch to bring the vehicle down to ground*.</p>
<p><sub> * I have no idea how to land aircraft</sub></p>What to expect when starting out learning rl