A pursuit of truth: a collection of random thoughts and mathematical endeavours in my pursuit of the "truth", which most likely will never end.

A quick introduction to derivatives for machine learning people (2018-02-09)

<p>If you’re like me you probably have used derivatives for a huge part of your
life and learned a few rules on how they work and behave without actually
understanding where it all comes from. As kids we learn some of these rules
early on like the power rule for example in which we know that the derivative of
<script type="math/tex">x^2</script> is <script type="math/tex">2x</script> which in a more general form turns to <script type="math/tex">\frac{dx^a}{dx}=ax^{a-1}</script>.
This is in principle fine since all rules can be readily memorized and looked up
in a table. The downside of that is of course that you’re using a system and a
formalism that you fundamentally do not understand. Again not necessarily an
issue if you are not developing machine learning frameworks yourself on a daily
basis but nevertheless it’s really nice to know what’s going on behind the
scenes. I myself despise black boxes. So, in order to dig a little bit deeper
into that, I’ll show you what it’s all based on. To do that we have to define
what a derivative is supposed to do for you. Do you know? I’m sure you do, but
just in case you don’t:</p>
<blockquote>
<p>A derivative is a continuous description of how a function changes with small
changes in one or multiple variables.</p>
</blockquote>
<p>We’re going to look into many aspects of that statement. For example</p>
<ul>
<li>What does small mean?</li>
<li>What does change mean?</li>
<li>Why is it continuous?</li>
<li>How is this useful?</li>
</ul>
<p>Let’s get to it!</p>
<h2 id="the-total-and-the-partial-derivative">The total and the partial derivative</h2>
<p>These terms are typically a source of confusion, as they are sometimes seen as equivalent and in many cases seem indistinguishable from one another. They are, however, not! Let’s start by defining the partial derivative and then move on to the total derivative from there. For this purpose I will use an imaginary function <script type="math/tex">f\left(t, x, y\right)</script> where we have three variables <script type="math/tex">t</script>, <script type="math/tex">x</script>, and <script type="math/tex">y</script>. The partial derivative answers the question of how <script type="math/tex">f</script> changes (<script type="math/tex">\partial f</script>) when <strong>one</strong> variable changes by a small amount (<script type="math/tex">\partial x</script>). In this setting all other variables are assumed to be held constant. Thus the partial derivative is denoted <script type="math/tex">\frac{\partial f}{\partial x}</script>. In order to show what happens when we do this operation we first need to define <script type="math/tex">f</script> as something. Let’s say it looks like this: <script type="math/tex">f(t,x,y)=\frac{4\pi}{3}txy</script>, which incidentally is the volume of an ellipsoid. Well, perhaps not so incidentally. Either way, I have chosen a different parametrization than is commonly used. In the picture below you can see, from the top, then left to right, a sphere, a spheroid, and an ellipsoid respectively. In our setting we can choose <script type="math/tex">t=a, x=b, y=c</script> for the dimensions.</p>
<p><img src="/images/Ellipsoide.svg" alt="By Ag2gaeh - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45585493" /></p>
<p>The partial derivative of the volume of these geometrical spaces then becomes</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial x}=\frac{4\pi}{3}ty</script>
<p>where we have applied the power rule. As you can see, <script type="math/tex">t</script> and <script type="math/tex">y</script> were not touched since we assumed them to be fixed. Thus in the picture above we model what happens to the volume as <script type="math/tex">b</script> extends or shortens by a small amount. This answers our question, provided the other variables really are independent of <script type="math/tex">x</script>. But what if they are not? Well, in that case we need the total derivative of <script type="math/tex">f</script> with respect to <script type="math/tex">x</script>, which is denoted by <script type="math/tex">\frac{df}{dx}</script> and is defined like this</p>
<script type="math/tex; mode=display">\frac{df}{dx}=\frac{\partial f}{\partial t}\frac{dt}{dx}+\frac{\partial f}{\partial x}\frac{dx}{dx}+\frac{\partial f}{\partial y}\frac{dy}{dx} = \frac{\partial f}{\partial t}\frac{dt}{dx}+\frac{\partial f}{\partial x}+\frac{\partial f}{\partial y}\frac{dy}{dx}</script>
<p>where you can see the partial derivative as a part of the total one. So for illustrative purposes let’s constrain the function to a situation where <script type="math/tex">t=x</script>. What happens with the derivative then? Well, the partial derivative from before stays the same, but we need to calculate the two other terms. The first part becomes <script type="math/tex">\frac{\partial f}{\partial t}\frac{dt}{dx}=\frac{4\pi}{3}xy \cdot 1</script> while the last part turns to <script type="math/tex">\frac{\partial f}{\partial y}\frac{dy}{dx}=\frac{4\pi}{3}tx \cdot 0=0</script>. Thus we now get</p>
<script type="math/tex; mode=display">\frac{df}{dx}=\frac{4\pi}{3}xy+\frac{4\pi}{3}ty+0=\frac{4\pi}{3}(t+x)y=\frac{4\pi}{3}2xy</script>
<p>by adding the terms and substituting <script type="math/tex">t=x</script> in the last step. Now hopefully it’s apparent that <script type="math/tex">\frac{\partial f}{\partial x}\neq\frac{df}{dx}</script> and that you need to be careful before you state independence between your variables while doing your derivatives.</p>
<p>WAIT! I hear you cry, couldn’t we just do the substitution after we calculated the partial derivative? Indeed you could, and you would get something that’s off by a factor of 2, which can be substantial. Basically you would get the following inconsistency</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial x}=\frac{4\pi}{3}ty=\frac{4\pi}{3}xy\neq \frac{4\pi}{3}2xy</script>
<p>This is because what we’re usually after is indeed the total derivative and not the partial. However, you could of course have done the substitution <strong>before</strong> you calculated the partial derivative. This would turn out nicely like</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial x}=\frac{\partial }{\partial x} \frac{4\pi}{3}txy=\frac{\partial }{\partial x} \frac{4\pi}{3}x^2y=\frac{4\pi}{3}2xy</script>
<p>where we again reach consistency. Thus, you cannot plug in dependencies in your partial derivative <strong>after</strong> it’s been calculated!</p>
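<p>If you want to convince yourself of this numerically, here’s a small sketch (my own example, not part of the original derivation) that uses central finite differences on <script type="math/tex">f(t,x,y)=\frac{4\pi}{3}txy</script> to compare the two derivatives at a concrete point:</p>

```python
# A finite-difference check of the partial vs. total derivative under t = x.
import math

C = 4 * math.pi / 3
f = lambda t, x, y: C * t * x * y   # volume of the ellipsoid

x0, y0 = 2.0, 3.0
h = 1e-6

# Partial derivative: wiggle x only, keep t fixed at x0 and y fixed at y0.
partial = (f(x0, x0 + h, y0) - f(x0, x0 - h, y0)) / (2 * h)

# Total derivative under t = x: wiggle x everywhere it appears.
total = (f(x0 + h, x0 + h, y0) - f(x0 - h, x0 - h, y0)) / (2 * h)

print(round(partial / (C * x0 * y0), 6))  # 1.0, i.e. (4*pi/3)*t*y with t = x
print(round(total / (C * x0 * y0), 6))    # 2.0, i.e. (4*pi/3)*2*x*y
```

<p>The factor of 2 between the two answers is exactly the discrepancy discussed above.</p>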
<h3 id="interpretation-as-a-differential">Interpretation as a differential</h3>
<p>Let’s return to the definition of the total derivative for a while. Remember that it looked like this</p>
<script type="math/tex; mode=display">\frac{df}{dx} = \frac{\partial f}{\partial t}\frac{dt}{dx}+\frac{\partial f}{\partial x}+\frac{\partial f}{\partial y}\frac{dy}{dx}</script>
<p>for a function <script type="math/tex">f(t,x,y)</script> with three variables. Now, if we multiply this by <script type="math/tex">dx</script> everywhere we end up with</p>
<script type="math/tex; mode=display">df = \frac{\partial f}{\partial t}dt+\frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy</script>
<p>which is an expression of a differential view on the function <script type="math/tex">f</script>. It states that a very small change in <script type="math/tex">f</script> can be described as a weighted sum of the small changes in its variables, where the weights are the partial derivatives of the function with respect to those same variables. We can state this in general for a function <script type="math/tex">q</script> with <script type="math/tex">M</script> variables like this</p>
<script type="math/tex; mode=display">dq=\sum_{i=1}^M\frac{\partial q}{\partial x_i}dx_i</script>
<p>which is a much more compact and pleasant way of looking at it. Writing terms out explicitly quickly becomes tedious. On the flip side of this we also get a compact way of representing our total derivative definition, again sticking to the function <script type="math/tex">q</script> with its <script type="math/tex">M</script> variables.</p>
<script type="math/tex; mode=display">\frac{dq}{dx_p}=\sum_{i=1}^M\frac{\partial q}{\partial x_i}\frac{dx_i}{dx_p}\delta_{ip}+\frac{\partial q}{\partial x_p}</script>
<p>The <script type="math/tex">\delta_{ip}</script> is defined to be 1 everywhere except where <script type="math/tex">i=p</script>, in which case we define it to be <script type="math/tex">0</script>. I know that’s not very traditional, but it works, so I will use the delta function in this way. I do this because</p>
<script type="math/tex; mode=display">\frac{dq}{dx_p}=\sum_{i=1}^M\frac{\partial q}{\partial x_i}\frac{dx_i}{dx_p}</script>
<p>while being correct, doesn’t put the focus on the partial derivative of the variable of interest <script type="math/tex">x_p</script> but this is really a matter of taste and not at all important for the usage.</p>
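<p>As a quick numeric sanity check (my own sketch, not from the post): for small steps, this weighted sum of changes really does approximate the actual change in the function. Here I reuse the ellipsoid volume from earlier:</p>

```python
# First-order approximation: dq ≈ sum_i (∂q/∂x_i) dx_i for small dx_i.
import math

C = 4 * math.pi / 3
q = lambda t, x, y: C * t * x * y

t, x, y = 1.0, 2.0, 3.0
dt, dx, dy = 1e-6, -2e-6, 3e-6   # arbitrary small changes

# Partial derivatives of q evaluated at (t, x, y).
dq_dt, dq_dx, dq_dy = C * x * y, C * t * y, C * t * x

linear = dq_dt * dt + dq_dx * dx + dq_dy * dy       # the differential
actual = q(t + dt, x + dx, y + dy) - q(t, x, y)     # the true change

print(abs(actual - linear) < 1e-9)  # True: they agree to first order
```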
<h2 id="the-chain-rule-of-calculus">The chain rule of calculus</h2>
<p>One of the perhaps most common rules to use when calculating analytical derivatives is the chain rule. Mathematically it basically states the following</p>
<script type="math/tex; mode=display">\frac{df(g(x))}{dx}=\frac{\partial f(g(x))}{\partial g(x)}\frac{dg(x)}{dx}</script>
<p>which doesn’t look impressive, but don’t let its simplicity fool you. It’s a workhorse without equal in the analytical world of gradients. Remember, <script type="math/tex">g(x)</script> could be anything in this setting. So could <script type="math/tex">x</script> for that matter. As such this rule is applicable to everything relating to gradients.</p>
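<p>A minimal numerical illustration (my own example) of the chain rule in action: for <script type="math/tex">f(g(x))=\sin(x^2)</script> the rule gives <script type="math/tex">\cos(x^2)\cdot 2x</script>, which we can verify against a finite difference:</p>

```python
# Chain rule check: d/dx sin(x**2) = cos(x**2) * 2x.
import math

def numeric_derivative(fn, x, h=1e-6):
    return (fn(x + h) - fn(x - h)) / (2 * h)   # central difference

x = 0.7
f_of_g = lambda x: math.sin(x ** 2)            # f(g(x)) with g(x) = x**2
chain_rule = math.cos(x ** 2) * 2 * x          # (df/dg) * (dg/dx)

print(abs(numeric_derivative(f_of_g, x) - chain_rule) < 1e-6)  # True
```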
<h3 id="the-chain-rule-of-probability">The chain rule of probability</h3>
<p>A small note here regarding naming. The “chain rule” actually exists in probability as well under the name of “The chain rule of probability” or “The general product rule”. I find the latter more natural. In any case that rule states the following</p>
<script type="math/tex; mode=display">p(x,y)=p(x|y)p(y)</script>
<p>where <script type="math/tex">p</script> is the probability function for events <script type="math/tex">x</script> and <script type="math/tex">y</script>. This rule can be further generalized into <script type="math/tex">n</script> variables by iterating this rule. See the following example:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
p(x_1, x_2, x_3, x_4) & = & p(x_1|x_2,x_3,x_4)p(x_2,x_3,x_4) = \\
& = & p(x_1|x_2,x_3,x_4)p(x_2|x_3,x_4)p(x_3,x_4) = \\
& = & p(x_1|x_2,x_3,x_4)p(x_2|x_3,x_4)p(x_3|x_4)p(x_4)
\end{eqnarray} %]]></script>
<p>You might be forgiven for believing that the order of the variables somehow matters when applying this rule, but of course it doesn’t, since all we’re doing is slicing up the joint probability into conditional factors. So in a more compact format we can express this general rule like this</p>
<script type="math/tex; mode=display">p\left(\bigcap_{k=1}^{n} x_k\right)=\prod_{k=1}^{n}p\left(x_k \bigg| \bigcap_{j=1}^{k-1}x_j\right)</script>
<p>where we have used <script type="math/tex">n</script> general variables representing our probability landscape. Now to the reason why I brought this up.</p>
<blockquote>
<p>The chain rule of probability has nothing to do with the chain rule of calculus.</p>
</blockquote>
<p>So remember to always think of the context if you hear someone name dropping the “chain rule”, since without context it’s quite ambiguous.</p>
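<p>To see the factorization in action, here is a toy check (the joint table is entirely made up) that iterating the general product rule over three binary variables recovers the joint exactly:</p>

```python
# Verify p(x1,x2,x3) = p(x1|x2,x3) p(x2|x3) p(x3) on a small joint table.
import itertools

outcomes = list(itertools.product([0, 1], repeat=3))
joint = dict(zip(outcomes, [.05, .10, .15, .20, .05, .15, .10, .20]))

def marg(**fixed):
    """Marginal probability of the named coordinates x1, x2, x3."""
    idx = {'x1': 0, 'x2': 1, 'x3': 2}
    return sum(p for k, p in joint.items()
               if all(k[idx[n]] == v for n, v in fixed.items()))

for x1, x2, x3 in outcomes:
    chain = (marg(x1=x1, x2=x2, x3=x3) / marg(x2=x2, x3=x3)  # p(x1|x2,x3)
             * marg(x2=x2, x3=x3) / marg(x3=x3)              # p(x2|x3)
             * marg(x3=x3))                                  # p(x3)
    assert abs(chain - joint[(x1, x2, x3)]) < 1e-12

print("product of conditionals equals the joint")
```

<p>Factoring out the variables in a different order gives the same result, which is the point about order independence made above.</p>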
<h2 id="building-your-own-backpropagation-engine-for-deep-neural-networks">Building your own backpropagation engine for deep neural networks</h2>
<p>In this section I’ll take you through a simple multi-layered perceptron and a derivation of the backpropagation algorithm. There are many ways to derive this but I’ll start from the error minimization approach which basically describes the fit of a neural network <script type="math/tex">f(\pmb x, \pmb\theta)</script> by the deviance from a known target <script type="math/tex">y</script>. The architecture we will solve for is shown in the image below where we have two hidden layers. We stick to this for simplicity. We’ll also only use one output instead of multiple but it’s readily generalizable.</p>
<p><img src="/images/annsimple.png" alt="Illustration of a simple neural network with two layers." /></p>
<p>Instead of representing our network graphically we’ll do a more formal representation here where the functional form will be stated mathematically. Basically the functional form will be</p>
<script type="math/tex; mode=display">f(\pmb x_t, \pmb \theta)=\sum_{k=1}^K\tilde\theta_k \left(\varphi\left(\sum_{j=1}^J\hat\theta_{j,k}\varphi\left(\sum_{i=1}^I\theta_{i,j}x_{t,i}\right)\right)\right)</script>
<p>where the boldface symbols denote vectors. The <script type="math/tex">\varphi(s)=\frac{1}{1+\exp(-a s)}</script> function is the sigmoid activation function with a hyperparameter <script type="math/tex">a</script>, which we will not tune or care about in this introduction. A small note here: disregarding the <script type="math/tex">a</script> parameter is really a simplification, since it fundamentally changes the learning dynamics of this network. The only reason I allow myself to do it is that covering it is beyond the scope of this post.</p>
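<p>The functional form above translates almost line for line into code. Below is a minimal sketch of the forward pass, where the layer sizes, the random weights, and <script type="math/tex">a=1</script> are all arbitrary choices of mine, just to make it runnable:</p>

```python
# Forward pass of the two-hidden-layer network stated above.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, a = 4, 5, 3, 1.0            # made-up layer sizes and sigmoid slope

theta  = rng.normal(size=(I, J))     # theta_{i,j}: input -> first hidden layer
theta2 = rng.normal(size=(J, K))     # hat-theta_{j,k}: first -> second hidden layer
theta3 = rng.normal(size=K)          # tilde-theta_k: second hidden layer -> output

def phi(s):
    return 1.0 / (1.0 + np.exp(-a * s))   # sigmoid activation

def f(x):
    h1 = phi(x @ theta)              # phi(sum_i theta_{i,j} x_{t,i})
    h2 = phi(h1 @ theta2)            # phi(sum_j hat-theta_{j,k} h1_j)
    return h2 @ theta3               # sum_k tilde-theta_k h2_k

x = rng.normal(size=I)
print(f(x))                          # a single scalar output
```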
<p>In order to train a neural network we need to update the parameters according to how much they affect the error we see. This error can be defined like this for a regression like problem for a data point <script type="math/tex">(\pmb x_t, y_t)</script>.</p>
<script type="math/tex; mode=display">E(\pmb x_t,\pmb\theta, y_t)=\frac{1}{2}\left(f(\pmb x_t, \pmb\theta)-y_t\right)^2</script>
<p>If we look at the second last layer then we simply update the parameters according to the following rule</p>
<script type="math/tex; mode=display">\hat\theta_{j,k}^{new} = \hat\theta_{j,k}^{old} - \eta\frac{\partial E(\pmb x_t, \pmb\theta, y_t)}{\partial \hat\theta_{j,k}} \forall j,k</script>
<p>for each new data point. This is called Stochastic Gradient Descent (<a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>). You can read a lot about that in many places so I won’t dive into it here. Suffice it to say that this process can be repeated for each parameter in each layer. So the infamous <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation</a> algorithm is just an application of updating your parameters by the partial derivative of the error with respect to that very parameter. Do the partial derivatives for yourself now and see how easily you can derive them. A small trick you can use is to realize that <script type="math/tex">\varphi '(s)=\varphi(s)(1-\varphi(s))</script> where I’ve used the prime notation for a derivative. There’s a nice tutorial on how to do this numerically <a href="https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/">here</a>.</p>
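<p>To check your hand-derived gradients, here is a sketch (all sizes and data are invented, with <script type="math/tex">a=1</script>) of the error gradient for the two outer layers, validated with a central finite difference, followed by one SGD step:</p>

```python
# Gradients of E = 0.5*(f - y)**2 for the output and second-to-last layer.
import numpy as np

rng = np.random.default_rng(1)
J, K = 4, 3
theta2 = rng.normal(size=(J, K))   # hat-theta: hidden layer 1 -> hidden layer 2
theta3 = rng.normal(size=K)        # tilde-theta: hidden layer 2 -> output

phi = lambda s: 1.0 / (1.0 + np.exp(-s))   # sigmoid with a = 1

h1 = rng.normal(size=J)   # pretend output of the first hidden layer
y, eta = 0.5, 0.1         # target and learning rate

def error(t2, t3):
    h2 = phi(h1 @ t2)
    return 0.5 * (h2 @ t3 - y) ** 2

h2 = phi(h1 @ theta2)
out = h2 @ theta3

# dE/d tilde-theta_k = (f - y) * h2_k
grad3 = (out - y) * h2
# dE/d hat-theta_{j,k} = (f - y) * tilde-theta_k * h2_k (1 - h2_k) * h1_j  (sigmoid trick)
grad2 = np.outer(h1, (out - y) * theta3 * h2 * (1 - h2))

# Central finite-difference check of one entry of grad2.
eps = 1e-6
up, dn = theta2.copy(), theta2.copy()
up[0, 0] += eps; dn[0, 0] -= eps
numeric = (error(up, theta3) - error(dn, theta3)) / (2 * eps)
print(abs(numeric - grad2[0, 0]) < 1e-7)  # True: the analytical gradient checks out

# One SGD update, as in the rule above.
theta3 -= eta * grad3
theta2 -= eta * grad2
```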
<h2 id="take-home-messages">Take home messages</h2>
<ul>
<li>The total derivative and the partial derivative are related but at times fundamentally different.</li>
<li>All constraints and variable substitutions have to be done <strong>before</strong> calculating the partial derivative.</li>
<li>The partial derivative ignores implicit dependencies.</li>
<li>The total derivative takes all dependencies into account.</li>
<li>Many magic recipes, like the backpropagation algorithm, usually come from quite simple ideas, and deriving them yourself is really instructive and useful.</li>
</ul>Dr. Michael Green

The importance of context (2018-02-01)

<p>When we do modeling it’s of utmost importance that we pay attention to context. Without context there is little that can be inferred.</p>
<p>Let’s create a correlated dummy dataset that will allow me to highlight my point. In this case we’ll just sample our data from a two dimensional multivariate gaussian distribution specified by the mean vector <script type="math/tex">\mu_X</script> and covariance matrix <script type="math/tex">\Sigma_X</script>. We will also create a response variable <script type="math/tex">y</script> which is defined like</p>
<script type="math/tex; mode=display">y_t\sim N(\mu_{y,t}, \sigma_y)</script>
<script type="math/tex; mode=display">\mu_{y,t}=1x_1+1x_2+1x_1 x_2+5</script>
<script type="math/tex; mode=display">\sigma_y=20</script>
<p>where <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> are realized samples from the two-dimensional multivariate Gaussian distribution above. The covariance matrix looks like this</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">X1</th>
<th style="text-align: right">X2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">X1</td>
<td style="text-align: right">3.0</td>
<td style="text-align: right">2.5</td>
</tr>
<tr>
<td style="text-align: left">X2</td>
<td style="text-align: right">2.5</td>
<td style="text-align: right">3.0</td>
</tr>
</tbody>
</table>
<p>where the correlation between our variables is obvious. So let’s plot the response against each variable and have a look. As you can see, it’s quite apparent that the variables are rather similar.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(MASS)       # mvrnorm
library(tidyverse)  # tibble, dplyr, tidyr, ggplot2

Sigma <- matrix(c(3, 2.5, 2.5, 3), 2, 2)
mydf <- as_tibble(data.frame(mvrnorm(500, c(10, 10), Sigma))) %>%
  mutate(y = 1 * X1 + 1 * X2 + 1 * X1 * X2 + 5 + rnorm(length(X1), 0, 20))
gather(mydf, variable, value, -y) %>%
  ggplot(aes(y = y, x = value, color = variable)) +
  geom_point() + geom_smooth() + xlab("Variable value") + ylab("y") +
  facet_grid(. ~ variable)
</code></pre></div></div>
<p><img src="/images/figure/dataplotforvariables-1.png" alt="plot of chunk dataplotforvariables" /></p>
<p>What would you expect us to get from it if we fit a simple model? We have generated 500 observations and we are estimating 4 coefficients. Should be fine, right? Well, it turns out it’s not fine at all. Remember that we defined our coefficients to be 1, both for the independent effects and for the interaction effect between <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script>. The intercept is set to <script type="math/tex">5</script>. In other words, we actually have point parameters behind the data-generating model. This is an assumption that in most modeling situations would be crazy, but we use it here to highlight a point. Let’s make a linear regression model with the interaction effect present.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mylm <- lm(y~X1+X2+X1:X2, data=mydf)
</code></pre></div></div>
<p>In R you specify interaction effects with a colon, as in “X1:X2”, which might look a bit weird, but just accept it for now. It could have been written in other ways but I like to be explicit. Now that we have this model we can investigate what it says about the unknown parameters we estimated.</p>
<table style="border-collapse:collapse; border:none;border-bottom:double;">
<tr>
<td style="padding:0.2cm; border-top:double;"> </td>
<td style="border-bottom:1px solid; padding-left:0.5em; padding-right:0.5em; border-top:double;"> </td>
<td style="padding:0.2cm; text-align:center; border-bottom:1px solid; border-top:double;" colspan="4">y</td>
</tr>
<tr>
<td style="padding:0.2cm; font-style:italic;"> </td>
<td style="padding-left:0.5em; padding-right:0.5em; font-style:italic;"> </td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">B</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">CI</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">std. Error</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">p</td>
</tr>
<tr>
<td style="padding:0.2cm; border-top:1px solid; text-align:left;">(Intercept)</td>
<td style="padding-left:0.5em; padding-right:0.5em; border-top:1px solid; "> </td>
<td style="padding:0.2cm; text-align:center; border-top:1px solid; ">14.88</td>
<td style="padding:0.2cm; text-align:center; border-top:1px solid; ">-27.29 – 57.05</td>
<td style="padding:0.2cm; text-align:center; border-top:1px solid; ">21.46</td>
<td style="padding:0.2cm; text-align:center; border-top:1px solid; ">.489</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">X1</td>
<td style="padding-left:0.5em; padding-right:0.5em;"> </td>
<td style="padding:0.2cm; text-align:center; ">0.06</td>
<td style="padding:0.2cm; text-align:center; ">-4.56 – 4.68</td>
<td style="padding:0.2cm; text-align:center; ">2.35</td>
<td style="padding:0.2cm; text-align:center; ">.981</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">X2</td>
<td style="padding-left:0.5em; padding-right:0.5em;"> </td>
<td style="padding:0.2cm; text-align:center; ">0.13</td>
<td style="padding:0.2cm; text-align:center; ">-4.47 – 4.74</td>
<td style="padding:0.2cm; text-align:center; ">2.34</td>
<td style="padding:0.2cm; text-align:center; ">.955</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">X1:X2</td>
<td style="padding-left:0.5em; padding-right:0.5em;"> </td>
<td style="padding:0.2cm; text-align:center; ">1.09</td>
<td style="padding:0.2cm; text-align:center; ">0.66 – 1.52</td>
<td style="padding:0.2cm; text-align:center; ">0.22</td>
<td style="padding:0.2cm; text-align:center; "><.001</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;">Observations</td>
<td style="padding-left:0.5em; padding-right:0.5em; border-top:1px solid;"> </td><td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:center; border-top:1px solid;" colspan="4">500</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / adj. R<sup>2</sup></td>
<td style="padding-left:0.5em; padding-right:0.5em;"> </td><td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="4">.783 / .781</td>
</tr>
</table>
<p>A quick look at the table reveals a number of pathologies. If we look at the intercept we can see that it’s 198 per cent off. For the <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> variables we’re -94 and -87 per cent off respectively. The interaction effect ends up being 9 percent off target which is not much. All in all though, we’re significantly off the target. This is not surprising though. In fact, I would have been surprised had we succeeded. So what’s the problem? Well, the problem is that our basic assumption of independence between variables quite frankly does not hold. The reason why it doesn’t hold is because the generated data is indeed correlated. Remember our covariance matrix in the two dimensional multivariate gaussian.</p>
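<p>The percentages quoted above are simply the relative errors of the lm() estimates against the true generating values; a few lines reproduce them:</p>

```python
# Relative error of each estimated coefficient vs. its true value.
true = {'intercept': 5, 'X1': 1, 'X2': 1, 'X1:X2': 1}
est  = {'intercept': 14.88, 'X1': 0.06, 'X2': 0.13, 'X1:X2': 1.09}

for name in true:
    off = 100 * (est[name] - true[name]) / true[name]
    print(f"{name}: {off:+.0f}% off")
# intercept: +198% off, X1: -94% off, X2: -87% off, X1:X2: +9% off
```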
<p>Let’s try to fix our analysis. In this setting we need to introduce context, and the easiest, most natural way to do that is with priors. For this we cannot use our old trusted friend “lm” in R but must resort to a Bayesian framework. <a href="http://mc-stan.org">Stan</a> makes that very simple. This implementation of our model is not very elegant, but it will neatly show you how easily you can define models in this language. We simply specify our data, parameters, and model, and set the priors in the model part. Notice that we don’t put priors on everything. For instance, I might know that a value around 1 is reasonable for our main and interaction effects, but I have no idea of where the intercept should be. In this case I will simply be completely ignorant and not inject any knowledge into the model about the intercept, because I fundamentally believe I don’t have any. That’s why <script type="math/tex">\beta_0</script> does not appear in the model section.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="n">int</span><span class="p"><</span><span class="n">lower</span><span class="p">=</span><span class="m">0</span><span class="p">></span> <span class="n">N</span><span class="p">;</span>
<span class="k">real</span> <span class="n">y</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">x1</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">x2</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">parameters</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">b0</span><span class="p">;</span>
<span class="k">real</span> <span class="n">b1</span><span class="p">;</span>
<span class="k">real</span> <span class="n">b2</span><span class="p">;</span>
<span class="k">real</span> <span class="n">b3</span><span class="p">;</span>
<span class="k">real</span><span class="p"><</span><span class="n">lower</span><span class="p">=</span><span class="m">1</span><span class="p">></span> <span class="n">sigma</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">model</span> <span class="p">{</span>
<span class="n">b1</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">0.5</span><span class="p">);</span>
<span class="n">b2</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">0.5</span><span class="p">);</span>
<span class="n">b3</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">0.5</span><span class="p">);</span>
<span class="n">for</span><span class="p">(</span><span class="n">i</span> <span class="k">in</span> <span class="m">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">b0</span><span class="p">+</span><span class="n">b1</span><span class="p">*</span><span class="n">x1</span><span class="p">[</span><span class="n">i</span><span class="p">]+</span><span class="n">b2</span><span class="p">*</span><span class="n">x2</span><span class="p">[</span><span class="n">i</span><span class="p">]+</span><span class="n">b3</span><span class="p">*</span><span class="n">x1</span><span class="p">[</span><span class="n">i</span><span class="p">]*</span><span class="n">x2</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">sigma</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If we go ahead and run this model we get the inference once the MCMC engine is done. The summary of the Bayesian model is shown below, where the coefficients make a lot more sense.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## mean sd 2.5% 97.5%
## b0 6.2015899 4.50797603 -2.399446e+00 15.161352
## b1 0.9823694 0.48849017 5.569159e-02 1.949283
## b2 0.9775408 0.47906798 6.376570e-02 1.912414
## b3 1.0014422 0.04332279 9.151089e-01 1.084912
## sigma 20.0924148 0.62945656 1.890865e+01 21.342656
## lp__ -1747.7225224 1.64308492 -1.751779e+03 -1745.557799
</code></pre></div></div>
<p>If we look at the distributions for our parameters we can see that, given the right context, we capture the essence of our model; moreover, we also see how much support the data gives to the different possible values. We select 80 percent intervals here to illustrate the width of each distribution and where its mass lies.</p>
<p><img src="/images/figure/histogramsex1-1.png" alt="plot of chunk histogramsex1" /></p>
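<p>The interval selection itself is simple to reproduce outside of Stan. A minimal Python/NumPy sketch with made-up posterior draws (not the actual MCMC samples behind the histograms above):</p>

```python
import numpy as np

# Hypothetical posterior draws for a single coefficient, standing in
# for the MCMC samples that produced the histograms above.
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=0.5, size=4000)

# An 80% central interval keeps the middle 80% of the posterior mass:
# the 10th and 90th percentiles of the draws.
lo, hi = np.percentile(draws, [10, 90])
width = hi - lo
```

<p>The width of this interval is exactly what the histograms visualize: a wide interval means the data leaves many plausible values on the table.</p>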
<p>Notice here that we land in the right area and avoid the crazy results we got from our earlier regression. This is because of our knowledge (context) of the problem. Armed with that knowledge, the model correctly realizes that there are many possible values for the intercept, and the width of that distribution is a testament to that. Further, there’s some uncertainty about the main effects in the model, while the interaction effect is really nailed down and our estimate there is hardly uncertain at all.</p>Dr. Michael GreenDeep Neural Networks in Julia - Love at first sight?2018-01-10T00:00:00+00:00/2018/01/10/Deep-learning-in-julia<p>I love new initiatives that try to do something fresh and innovative. The
relatively new language <a href="https://julialang.org/">Julia</a> is one of my favorite
languages. It features a lot of good stuff in addition to being targeted towards
computational people like me. I won’t bore you with the details of the language
itself but suffice it to say that we finally have a general purpose language
where you don’t have to compromise expressiveness with efficiency.</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>When reading this it helps if you have a basic understanding of neural
networks and their mathematical properties. Math-wise, basic linear algebra
will do for the majority of the post.</p>
<h2 id="short-introductory-example---boston-housing">Short introductory example - Boston Housing</h2>
<p>Instead of writing on and on about how cool this new language is, I will just
show you how quickly you can get a simple neural network up and running. The
first example uses the
<a href="http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html">BostonHousing</a>
dataset, which is baked into the deep learning library Knet. So let’s start by
fetching the data.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">Knet</span><span class="x">;</span>
<span class="n">include</span><span class="x">(</span><span class="n">Knet</span><span class="o">.</span><span class="n">dir</span><span class="x">(</span><span class="s">"data"</span><span class="x">,</span><span class="s">"housing.jl"</span><span class="x">));</span>
<span class="n">x</span><span class="x">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">housing</span><span class="x">();</span>
</code></pre></div></div>
<p>Now that we have the data we also need to define the basic functions that
make up our network. We start with the predict function, which takes
<script type="math/tex">\omega</script> and <script type="math/tex">x</script> as input. Here <script type="math/tex">\omega</script> holds our parameters: a
two-element array containing the weights in the first element and the bias in
the second. The <script type="math/tex">x</script> contains the dataset, which in our case is a matrix of
size 13x506, i.e., 13 covariates and 506 observations.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">ω</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">*</span> <span class="n">x</span> <span class="o">.+</span> <span class="n">ω</span><span class="x">[</span><span class="mi">2</span><span class="x">];</span>
<span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">abs2</span><span class="x">,</span> <span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span><span class="o">-</span><span class="n">y</span><span class="x">);</span>
<span class="n">lossgradient</span> <span class="o">=</span> <span class="n">grad</span><span class="x">(</span><span class="n">loss</span><span class="x">);</span>
<span class="k">function</span><span class="nf"> train</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">data</span><span class="x">;</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="x">)</span>
<span class="k">for</span> <span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span> <span class="k">in</span> <span class="n">data</span>
<span class="n">dω</span> <span class="o">=</span> <span class="n">lossgradient</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">ω</span><span class="x">)</span>
<span class="n">ω</span><span class="x">[</span><span class="n">i</span><span class="x">]</span> <span class="o">-=</span> <span class="n">dω</span><span class="x">[</span><span class="n">i</span><span class="x">]</span><span class="o">*</span><span class="n">lr</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">ω</span>
<span class="k">end</span><span class="x">;</span>
</code></pre></div></div>
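<p>For readers more at home in Python, the same train-by-gradient-descent idea can be sketched with NumPy, writing the MSE gradient out by hand since we don’t have AutoGrad there. The data below is synthetic, standing in for the housing matrix:</p>

```python
import numpy as np

# Synthetic stand-in for the housing data: 13 covariates, 100 observations.
rng = np.random.default_rng(1)
x = rng.normal(size=(13, 100))
true_w = rng.normal(size=(1, 13))
y = true_w @ x + 0.1 * rng.normal(size=(1, 100))

# Same [weights, bias] parameterization as the Julia version.
w = 0.1 * rng.normal(size=(1, 13))
b = 0.0
lr = 0.01
n = x.shape[1]

loss0 = np.mean((w @ x + b - y) ** 2)
for _ in range(200):
    err = w @ x + b - y                 # residuals
    w -= lr * (2.0 / n) * err @ x.T     # analytic MSE gradient (AutoGrad derives this for us in Julia)
    b -= lr * (2.0 / n) * err.sum()
loss1 = np.mean((w @ x + b - y) ** 2)
```

<p>After 200 steps the loss has dropped by orders of magnitude, just as the Knet loop below will show on the real data.</p>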
<p>Let’s have a look at the first 5 variables of the data set and their relation
to the response that we would like to predict. The y axis in the plots is the
response variable and the x axis is the respective variable’s values. As you can
see there are some correlations which seem to indicate some kind of relation,
though this is not definitive proof of one!</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">Plots</span><span class="x">;</span>
<span class="n">using</span> <span class="n">StatPlots</span><span class="x">;</span>
<span class="n">include</span><span class="x">(</span><span class="n">Knet</span><span class="o">.</span><span class="n">dir</span><span class="x">(</span><span class="s">"data"</span><span class="x">,</span><span class="s">"housing.jl"</span><span class="x">));</span>
<span class="n">x</span><span class="x">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">housing</span><span class="x">();</span>
<span class="c">#plotly();</span>
<span class="n">gr</span><span class="x">();</span>
<span class="n">scatter</span><span class="x">(</span><span class="n">x</span><span class="err">'</span><span class="x">,</span> <span class="n">y</span><span class="x">[</span><span class="mi">1</span><span class="x">,:],</span> <span class="n">layout</span><span class="o">=</span><span class="x">(</span><span class="mi">3</span><span class="x">,</span><span class="mi">5</span><span class="x">),</span> <span class="n">reg</span><span class="o">=</span><span class="n">true</span><span class="x">,</span> <span class="n">size</span><span class="o">=</span><span class="x">(</span><span class="mi">950</span><span class="x">,</span><span class="mi">500</span><span class="x">))</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_3_1.svg" alt="" /></p>
<p>Here’s the training part of the script, where we define and train a perceptron, i.e., a linear neural network, on the Boston Housing dataset. We track the error every 10th epoch and record it in our DataFrame.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">DataFrames</span>
<span class="n">ω</span> <span class="o">=</span> <span class="kt">Any</span><span class="x">[</span> <span class="mf">0.1</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">13</span><span class="x">),</span> <span class="mf">0.0</span> <span class="x">];</span>
<span class="n">errdf</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span><span class="n">Epoch</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">20</span><span class="x">,</span> <span class="n">Error</span><span class="o">=</span><span class="mf">0.0</span><span class="x">);</span>
<span class="n">cntr</span> <span class="o">=</span> <span class="mi">1</span><span class="x">;</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">200</span>
<span class="n">train</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="x">[(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)])</span>
<span class="k">if</span> <span class="n">mod</span><span class="x">(</span><span class="n">i</span><span class="x">,</span> <span class="mi">10</span><span class="x">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">println</span><span class="x">(</span><span class="s">"Epoch </span><span class="si">$</span><span class="s">i: </span><span class="si">$</span><span class="s">(round(loss(ω,x,y)))"</span><span class="x">)</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">Epoch</span><span class="x">]</span><span class="o">=</span><span class="n">i</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">Error</span><span class="x">]</span><span class="o">=</span><span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">cntr</span><span class="o">+=</span><span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Epoch 10: 383.0
Epoch 20: 262.0
Epoch 30: 182.0
Epoch 40: 130.0
Epoch 50: 94.0
Epoch 60: 71.0
Epoch 70: 55.0
Epoch 80: 45.0
Epoch 90: 38.0
Epoch 100: 33.0
Epoch 110: 30.0
Epoch 120: 28.0
Epoch 130: 26.0
Epoch 140: 25.0
Epoch 150: 24.0
Epoch 160: 24.0
Epoch 170: 24.0
Epoch 180: 23.0
Epoch 190: 23.0
Epoch 200: 23.0
</code></pre></div></div>
<p>If you inspect the training error per epoch you’ll see that it steadily goes down but flattens out around epoch 150. This is where we reach the minimum error achievable with this small model. The plot on the right-hand side shows the correlation between the predicted house prices and the observed ones. The fit is not bad, but there are definitely outliers.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p1</span> <span class="o">=</span> <span class="n">scatter</span><span class="x">(</span><span class="n">errdf</span><span class="x">[:,:</span><span class="n">Epoch</span><span class="x">],</span> <span class="n">errdf</span><span class="x">[:,:</span><span class="n">Error</span><span class="x">],</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Epoch"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Error"</span><span class="x">)</span>
<span class="n">p2</span> <span class="o">=</span> <span class="n">scatter</span><span class="x">(</span><span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span><span class="err">'</span><span class="x">,</span> <span class="n">y</span><span class="err">'</span><span class="x">,</span> <span class="n">reg</span><span class="o">=</span><span class="n">true</span><span class="x">,</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Predicted"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Observed"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">p1</span><span class="x">,</span> <span class="n">p2</span><span class="x">,</span> <span class="n">layout</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">size</span><span class="o">=</span><span class="x">(</span><span class="mi">950</span><span class="x">,</span><span class="mi">500</span><span class="x">))</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_5_1.svg" alt="" /></p>
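<p>The “fit” in the right-hand panel can be quantified as the Pearson correlation between predictions and observations. A small Python sketch with toy numbers (not the actual Boston run):</p>

```python
import numpy as np

# Toy predicted/observed house prices, standing in for the model output.
pred = np.array([24.1, 21.6, 34.9, 33.2, 36.0, 28.5])
obs  = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two series.
r = np.corrcoef(pred, obs)[0, 1]
```

<p>A value near 1 means the points hug the regression line in the scatter plot; the further below 1 it falls, the more the cloud spreads out.</p>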
<h2 id="increasing-model-complexity">Increasing model complexity</h2>
<p>Since the toy model we made before was basically a simple multiple linear regression model, we will now step into the land of deep learning. Well, we’ll use two hidden layers instead of none. The way to go about this is actually ridiculously simple. Since we’ve written all code so far in raw Julia, except for the grad function which comes from the AutoGrad package, we can readily extend the depth of our network. But before we move on, let’s save the network we trained before.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ω1</span> <span class="o">=</span> <span class="n">ω</span><span class="x">;</span>
</code></pre></div></div>
<p>Now that that’s out of the way the next thing we need to do is to define the weights of our network and thereby our new structure. We will build a neural network with</p>
<ul>
<li>One input layer of size 13</li>
<li>A hidden layer of size 64</li>
<li>Another hidden layer of size 15</li>
<li>A final output layer which will be our prediction</li>
</ul>
<p>This network will have way more parameters than needed to solve the problem, but we’ll add them all just for fun. We’ll return to why this is a horrible idea later. Knowing the overall structure we can now define our new <script type="math/tex">\omega</script>. When you read it, please bear in mind that we use a [weights,bias,weights,bias] structure.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ω</span> <span class="o">=</span> <span class="kt">Any</span><span class="x">[</span><span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">64</span><span class="x">,</span><span class="mi">13</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">64</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span>
<span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">15</span><span class="x">,</span><span class="mi">64</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">15</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span>
<span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">1</span><span class="x">,</span><span class="mi">15</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">)]</span>
</code></pre></div></div>
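<p>To see just how over-parameterized this is, we can count the weights and biases layer by layer. A quick sketch (Python, with the layer sizes listed above) shows we end up with nearly 1,900 parameters for 506 observations:</p>

```python
# Layer sizes from the list above: 13 -> 64 -> 15 -> 1.
sizes = [13, 64, 15, 1]

# Each layer contributes a weight matrix (out x in) plus one bias per output unit:
# (64*13 + 64) + (15*64 + 15) + (1*15 + 1) parameters in total.
n_params = sum(n_out * n_in + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

<p>That is 1887 free parameters fitted on 506 observations, which is exactly the recipe for the overfitting we will observe later.</p>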
<p>The <script type="math/tex">\omega</script> is not the only thing we need to change. We also need a new
prediction function. Instead of making it specific to our network, we will
write one that works for any number of layers. It’s given below. Notice the
ReLU function in the hidden nodes; if you don’t know why this is a good idea,
there are several papers that explain it in great detail. The short version is
that it helps with the vanishing gradient problem in deep networks.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">mat</span><span class="x">(</span><span class="n">x</span><span class="x">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">2</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">ω</span><span class="x">)</span><span class="o">-</span><span class="mi">2</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">relu</span><span class="o">.</span><span class="x">(</span><span class="n">ω</span><span class="x">[</span><span class="n">i</span><span class="x">]</span><span class="o">*</span><span class="n">x</span> <span class="o">.+</span> <span class="n">ω</span><span class="x">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="x">])</span>
<span class="k">end</span>
<span class="k">return</span> <span class="n">ω</span><span class="x">[</span><span class="k">end</span><span class="o">-</span><span class="mi">1</span><span class="x">]</span><span class="o">*</span><span class="n">x</span> <span class="o">.+</span> <span class="n">ω</span><span class="x">[</span><span class="k">end</span><span class="x">]</span>
<span class="k">end</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>predict (generic function with 1 method)
</code></pre></div></div>
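<p>The same layer loop reads naturally in a Python/NumPy sketch too: ReLU after every affine map except the last, exactly mirroring the Julia version above (the weights here are random stand-ins):</p>

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(w, x):
    """Forward pass for any [weights, bias, weights, bias, ...] list."""
    for i in range(0, len(w) - 2, 2):
        x = relu(w[i] @ x + w[i + 1])   # hidden layers: affine map + ReLU
    return w[-2] @ x + w[-1]            # output layer: affine only, no activation

# Random weights with the 13 -> 64 -> 15 -> 1 shapes used in the post.
rng = np.random.default_rng(0)
w = [0.1 * rng.normal(size=(64, 13)), np.zeros((64, 1)),
     0.1 * rng.normal(size=(15, 64)), np.zeros((15, 1)),
     0.1 * rng.normal(size=(1, 15)),  np.zeros((1, 1))]

out = predict(w, rng.normal(size=(13, 5)))  # 5 observations in, 5 predictions out
```

<p>Because the loop only assumes the alternating weights/bias layout, adding another hidden layer is just a matter of appending two more arrays to the list.</p>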
<p>For the loss and the gradient of the loss we use exactly the same code! No
changes needed. This is one of AutoGrad’s superpowers: it can differentiate
almost any Julia function. The same is true for our training function; it
doesn’t change either, nor does the loop where we apply it. Cool stuff my
friends.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">abs2</span><span class="x">,</span> <span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span><span class="o">-</span><span class="n">y</span><span class="x">)</span>
<span class="n">lossgradient</span> <span class="o">=</span> <span class="n">grad</span><span class="x">(</span><span class="n">loss</span><span class="x">)</span>
<span class="n">errdf</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span><span class="n">Epoch</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">60</span><span class="x">,</span> <span class="n">Error</span><span class="o">=</span><span class="mf">0.0</span><span class="x">)</span>
<span class="n">cntr</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">600</span>
<span class="n">train</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="x">[(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)])</span>
<span class="k">if</span> <span class="n">mod</span><span class="x">(</span><span class="n">i</span><span class="x">,</span> <span class="mi">10</span><span class="x">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">Epoch</span><span class="x">]</span><span class="o">=</span><span class="n">i</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">Error</span><span class="x">]</span><span class="o">=</span><span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">cntr</span><span class="o">+=</span><span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Let’s then have a look at the naive performance, shall we? We start by
showing the same plots as we did for the linear model. It might not be super
obvious from the plots alone, but the error went down by a lot.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p3</span> <span class="o">=</span> <span class="n">scatter</span><span class="x">(</span><span class="n">errdf</span><span class="x">[:,:</span><span class="n">Epoch</span><span class="x">],</span> <span class="n">errdf</span><span class="x">[:,:</span><span class="n">Error</span><span class="x">],</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Epoch"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Error"</span><span class="x">)</span>
<span class="n">p4</span> <span class="o">=</span> <span class="n">scatter</span><span class="x">(</span><span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">)</span><span class="err">'</span><span class="x">,</span> <span class="n">y</span><span class="err">'</span><span class="x">,</span> <span class="n">reg</span><span class="o">=</span><span class="n">true</span><span class="x">,</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Predicted"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Observed"</span><span class="x">)</span>
<span class="n">plot</span><span class="x">(</span><span class="n">p3</span><span class="x">,</span> <span class="n">p4</span><span class="x">,</span> <span class="n">layout</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">size</span><span class="o">=</span><span class="x">(</span><span class="mi">950</span><span class="x">,</span><span class="mi">500</span><span class="x">))</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_10_1.svg" alt="" /></p>
<p>But the interesting comparison is of course how much better the fit really
is. We can show the correlation plots from both models next to each other. The
correlation for the first model was 0.85, while for our latest version it is
0.98.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot</span><span class="x">(</span><span class="n">p2</span><span class="x">,</span> <span class="n">p4</span><span class="x">,</span> <span class="n">layout</span><span class="o">=</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">size</span><span class="o">=</span><span class="x">(</span><span class="mi">950</span><span class="x">,</span><span class="mi">500</span><span class="x">))</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_11_1.svg" alt="" /></p>
<p>As the complexity is probably high enough, it makes sense to check whether
the model is too flexible by running a validation set alongside our fitting
process. This is usually rather instructive when dealing with highly
parameterized functions. We start by splitting our data set into training and
testing parts.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xtrn</span><span class="x">,</span> <span class="n">xtst</span> <span class="o">=</span> <span class="n">x</span><span class="x">[:,</span> <span class="mi">1</span><span class="x">:</span><span class="mi">400</span><span class="x">],</span> <span class="n">x</span><span class="x">[:,</span> <span class="mi">401</span><span class="x">:</span><span class="k">end</span><span class="x">]</span>
<span class="n">ytrn</span><span class="x">,</span> <span class="n">ytst</span> <span class="o">=</span> <span class="n">y</span><span class="x">[:,</span> <span class="mi">1</span><span class="x">:</span><span class="mi">400</span><span class="x">],</span> <span class="n">y</span><span class="x">[:,</span> <span class="mi">401</span><span class="x">:</span><span class="k">end</span><span class="x">]</span>
<span class="n">ω</span> <span class="o">=</span> <span class="kt">Any</span><span class="x">[</span><span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">64</span><span class="x">,</span><span class="mi">13</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">64</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span>
<span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">15</span><span class="x">,</span><span class="mi">64</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">15</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span>
<span class="mf">0.1f0</span><span class="o">*</span><span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">1</span><span class="x">,</span><span class="mi">15</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span><span class="mi">1</span><span class="x">,</span><span class="mi">1</span><span class="x">)]</span>
<span class="n">errdf</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="x">(</span><span class="n">Epoch</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">60</span><span class="x">,</span> <span class="n">TrnError</span><span class="o">=</span><span class="mf">0.0</span><span class="x">,</span> <span class="n">ValError</span><span class="o">=</span><span class="mf">0.0</span><span class="x">)</span>
<span class="n">cntr</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">600</span>
<span class="n">train</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="x">[(</span><span class="n">xtrn</span><span class="x">,</span> <span class="n">ytrn</span><span class="x">)])</span>
<span class="k">if</span> <span class="n">mod</span><span class="x">(</span><span class="n">i</span><span class="x">,</span> <span class="mi">10</span><span class="x">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">Epoch</span><span class="x">]</span><span class="o">=</span><span class="n">i</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">TrnError</span><span class="x">]</span><span class="o">=</span><span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span><span class="n">xtrn</span><span class="x">,</span><span class="n">ytrn</span><span class="x">)</span>
<span class="n">errdf</span><span class="x">[</span><span class="n">cntr</span><span class="x">,</span> <span class="x">:</span><span class="n">ValError</span><span class="x">]</span><span class="o">=</span><span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span><span class="n">xtst</span><span class="x">,</span><span class="n">ytst</span><span class="x">)</span>
<span class="n">cntr</span><span class="o">+=</span><span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>After the training we can inspect what happens to the training and validation error over time. What you see is extremely typical for neural networks that are not regularized or treated in a Bayesian way. Initially both the training error and the validation error go down. However, as the model gets better and better at fitting the training set, it gets worse on the validation set, which is not part of the training.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">StatPlots</span>
<span class="nd">@df</span> <span class="n">errdf</span><span class="x">[</span><span class="mi">5</span><span class="x">:</span><span class="mi">60</span><span class="x">,:]</span> <span class="n">plot</span><span class="x">(:</span><span class="n">Epoch</span><span class="x">,</span> <span class="x">[:</span><span class="n">ValError</span><span class="x">,</span> <span class="x">:</span><span class="n">TrnError</span><span class="x">],</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Epoch"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Error"</span><span class="x">,</span>
<span class="n">label</span><span class="o">=</span><span class="x">[</span><span class="s">"Validation"</span> <span class="s">"Training"</span><span class="x">],</span> <span class="n">lw</span><span class="o">=</span><span class="mi">3</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_13_1.svg" alt="" /></p>
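<p>One common response to this pattern is early stopping: keep the weights from the epoch where the validation error bottoms out. A minimal sketch over logged errors (the numbers here are hypothetical, not taken from the run above; recall that the training loop recorded errors every 10th epoch):</p>

```julia
# Pick the epoch with the lowest validation error (early stopping).
valerr = [0.9, 0.5, 0.3, 0.25, 0.27, 0.31]   # hypothetical logged values
epochs = 10 .* (1:length(valerr))             # logged every 10th epoch
best_epoch = epochs[argmin(valerr)]           # keep the weights from this epoch
```

In practice you would snapshot the weights alongside each logged error and restore the snapshot belonging to <code>best_epoch</code>.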
<h1 id="deep-convolutional-networks-on-mnist">Deep convolutional networks on MNIST</h1>
<p>The Boston Housing data set is, relatively speaking, a rather simple job for any well versed machine learning girl/guy. As such we need something with a bit more bite to show what Julia can really do. So without further ado we turn to convolutional neural networks. We will implement LeNet, which is an old fart in the game but nicely illustrates both complexity and power.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">Knet</span>
<span class="n">using</span> <span class="n">Plots</span>
<span class="n">using</span> <span class="n">StatPlots</span>
<span class="c"># Define necessary functions</span>
<span class="k">function</span><span class="nf"> predict</span><span class="x">(</span><span class="n">w</span><span class="x">,</span> <span class="n">x0</span><span class="x">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">pool</span><span class="x">(</span><span class="n">relu</span><span class="o">.</span><span class="x">(</span><span class="n">conv4</span><span class="x">(</span><span class="n">w</span><span class="x">[</span><span class="mi">1</span><span class="x">],</span><span class="n">x0</span><span class="x">)</span> <span class="o">.+</span> <span class="n">w</span><span class="x">[</span><span class="mi">2</span><span class="x">]))</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">pool</span><span class="x">(</span><span class="n">relu</span><span class="o">.</span><span class="x">(</span><span class="n">conv4</span><span class="x">(</span><span class="n">w</span><span class="x">[</span><span class="mi">3</span><span class="x">],</span><span class="n">x1</span><span class="x">)</span> <span class="o">.+</span> <span class="n">w</span><span class="x">[</span><span class="mi">4</span><span class="x">]))</span>
<span class="n">x3</span> <span class="o">=</span> <span class="n">relu</span><span class="o">.</span><span class="x">(</span><span class="n">w</span><span class="x">[</span><span class="mi">5</span><span class="x">]</span><span class="o">*</span><span class="n">mat</span><span class="x">(</span><span class="n">x2</span><span class="x">)</span> <span class="o">.+</span> <span class="n">w</span><span class="x">[</span><span class="mi">6</span><span class="x">])</span>
<span class="k">return</span> <span class="n">w</span><span class="x">[</span><span class="mi">7</span><span class="x">]</span><span class="o">*</span><span class="n">x3</span> <span class="o">.+</span> <span class="n">w</span><span class="x">[</span><span class="mi">8</span><span class="x">]</span>
<span class="k">end</span>
<span class="n">loss</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">,</span> <span class="n">ygold</span><span class="x">)</span> <span class="o">=</span> <span class="n">nll</span><span class="x">(</span><span class="n">predict</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">x</span><span class="x">),</span> <span class="n">ygold</span><span class="x">)</span>
<span class="n">lossgradient</span> <span class="o">=</span> <span class="n">grad</span><span class="x">(</span><span class="n">loss</span><span class="x">)</span>
<span class="k">function</span><span class="nf"> train</span><span class="x">(</span><span class="n">model</span><span class="x">,</span> <span class="n">data</span><span class="x">,</span> <span class="n">optim</span><span class="x">)</span>
<span class="k">for</span> <span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span> <span class="k">in</span> <span class="n">data</span>
<span class="n">grads</span> <span class="o">=</span> <span class="n">lossgradient</span><span class="x">(</span><span class="n">model</span><span class="x">,</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span>
<span class="n">update!</span><span class="x">(</span><span class="n">model</span><span class="x">,</span> <span class="n">grads</span><span class="x">,</span> <span class="n">optim</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="c"># Load the data</span>
<span class="n">include</span><span class="x">(</span><span class="n">Knet</span><span class="o">.</span><span class="n">dir</span><span class="x">(</span><span class="s">"data"</span><span class="x">,</span><span class="s">"mnist.jl"</span><span class="x">))</span>
<span class="n">xtrn</span><span class="x">,</span> <span class="n">ytrn</span><span class="x">,</span> <span class="n">xtst</span><span class="x">,</span> <span class="n">ytst</span> <span class="o">=</span> <span class="n">mnist</span><span class="x">()</span>
<span class="n">dtrn</span> <span class="o">=</span> <span class="n">minibatch</span><span class="x">(</span><span class="n">xtrn</span><span class="x">,</span> <span class="n">ytrn</span><span class="x">,</span> <span class="mi">100</span><span class="x">)</span>
<span class="n">dtst</span> <span class="o">=</span> <span class="n">minibatch</span><span class="x">(</span><span class="n">xtst</span><span class="x">,</span> <span class="n">ytst</span><span class="x">,</span> <span class="mi">100</span><span class="x">)</span>
<span class="c"># Initialise neural network</span>
<span class="n">ω</span> <span class="o">=</span> <span class="kt">Any</span><span class="x">[</span><span class="n">xavier</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">5</span><span class="x">,</span> <span class="mi">5</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="mi">20</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">1</span><span class="x">),</span>
<span class="n">xavier</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">5</span><span class="x">,</span> <span class="mi">5</span><span class="x">,</span> <span class="mi">20</span><span class="x">,</span> <span class="mi">50</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="mi">1</span><span class="x">,</span> <span class="mi">50</span><span class="x">,</span> <span class="mi">1</span><span class="x">),</span>
<span class="n">xavier</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">500</span><span class="x">,</span> <span class="mi">800</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">500</span><span class="x">,</span> <span class="mi">1</span><span class="x">),</span>
<span class="n">xavier</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">10</span><span class="x">,</span> <span class="mi">500</span><span class="x">),</span> <span class="n">zeros</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">10</span><span class="x">,</span> <span class="mi">1</span><span class="x">)</span> <span class="x">]</span>
<span class="n">o</span> <span class="o">=</span> <span class="n">optimizers</span><span class="x">(</span><span class="n">ω</span><span class="x">,</span> <span class="n">Adam</span><span class="x">)</span>
</code></pre></div></div>
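<p>It is worth checking where the 500×800 fully connected weight gets its 800 columns from. With valid convolutions (output size = input − kernel + 1, which is <code>conv4</code>'s default) and 2×2 max pooling with stride 2, a 28×28 MNIST image shrinks as follows:</p>

```julia
# Trace the spatial dimensions of LeNet for a 28×28 MNIST input.
convdim(n, k) = n - k + 1   # "valid" convolution, conv4's default
pooldim(n) = n ÷ 2          # 2×2 max pooling with stride 2
h1 = pooldim(convdim(28, 5))   # 28 → 24 → 12
h2 = pooldim(convdim(h1, 5))   # 12 → 8 → 4
nfeatures = h2 * h2 * 50       # 4 * 4 * 50 = 800 flattened features
```

This matches the <code>xavier(Float32, 500, 800)</code> initialisation of the first dense layer above.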
<p>The last part of the code, which actually does the training, I will not execute because it basically takes forever on the non-GPU laptop I’m currently writing this on :) I have reached out to the developer of Knet, and it appears the entire library was primarily developed with a GPU computational platform in mind. That is not an unreasonable choice given how well neural networks map onto GPUs. However, it doesn’t help us when we want to do deep learning on a CPU-based machine.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># println((:epoch, 0, :trn, accuracy(ω, dtrn, predict), :tst, accuracy(ω, dtst, predict)))</span>
<span class="c"># for epoch=1:10</span>
<span class="c"># train(ω, dtrn, o)</span>
<span class="c"># println((:epoch, epoch, :trn, accuracy(ω, dtrn, predict), :tst, accuracy(ω, dtst, predict)))</span>
<span class="c"># end</span>
</code></pre></div></div>
<p>So armed with this information we look to another nice deep learning framework called MXNet. We can use this to define a convolutional neural network like we did with Knet. MXNet is a different framework, so of course defining a network looks different.</p>
<p>We start by loading the package in Julia and declaring a Variable in the MXNet framework called “data”. This will serve as our reference to a data set that we wish to run the network on. Before we move on I should say that MXNet allows for two programming paradigms.</p>
<ul>
<li>Imperative</li>
<li>Symbolic</li>
</ul>
<p>The symbolic paradigm is the one used by TensorFlow: you define a computational graph and then do operations on it, so nothing gets executed until you run data through it. The imperative paradigm works more like normal programming, executed line by line. Both approaches have their benefits and caveats, and fortunately MXNet and PyTorch support both. For the remainder of this post we will use the symbolic interface.</p>
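<p>To make the distinction concrete, here is a toy illustration of the symbolic idea in plain Julia (this is not MXNet’s API, just the principle): we first build a description of the computation, and data only flows through it when we explicitly evaluate the graph.</p>

```julia
# A toy symbolic graph: building it computes nothing; evaluation happens later.
node(op, args...) = (op, args)
evalnode(s::Symbol, env) = env[s]          # leaf: look up a bound value
function evalnode(n::Tuple, env)
    op, args = n
    vals = map(a -> evalnode(a, env), args)
    op == :add ? sum(vals) : prod(vals)
end
graph = node(:add, node(:mul, :x, :x), :x) # symbolic x^2 + x
evalnode(graph, Dict(:x => 3))             # only now is it executed → 12
```

The imperative style would instead just write <code>x^2 + x</code> and compute the answer immediately.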
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">MXNet</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">Variable</span><span class="x">(:</span><span class="n">data</span><span class="x">)</span>
</code></pre></div></div>
<p>We will again sort of replicate the LeNet model. The first convolutional layer is specified like this:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">conv1</span> <span class="o">=</span> <span class="nd">@mx.chain</span> <span class="n">mx</span><span class="o">.</span><span class="n">Convolution</span><span class="x">(</span><span class="n">data</span><span class="x">,</span> <span class="n">kernel</span><span class="o">=</span><span class="x">(</span><span class="mi">5</span><span class="x">,</span><span class="mi">5</span><span class="x">),</span> <span class="n">num_filter</span><span class="o">=</span><span class="mi">20</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">Activation</span><span class="x">(</span><span class="n">act_type</span><span class="o">=</span><span class="x">:</span><span class="n">tanh</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">Pooling</span><span class="x">(</span><span class="n">pool_type</span><span class="o">=</span><span class="x">:</span><span class="n">max</span><span class="x">,</span> <span class="n">kernel</span><span class="o">=</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">stride</span><span class="o">=</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">))</span>
</code></pre></div></div>
<p>Notice the activation function as well as the pooling. We use 20 filters with a 5 by 5 kernel. On top of that we add a max pooling layer with a 2 by 2 kernel and a 2 by 2 stride. The second layer uses 50 filters with the same tactics. For both convolutional layers we use the tanh activation function.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">conv2</span> <span class="o">=</span> <span class="nd">@mx.chain</span> <span class="n">mx</span><span class="o">.</span><span class="n">Convolution</span><span class="x">(</span><span class="n">conv1</span><span class="x">,</span> <span class="n">kernel</span><span class="o">=</span><span class="x">(</span><span class="mi">5</span><span class="x">,</span><span class="mi">5</span><span class="x">),</span> <span class="n">num_filter</span><span class="o">=</span><span class="mi">50</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">Activation</span><span class="x">(</span><span class="n">act_type</span><span class="o">=</span><span class="x">:</span><span class="n">tanh</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">Pooling</span><span class="x">(</span><span class="n">pool_type</span><span class="o">=</span><span class="x">:</span><span class="n">max</span><span class="x">,</span> <span class="n">kernel</span><span class="o">=</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">),</span> <span class="n">stride</span><span class="o">=</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="mi">2</span><span class="x">))</span>
</code></pre></div></div>
<p>We end the network structure with two fully connected layers of size 500 and 10 respectively. Further, the softmax function is applied to the output layer to make the outputs sum to one for each data point. Mind you, this does NOT turn the outputs into probabilities!</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fc1</span> <span class="o">=</span> <span class="nd">@mx.chain</span> <span class="n">mx</span><span class="o">.</span><span class="n">Flatten</span><span class="x">(</span><span class="n">conv2</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">FullyConnected</span><span class="x">(</span><span class="n">num_hidden</span><span class="o">=</span><span class="mi">500</span><span class="x">)</span> <span class="o">=></span>
<span class="n">mx</span><span class="o">.</span><span class="n">Activation</span><span class="x">(</span><span class="n">act_type</span><span class="o">=</span><span class="x">:</span><span class="n">tanh</span><span class="x">)</span>
<span class="n">fc2</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">FullyConnected</span><span class="x">(</span><span class="n">fc1</span><span class="x">,</span> <span class="n">num_hidden</span><span class="o">=</span><span class="mi">10</span><span class="x">)</span>
<span class="n">lenet</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">SoftmaxOutput</span><span class="x">(</span><span class="n">fc2</span><span class="x">,</span> <span class="n">name</span><span class="o">=</span><span class="x">:</span><span class="n">softmax</span><span class="x">)</span>
</code></pre></div></div>
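<p>For reference, the softmax that <code>SoftmaxOutput</code> applies to the 10 raw outputs <script type="math/tex">z</script> is the standard</p>
<script type="math/tex; mode=display">\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}}</script>
<p>which is positive and sums to one over the classes, whence the temptation to read it as a probability.</p>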
<h2 id="load-data">Load data</h2>
<p>Before we move on to the training of the newly specified network we need to fetch the data and convert it into a format that MXNet understands. The code below starts out by specifying some choices for the data fetch and then downloads MNIST into Pkg.dir(“MXNet”)/data/mnist if it does not already exist. After that, data providers are created which allow us to iterate over the data sets.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># load data</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">data_name</span> <span class="o">=</span> <span class="x">:</span><span class="n">data</span>
<span class="n">label_name</span> <span class="o">=</span> <span class="x">:</span><span class="n">softmax_label</span>
<span class="n">filenames</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">get_mnist_ubyte</span><span class="x">()</span>
<span class="n">train_provider</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">MNISTProvider</span><span class="x">(</span><span class="n">image</span><span class="o">=</span><span class="n">filenames</span><span class="x">[:</span><span class="n">train_data</span><span class="x">],</span>
<span class="n">label</span><span class="o">=</span><span class="n">filenames</span><span class="x">[:</span><span class="n">train_label</span><span class="x">],</span>
<span class="n">data_name</span><span class="o">=</span><span class="n">data_name</span><span class="x">,</span> <span class="n">label_name</span><span class="o">=</span><span class="n">label_name</span><span class="x">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="x">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="n">true</span><span class="x">,</span> <span class="n">flat</span><span class="o">=</span><span class="n">false</span><span class="x">,</span> <span class="n">silent</span><span class="o">=</span><span class="n">true</span><span class="x">)</span>
<span class="n">eval_provider</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">MNISTProvider</span><span class="x">(</span><span class="n">image</span><span class="o">=</span><span class="n">filenames</span><span class="x">[:</span><span class="n">test_data</span><span class="x">],</span>
<span class="n">label</span><span class="o">=</span><span class="n">filenames</span><span class="x">[:</span><span class="n">test_label</span><span class="x">],</span>
<span class="n">data_name</span><span class="o">=</span><span class="n">data_name</span><span class="x">,</span> <span class="n">label_name</span><span class="o">=</span><span class="n">label_name</span><span class="x">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="x">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="n">false</span><span class="x">,</span> <span class="n">flat</span><span class="o">=</span><span class="n">false</span><span class="x">,</span> <span class="n">silent</span><span class="o">=</span><span class="n">true</span><span class="x">)</span>
</code></pre></div></div>
<p>This gives us a training set and a validation set on the MNIST data from our MXNet package. To start using this data on our specified model we need to define the context in which we are going to train it. In this case, as I said before, we will use the CPU. If you have a GPU at your disposal I recommend exchanging the context for mx.gpu() instead. Besides setting the context we need to define an optimizer, and we’ll stick with stochastic gradient descent. The parameters are standard choices; do feel free to play around with them.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">FeedForward</span><span class="x">(</span><span class="n">lenet</span><span class="x">,</span> <span class="n">context</span><span class="o">=</span><span class="n">mx</span><span class="o">.</span><span class="n">cpu</span><span class="x">())</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">mx</span><span class="o">.</span><span class="n">SGD</span><span class="x">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.05</span><span class="x">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="x">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.00001</span><span class="x">)</span>
<span class="c"># mx.fit(model, optimizer, train_provider, n_epoch=10, eval_data=eval_provider)</span>
</code></pre></div></div>
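<p>Under the hood, a momentum SGD optimizer keeps a velocity per weight. A generic sketch of one update step looks like this (the exact placement of the weight-decay term inside MXNet’s implementation may differ):</p>

```julia
# One generic SGD-with-momentum step; lr, momentum and wd mirror the
# hyperparameters passed to mx.SGD above.
function sgd_step!(w, v, g; lr=0.05, momentum=0.9, wd=1e-5)
    @. v = momentum * v - lr * (g + wd * w)   # update the velocity
    @. w = w + v                              # move the weights along it
    return w
end
w = [1.0, -2.0]; v = zeros(2); g = [0.1, -0.1]
sgd_step!(w, v, g)
```

The momentum term lets successive gradients that point in the same direction accumulate speed, which is why it typically converges faster than plain SGD.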
<p>The actual fit is commented out as it still takes a long time on my machine, so please have a look at my repository for the full code. Much like a cooking show, I will instead restore some data from an offline run. I have loaded the probabilities for each digit class for each example, along with the correct labels and an array of what I am calling outliers.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">using</span> <span class="n">JLD</span>
<span class="n">probs</span> <span class="o">=</span> <span class="nb">load</span><span class="x">(</span><span class="s">"mymxnetresults.jld"</span><span class="x">,</span> <span class="s">"probs"</span><span class="x">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="nb">load</span><span class="x">(</span><span class="s">"mymxnetresults.jld"</span><span class="x">,</span> <span class="s">"labels"</span><span class="x">)</span>
<span class="n">outliers</span> <span class="o">=</span> <span class="nb">load</span><span class="x">(</span><span class="s">"mymxnetresults.jld"</span><span class="x">,</span> <span class="s">"outliers"</span><span class="x">)</span>
</code></pre></div></div>
<p>Now we can use this data to compute the accuracy, which, as you can see below, is quite OK on a validation set. This is by no means the best achievable on MNIST, but it suffices to illustrate what we can do with deep convolutional nets.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">correct</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">labels</span><span class="x">)</span>
<span class="c"># labels are 0...9</span>
<span class="k">if</span> <span class="n">indmax</span><span class="x">(</span><span class="n">probs</span><span class="x">[:,</span><span class="n">i</span><span class="x">])</span> <span class="o">==</span> <span class="n">labels</span><span class="x">[</span><span class="n">i</span><span class="x">]</span><span class="o">+</span><span class="mi">1</span>
<span class="n">correct</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="mi">100</span><span class="n">correct</span><span class="o">/</span><span class="n">length</span><span class="x">(</span><span class="n">labels</span><span class="x">)</span>
<span class="n">println</span><span class="x">(</span><span class="n">mx</span><span class="o">.</span><span class="n">format</span><span class="x">(</span><span class="s">"Accuracy on eval set: {1:.2f}%"</span><span class="x">,</span> <span class="n">accuracy</span><span class="x">))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy on eval set: 98.93%
</code></pre></div></div>
<p>As stated, the accuracy is quite fine. However, there’s an interesting point I would like to make: what about the network’s internal belief and confidence in what it’s seeing? To highlight this we will look at the outliers, which consist of all examples the neural net got wrong.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tmpdf</span><span class="o">=</span><span class="n">DataFrame</span><span class="x">(</span><span class="n">outliers</span><span class="x">)</span>
</code></pre></div></div>
<p>First we will plot the predicted class against the internal confidence of the network in its classification.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scatter</span><span class="x">(</span><span class="nb">convert</span><span class="x">(</span><span class="n">Array</span><span class="x">,</span> <span class="n">tmpdf</span><span class="x">[</span><span class="mi">2</span><span class="x">,:])</span><span class="err">'</span><span class="x">,</span> <span class="nb">convert</span><span class="x">(</span><span class="n">Array</span><span class="x">,</span> <span class="n">tmpdf</span><span class="x">[</span><span class="mi">4</span><span class="x">,:])</span><span class="err">'</span><span class="x">,</span> <span class="n">ylim</span><span class="o">=</span><span class="x">(</span><span class="mi">0</span><span class="x">,</span> <span class="mi">1</span><span class="x">),</span>
<span class="n">ylabel</span><span class="o">=</span><span class="s">"Likelihood"</span><span class="x">,</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">"Wrongly Predicted class"</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_25_1.svg" alt="" /></p>
<p>Now we can see that the network is mostly quite convinced of its findings even when it’s dead wrong. It’s above 40% sure every time. But let’s do better and have a look at the histogram of likelihoods on the erroneously classified digits.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">histogram</span><span class="x">(</span><span class="nb">convert</span><span class="x">(</span><span class="n">Array</span><span class="x">,</span> <span class="n">tmpdf</span><span class="x">[</span><span class="mi">4</span><span class="x">,:])</span><span class="err">'</span><span class="x">,</span> <span class="n">xlims</span><span class="o">=</span><span class="x">(</span><span class="mi">0</span><span class="x">,</span><span class="mi">1</span><span class="x">),</span>
<span class="n">xlabel</span><span class="o">=</span><span class="s">"Likelihood"</span><span class="x">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"Frequency"</span><span class="x">)</span>
</code></pre></div></div>
<p><img src="/images/figure/docbostonhousing_26_1.svg" alt="" /></p>
<p>Here it’s clear to see that the vast majority of the mass is well above 80%, which again shows how convinced the network is even when it’s plain wrong.</p>
<h1 id="conclusions">Conclusions</h1>
<p>Julia is an up-and-coming language that I believe will be well suited for deep learning applications and research. For a while longer I believe the early adopters of the language will be research engineers and scientists. There are a lot of things missing in the current packages compared to the Python and R universe, but that is not strange given how young the language is. Further, Julia takes care of a lot of issues that Python and R currently have.
So give it a go and see how you like it!</p>
<p>Happy inferencing!</p>
<h1 id="links-and-resources">Links and resources</h1>
<ul>
<li>Knet: (http://denizyuret.github.io/Knet.jl/latest/)</li>
<li>MXNet: (http://mxnet.incubator.apache.org/)</li>
<li>Julia: (https://julialang.org/)</li>
</ul>
<p>Michael Green</p>
<h1 id="on-the-apparent-success-of-the-maximum-likelihood-principle">On the apparent success of the maximum likelihood principle (2017-07-28)</h1>
<p>Today we will run through an important concept in statistical learning theory
and modeling in general. It may come as no surprise that my point is as usual
“age quod agis”. This is a lifelong strive for me to convey that message to
fellow scientists and business people alike. Anyway, back to the topic. We will
have a look at why the Bayesian treatment of models is fundamentally important
to everyone and not only a select few mathematically inclined experts. The model
we will use for this post is a time series model describing Milk sales over
time. The model specification is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, \sigma)\\
\mu_t &=\sum_{i=1}^{7}\beta_{i} x_{t,i} + \beta_0\\
\sigma &\sim U(0.01, \infty)
\end{align} %]]></script>
<p>which is a standard linear model. The <script type="math/tex">y_t</script> is the observed Milk sales units
at time <script type="math/tex">t</script> and the <script type="math/tex">x_{t,i}</script> is the indicator variable for weekday <script type="math/tex">i</script> at
time <script type="math/tex">t</script>. As per usual <script type="math/tex">\beta_0</script> serves as our intercept. A small sample of
the data set looks like this</p>
<table>
<thead>
<tr>
<th style="text-align: right">y</th>
<th style="text-align: right">WDay1</th>
<th style="text-align: right">WDay2</th>
<th style="text-align: right">WDay3</th>
<th style="text-align: right">WDay4</th>
<th style="text-align: right">WDay5</th>
<th style="text-align: right">WDay6</th>
<th style="text-align: right">WDay7</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">4331</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td style="text-align: right">6348</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td style="text-align: right">5804</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td style="text-align: right">6897</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td style="text-align: right">8428</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: right">6725</td>
<td style="text-align: right">1</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
</tr>
</tbody>
</table>
<p>which, for the response variable <script type="math/tex">y</script>, looks like the distributional plot below.</p>
<p><img src="/images/figure/problemplot-1.png" alt="plot of chunk problemplot" /></p>
<p>Those of you with modeling experience will see that a mere intra-weekly
seasonality will not be enough to capture all the interesting parts of this
particular series, but for the point I’m trying to make it will work just fine
to stick with seasonality plus an intercept.</p>
<h2 id="estimating-the-parameters-of-the-model">Estimating the parameters of the model</h2>
<p>We’re going to estimate the parameters of this model by</p>
<ol>
<li>The full Bayesian treatment, i.e., we’re going to estimate <script type="math/tex">p(\beta\vert y, X)</script></li>
<li>The Maximum Likelihood, i.e., we’re going to estimate <script type="math/tex">p(y\vert \beta, X)</script>, which in the tables and plots will be referred to as “Freq”, from the term “Frequentist”, a term I inherently dislike, but I made the tables and plots a while ago, so bear with me.</li>
</ol>
<p>If you remember your probability theory training you know that <script type="math/tex">p(\beta\vert
y, X) \neq p(y\vert \beta, X)</script>. Sure, but so what? Well, this matters a lot. In
order to see why, let’s dig into these terms. First off, let’s have a look at the
proper full Bayesian treatment. We can express that posterior distribution using
three terms, namely the</p>
<ol>
<li><strong>Likelihood</strong>,</li>
<li>the <strong>Prior</strong> and</li>
<li>the <strong>Evidence</strong>.</li>
</ol>
<script type="math/tex; mode=display">p(\beta\vert y, X)=\frac{p(y\vert \beta, X)p(\beta\vert X)}{\int p(y,\beta, X) d\beta}</script>
<p>The Evidence is the denominator and serves as a normalization factor that allows
us to talk about probabilities in the first place. The numerator consists of two
terms: the Likelihood (to the left) and the Prior (to the right). It’s worth
noting here that the prior for <script type="math/tex">\beta</script> may very well depend on the
covariates as such, and even on the response variable should we wish to venture
into empirical priors. Explained in plain words, the equation above states that
we wish to estimate the posterior probability of our parameters <script type="math/tex">\beta</script> by
weighting our prior knowledge and assumptions about those parameters with the
plausibility of them generating a data set like ours, normalized by the
plausibility of the data itself under the existing mathematical model. Now
doesn’t that sound reasonable? I think it does.</p>
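<p>To make these three terms a bit more concrete, here’s a minimal grid approximation of a posterior for a single mean parameter. This is just an illustrative sketch in Python (the actual analysis in this post is done in R and Stan), using a handful of the sales values from the table above and a fixed, assumed <script type="math/tex">\sigma</script>:</p>

```python
import numpy as np

# Toy data: a few observations from the sample table, with a fixed,
# assumed sigma for simplicity (this is NOT the post's actual model fit).
y = np.array([4331.0, 6348.0, 5804.0, 6897.0, 8428.0, 6725.0])
sigma = 1500.0

# Grid of candidate values for the mean parameter beta.
beta = np.linspace(2000, 12000, 2001)

# Log likelihood p(y | beta): sum of normal log densities per candidate.
log_lik = np.array([np.sum(-0.5 * ((y - b) / sigma) ** 2) for b in beta])

# Prior p(beta): flat over the grid, like the uninformed priors above.
log_prior = np.zeros_like(beta)

# Posterior = likelihood * prior / evidence; the evidence is the
# normalizing sum over the grid (max subtracted for numerical stability).
unnorm = np.exp(log_lik + log_prior - np.max(log_lik + log_prior))
posterior = unnorm / np.sum(unnorm)

print(beta[np.argmax(posterior)])  # posterior mode, close to mean(y)
```

<p>The division by the grid sum is exactly the role the Evidence plays: without it the numbers are merely relative plausibilities, not probabilities.</p>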
<p>Now if we look into the same kind of analysis for what the Maximum Likelihood
method does we find the following equation</p>
<script type="math/tex; mode=display">p(y\vert \beta, X)=\frac{p(\beta\vert y, X)}{p(\beta\vert X)}\int p(y,\beta, X) d\beta</script>
<p>which states that the probability of observing a data set like ours given fixed
<script type="math/tex">\beta</script>’s is the posterior probability of the <script type="math/tex">\beta</script>’s divided by our prior
assumptions, scaled by the total plausibility of the data itself. Now this also
sounds reasonable, and it is. The only problem is that the quantity on the left
hand side is not sampled; it is maximized in Maximum Likelihood. Hence the
name. On top of that, what you do in 99% of all cases is ignore the right hand
side in the equation above and just postulate that <script type="math/tex">p(y\vert
\beta,X)=\mathcal{N}(\mu,\sigma)</script>, which is a rather rough statement to begin
with, but let’s not dive into that right now. So when you maximize this
expression, what are you actually doing? Tadam! You’re doing data fitting. This
might seem like a good thing, but it’s not. Basically you’re generating every
conceivable hypothesis known to the model at hand and picking the one that
happens to coincide best with your, in most cases, tiny data set. That’s not
even the worst part. The worst part is that once the fitting is done, you won’t
even be able to express yourself about the uncertainty of the parameters of
your model!</p>
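<p>The data-fitting nature of ML is easy to see in code: for a normal likelihood, maximizing <script type="math/tex">p(y\vert \beta,X)</script> in <script type="math/tex">\beta</script> collapses to ordinary least squares, returning a single point and nothing else. A quick sketch (Python, with a made-up full-rank design, purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy design: intercept plus two continuous covariates.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([5.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# For y ~ N(X beta, sigma), maximizing the likelihood over beta is
# exactly least squares: one point estimate, no distribution over beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true, but just a single point
```

<p>Everything a Bayesian posterior would tell you about competing parameter values is thrown away at this step.</p>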
<p>Now that we have skimmed through the surface of the math behind the two
methodologies we’re ready to look at some results and do the real analysis.</p>
<h2 id="technical-setup">Technical setup</h2>
<p>The Bayesian approach is estimated using the probabilistic programming language
<a href="http://mc-stan.org/"><strong>Stan</strong></a> following the model described in the beginning,
i.e., we have completely uninformed priors. This is to make it as similar to the
Maximum Likelihood method as possible. The Maximum Likelihood method is
implemented using the <em>lm</em> function in <a href="https://www.R-project.org/"><strong>R</strong></a>.
Thus, in R we’re simply doing</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mylm <- lm(y~WDay1+WDay2+WDay3+WDay4+WDay5+WDay6+WDay7, data=ourdata)
</code></pre></div></div>
<p>meanwhile in Stan we’re doing the following, admittedly a bit more complicated, code.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="n">int</span><span class="p"><</span> <span class="n">lower</span> <span class="p">=</span> <span class="m">0</span> <span class="p">></span> <span class="n">N</span><span class="p">;</span> <span class="p">//</span> <span class="n">Number</span> <span class="k">of</span> <span class="n">data</span> <span class="n">points</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">y</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">response</span> <span class="n">variable</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">N</span><span class="p">,</span> <span class="m">7</span><span class="p">]</span> <span class="n">xweekday</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">weekdays</span> <span class="n">variables</span>
<span class="p">}</span>
<span class="k">parameters</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">b0</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">intercept</span>
<span class="n">vector</span><span class="p">[</span><span class="m">7</span><span class="p">]</span> <span class="n">bweekday</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">weekday</span> <span class="n">regression</span> <span class="k">parameters</span>
<span class="k">real</span><span class="p"><</span> <span class="n">lower</span> <span class="p">=</span> <span class="m">0</span> <span class="p">></span> <span class="n">sigma</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">standard</span> <span class="n">deviation</span>
<span class="p">}</span>
<span class="n">transformed</span> <span class="k">parameters</span> <span class="p">{</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">mu</span><span class="p">;</span> <span class="p">//</span> <span class="n">Declaration</span>
<span class="n">mu</span> <span class="p">=</span> <span class="n">b0</span> <span class="p">+</span> <span class="n">xweekday</span><span class="p">*</span><span class="n">bweekday</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">mean</span> <span class="n">prediction</span> <span class="n">each</span> <span class="n">timestep</span>
<span class="p">}</span>
<span class="k">model</span> <span class="p">{</span>
<span class="n">y</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">);</span> <span class="p">//</span> <span class="n">Likelihood</span>
<span class="p">}</span>
<span class="n">generated</span> <span class="n">quantities</span> <span class="p">{</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">yhat</span><span class="p">;</span>
<span class="n">yhat</span> <span class="p">=</span> <span class="n">b0</span> <span class="p">+</span> <span class="n">xweekday</span> <span class="p">*</span> <span class="n">bweekdayhat</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If you’re not in the mood to learn Stan you can achieve the same thing by using the <a href="https://github.com/paul-buerkner/brms"><strong>brms</strong></a> package in R and run the following code</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require(brms)
mybrms <- brm(bf(y~WDay1+WDay2+WDay3+WDay4+WDay5+WDay6+WDay7), data=ourdata, cores = 2, chains = 4)
</code></pre></div></div>
<p>which will write, compile and sample your model in Stan and return it to R.</p>
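<p>If you want to try this on your own data, the WDay1–WDay7 indicator columns are easy to build from a date index. Here’s a sketch in Python/pandas for illustration (the post’s <code>ourdata</code> is an R data frame, and which calendar day WDay1 corresponds to in the original data is an assumption here):</p>

```python
import pandas as pd

# Hypothetical daily series; the post's real data is milk sales.
idx = pd.date_range("2018-01-01", periods=28, freq="D")
df = pd.DataFrame({"y": range(28)}, index=idx)

# One indicator column per weekday, mirroring the WDay1..WDay7 layout.
# Assumption: WDay1 = Monday (pandas dayofweek == 0).
for i in range(7):
    df[f"WDay{i + 1}"] = (df.index.dayofweek == i).astype(int)

# Exactly one indicator fires per row.
assert (df[[f"WDay{i + 1}" for i in range(7)]].sum(axis=1) == 1).all()
```
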
<h1 id="results">Results</h1>
<p>Now to the dirty details of our calculations for the parameter estimates of the model. Throughout the results we will discuss the Bayesian estimation first and then the ML approach. This pertains to each plot and/or table. The first result we will have a look at is the estimates themselves. For the Bayesian estimates we have the average values and the uncertainty expressed as an estimation error. For the ML approach we have the estimates and a standard error. Have a look.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">Estimate</th>
<th style="text-align: right">Est.Error</th>
<th style="text-align: right">Estimate</th>
<th style="text-align: right">Std. Error</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Intercept</td>
<td style="text-align: right">75539</td>
<td style="text-align: right">450271</td>
<td style="text-align: right">8866</td>
<td style="text-align: right">83</td>
</tr>
<tr>
<td style="text-align: left">WDay1</td>
<td style="text-align: right">-67911</td>
<td style="text-align: right">450271</td>
<td style="text-align: right">-1231</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay2</td>
<td style="text-align: right">-68249</td>
<td style="text-align: right">450270</td>
<td style="text-align: right">-1571</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay3</td>
<td style="text-align: right">-68560</td>
<td style="text-align: right">450269</td>
<td style="text-align: right">-1882</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay4</td>
<td style="text-align: right">-69072</td>
<td style="text-align: right">450270</td>
<td style="text-align: right">-2396</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay5</td>
<td style="text-align: right">-69754</td>
<td style="text-align: right">450270</td>
<td style="text-align: right">-3076</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay6</td>
<td style="text-align: right">-69723</td>
<td style="text-align: right">450270</td>
<td style="text-align: right">-3045</td>
<td style="text-align: right">117</td>
</tr>
<tr>
<td style="text-align: left">WDay7</td>
<td style="text-align: right">-66678</td>
<td style="text-align: right">450270</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
</tr>
</tbody>
</table>
<p>If you’re looking at the table above, you might think “What the damn hell!? Bayesian statistics makes no sense at all! Why did we get these crazy estimates?” Look at the nice narrow <strong>confidence</strong> intervals on the right hand side of the table generated by the maximum likelihood estimates and compare them to the wide <strong>credibility</strong> intervals to the left. You might be forgiven for dismissing the results from the Bayesian approach, since the difference is quite subtle from a mathematical point of view. After all, we are computing the exact same mathematical model. The difference is our reasoning about the parameters. If you remember correctly, maximum likelihood views the parameters as fixed constants without any variation. The variation you see in maximum likelihood comes from the uncertainty about the data and not the parameters! This is important to remember. The “Std. Error” from the maximum likelihood estimate has nothing to do with uncertainty about the parameter values for the observed data set. Instead it’s uncertainty regarding what would happen to the estimates if we observed more data sets that look like ours. Remember from the section above that, statistically speaking, what ML does is maximize <script type="math/tex">p(y\vert \beta,X)</script>, which expresses likelihood over different <script type="math/tex">y</script>’s given an observed and fixed set of parameters <script type="math/tex">\beta</script> along with covariates <script type="math/tex">X</script>.</p>
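<p>One way to see where the huge Bayesian uncertainty comes from: with an intercept <em>and</em> all seven weekday dummies, the design matrix is rank deficient, because the dummy columns sum to the intercept column. Any constant can be shifted between the intercept and the weekday effects without changing the fit, which is also why <em>lm</em> reports NA for WDay7 (it drops the aliased column). A quick numerical check (a Python sketch; the <script type="math/tex">\beta</script> values are taken from the ML column of the table above):</p>

```python
import numpy as np

# Four weeks of daily data: intercept plus all 7 weekday indicators.
days = np.arange(28) % 7
dummies = np.eye(7)[days]                      # 28 x 7 one-hot weekday matrix
X = np.column_stack([np.ones(28), dummies])    # 28 x 8 design

# The 7 dummy columns sum to the intercept column, so one column is
# redundant: 8 columns, but only rank 7.
print(np.linalg.matrix_rank(X))  # prints 7, not 8

# Consequence: shifting the intercept by c and every weekday effect by -c
# yields the exact same fitted values, for any c.
beta = np.array([8866.0, -1231, -1571, -1882, -2396, -3076, -3045, 0])
c = 66678.0
beta_shifted = beta + np.concatenate([[c], -c * np.ones(7)])
assert np.allclose(X @ beta, X @ beta_shifted)
```

<p>The sampler happily explores this whole ridge of equally good parameter combinations; ML silently picks one point on it.</p>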
<p>But OK, maybe you think there’s something very fishy with this model since the estimates are so different. How could we possibly end up capturing the same time series? Well, rest assured that we can. Below you can see a scatter plot between the observed response <script type="math/tex">y</script> and the predicted <script type="math/tex">\hat{y}</script> for the Bayesian and ML estimation. Pretty similar, huh? We can also have a look at the average fitted values from the Bayesian estimation and the fitted values from the ML method. As you can see, they agree to a rather high degree.</p>
<p><img src="/images/figure/corrplots-1.png" alt="Plot over the agreement between the fitting of the two approaches. The lefthand side shows the fitted vs observed for the Bayesian and the ML. The right hand side shows a scatterplot of the fitted from both approaches." /></p>
<p>Graphs can be quite deceiving though, so let’s do our homework and quantify how good these models really are head to head.</p>
<h1 id="model-validation-and-sanity-checking">Model validation and sanity checking</h1>
<p>I’ll start by taking you through the standard measures of goodness within time series analysis. Specifically we have the following measures.</p>
<ul>
<li>Mean Absolute Error (MAE)</li>
<li>Mean Absolute Scaled Error (MASE)</li>
<li>Mean Absolute Percentage Error (MAPE)</li>
<li>Root Mean Square Error (RMSE)</li>
<li>Normalized Root Mean Square Error (NRMSE)</li>
<li>Coefficient of Variation Root Mean Square Error (CVRMSE)</li>
<li>Proportion of variance explained (R²)</li>
</ul>
<p>These are quantified in the table below and as you can see there’s virtually no difference between the two estimations. The reason for this is of course that they were built with the same open assumptions about which values are plausible. In fact, both estimation procedures accept almost anything that’s consistent with the data at hand.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">Bayes</th>
<th style="text-align: right">Freq</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">MAE</td>
<td style="text-align: right">803.19</td>
<td style="text-align: right">803.63</td>
</tr>
<tr>
<td style="text-align: left">MASE</td>
<td style="text-align: right">0.79</td>
<td style="text-align: right">0.79</td>
</tr>
<tr>
<td style="text-align: left">MAPE</td>
<td style="text-align: right">0.12</td>
<td style="text-align: right">0.12</td>
</tr>
<tr>
<td style="text-align: left">RMSE</td>
<td style="text-align: right">1117.51</td>
<td style="text-align: right">1117.01</td>
</tr>
<tr>
<td style="text-align: left">NRMSE</td>
<td style="text-align: right">0.10</td>
<td style="text-align: right">0.10</td>
</tr>
<tr>
<td style="text-align: left">CVRMSE</td>
<td style="text-align: right">0.16</td>
<td style="text-align: right">0.16</td>
</tr>
<tr>
<td style="text-align: left">R2</td>
<td style="text-align: right">0.45</td>
<td style="text-align: right">0.45</td>
</tr>
</tbody>
</table>
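<p>For reference, all of the measures in the table can be computed directly from the observed and fitted values. Below is a sketch in Python; note that conventions differ between packages, so here NRMSE is normalized by the range of <script type="math/tex">y</script>, CVRMSE by its mean, and MASE is scaled by the mean absolute one-step naive error, which may not exactly match the definitions used for the table. The sample values are made up.</p>

```python
import numpy as np

def ts_metrics(y, yhat):
    """In-sample error measures for a fitted time series model.

    Conventions assumed here: NRMSE is normalized by the range of y,
    CVRMSE by its mean, and MASE by the in-sample mean absolute
    one-step naive error. Other packages may define these differently.
    """
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    err = y - yhat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    return {
        "MAE": mae,
        "MASE": mae / np.mean(np.abs(np.diff(y))),
        "MAPE": np.mean(np.abs(err / y)),
        "RMSE": rmse,
        "NRMSE": rmse / (y.max() - y.min()),
        "CVRMSE": rmse / y.mean(),
        "R2": 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2),
    }

# Made-up observed and fitted values, purely to exercise the helper.
m = ts_metrics([4331, 6348, 5804, 6897, 8428, 6725],
               [4600, 6100, 5900, 6700, 8300, 6900])
```
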
<p>So again it seems like there’s not much differentiating these models from one another. That is true when looking at the average fitted values from the two estimates. However, there’s a massive difference in the <strong>interpretation</strong> of the model. What do I mean by that, you might ask yourself, and it’s a good question, because if the fit is apparently more or less the same we should be able to pick either of the methods, right? Wrong! Remember what I said about sampling being important as it unveils structure in the parameter space that is otherwise hidden by the ML approach. In the illustration below you can see the posterior density of each <script type="math/tex">\beta</script> for the weekday effects. Here it’s clear that they can take many different values which end up in equally good models. This is the reason why our uncertainty is huge in the Bayesian estimation. There really are a lot of plausible parameter values that could be assumed by the model. Also present in the illustration is the ML estimate, indicated by a dark vertical line.</p>
<p><img src="/images/figure/posteriorsbeta-1.png" alt="plot of chunk posteriorsbeta" /></p>
<p>If you look closely there are at least two or three major peaks in the densities, which denote the highest probability for those parameters (in this plot we have four different MCMC chains for each parameter), so why on earth is ML so crazy sure about the parameter values? If you’ve read this far you already know the answer: the error/uncertainty expressed by the ML approach has <em>nothing</em> to do with the uncertainty of the parameters. It’s purely an uncertainty about the data. As such there is no probabilistic interpretation of the parameters under the ML methodology. They are considered fixed constants. It’s the data that’s considered to be random.</p>
<p>There is one more important check that we need to do, and that’s a posterior predictive check, just to make sure that we are not biased too much in our estimation. Again, inspecting the density and cumulative distribution function below, we can see that we’re doing quite OK given that we only have day of week as covariates in our model.</p>
<p><img src="/images/figure/posteriorpredict-1.png" alt="plot of chunk posteriorpredict" /></p>
<h2 id="diving-into-the-intercept">Diving into the intercept</h2>
<p>As you saw previously, there’s way more support for different values of our parameters than the ML method shows us. To further visualize this we’ll take a look at the samples for the intercept <script type="math/tex">\beta_0</script>, chain by chain, using violin plots. They show the distribution on the y axis and the chain id on the x axis. As before, the ML estimate is indicated by a black horizontal line. You can see that the ML approach only agrees with the expected value of chain number 2. The other support is completely ignored and not exposed to the user.</p>
<p><img src="/images/figure/interceptviolin-1.png" alt="plot of chunk interceptviolin" /></p>
<p>Why is this an issue, one might wonder, and the answer is that there is no guarantee that chain number two is the one that best represents the physical reality we’re trying to model. The purpose of any model is (or at least should be) to understand the underlying physical reality that we’re interested in. As such, the company selling the milk that we just modeled might ask: what are my base sales each day? We know that we can answer this, because that is what we’re capturing using the intercept in our model. Let’s answer these questions based on our estimations.</p>
<p><strong>Mrs. Manager</strong>: “So Miss Data Scientist, what’s our base sales?”</p>
<p><strong>Miss Data Scientist</strong>: “Well I have two answers for you. I will answer it using two uninformed approaches; an ML approach and a Bayesian approach. Here goes.”</p>
<ol>
<li>Bayesian answer: Your base sales are 75,539, which never happens, and depending on the day this is reduced by around 68,563.8, yielding average Saturday sales of 8,861.</li>
<li>Maximum likelihood answer: Your base sales are 8,866, which happens on an average Saturday. All other days this is reduced by an average of 2,200.</li>
</ol>
<p>The summaries you can see in this table.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Weekday</th>
<th style="text-align: right">AvgSalesBayes</th>
<th style="text-align: right">AvgSalesFreq</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Sun</td>
<td style="text-align: right">7,628</td>
<td style="text-align: right">7,635</td>
</tr>
<tr>
<td style="text-align: left">Mon</td>
<td style="text-align: right">7,289</td>
<td style="text-align: right">7,295</td>
</tr>
<tr>
<td style="text-align: left">Tue</td>
<td style="text-align: right">6,979</td>
<td style="text-align: right">6,984</td>
</tr>
<tr>
<td style="text-align: left">Wed</td>
<td style="text-align: right">6,466</td>
<td style="text-align: right">6,470</td>
</tr>
<tr>
<td style="text-align: left">Thu</td>
<td style="text-align: right">5,785</td>
<td style="text-align: right">5,790</td>
</tr>
<tr>
<td style="text-align: left">Fri</td>
<td style="text-align: right">5,816</td>
<td style="text-align: right">5,821</td>
</tr>
<tr>
<td style="text-align: left">Sat</td>
<td style="text-align: right">8,861</td>
<td style="text-align: right">8,866</td>
</tr>
</tbody>
</table>
<p><strong>Mrs. Manager</strong>: “That doesn’t make sense to me at all. Just pick the best performing model”</p>
<p><strong>Miss Data Scientist</strong>: “They’re both equally good performance wise.”</p>
<p><strong>Mrs. Manager</strong>: “I don’t like this at all!”</p>
<p><strong>Miss Data Scientist</strong>: “Me neither.”</p>
<h1 id="what-you-should-do">What you should do</h1>
<p>So now that we have established that the Bayesian approach is necessary and useful, the question still remains how to fix the estimation. We will do two things to improve upon it:</p>
<ol>
<li>Set up informed priors for our beliefs about the plausibility of the parameters</li>
<li>Save the sampler some time by setting a baseline for the weekdays</li>
</ol>
<p>Basically we will modify the model like this</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, \sigma)\\
\mu_t &=\sum_{i=1}^{7}\beta_{i} x_{t,i} + \beta_0\\
\beta_0 &\sim N(\mu_y^{emp}, \sigma_y^{emp})\\
\beta_i &\sim N(0, \sigma_y^{emp})\quad \forall i\in\{1,\dots,7\}\\
\sigma &\sim U(0.01, \infty)
\end{align} %]]></script>
<p>where <script type="math/tex">\mu_y^{emp}</script> and <script type="math/tex">\sigma_y^{emp}</script> are the empirical mean and standard deviation of the response variable respectively. This is a nice practical hack since it makes sure that your priors are in the vicinity of the response you’re trying to model. The resulting code is given below. You can try it on your own daily time series. It’s quite plug and play.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="n">int</span><span class="p"><</span> <span class="n">lower</span> <span class="p">=</span> <span class="m">0</span> <span class="p">></span> <span class="n">N</span><span class="p">;</span> <span class="p">//</span> <span class="n">Number</span> <span class="k">of</span> <span class="n">data</span> <span class="n">points</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">y</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">response</span> <span class="n">variable</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">N</span><span class="p">,</span> <span class="m">7</span><span class="p">]</span> <span class="n">xweekday</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">weekdays</span> <span class="n">variables</span>
<span class="p">}</span>
<span class="k">parameters</span> <span class="p">{</span>
<span class="k">real</span><span class="p"><</span> <span class="n">lower</span> <span class="p">=</span> <span class="m">0.01</span> <span class="p">></span> <span class="n">b0</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">intercept</span>
<span class="n">vector</span><span class="p">[</span><span class="m">7</span> <span class="p">-</span> <span class="m">1</span><span class="p">]</span> <span class="n">bweekday</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">weekday</span> <span class="n">regression</span> <span class="k">parameters</span>
<span class="k">real</span><span class="p"><</span> <span class="n">lower</span> <span class="p">=</span> <span class="m">0</span> <span class="p">></span> <span class="n">sigma</span><span class="p">;</span> <span class="p">//</span> <span class="n">The</span> <span class="n">standard</span> <span class="n">deviation</span>
<span class="p">}</span>
<span class="n">transformed</span> <span class="k">parameters</span> <span class="p">{</span>
<span class="p">//</span> <span class="k">Declarations</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">mu</span><span class="p">;</span>
<span class="n">vector</span><span class="p">[</span><span class="m">7</span><span class="p">]</span> <span class="n">bweekdayhat</span><span class="p">;</span>
<span class="p">//</span> <span class="n">The</span> <span class="n">weekday</span> <span class="n">part</span>
<span class="n">bweekdayhat</span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="n">for</span> <span class="p">(</span><span class="n">i</span> <span class="k">in</span> <span class="m">1</span><span class="p">:(</span><span class="m">7</span> <span class="p">-</span> <span class="m">1</span><span class="p">)</span> <span class="p">)</span> <span class="n">bweekdayhat</span><span class="p">[</span><span class="n">i</span> <span class="p">+</span> <span class="m">1</span><span class="p">]</span> <span class="p">=</span> <span class="n">bweekday</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">//</span> <span class="n">The</span> <span class="n">mean</span> <span class="n">prediction</span> <span class="n">each</span> <span class="n">timestep</span>
<span class="n">mu</span> <span class="p">=</span> <span class="n">b0</span> <span class="p">+</span> <span class="n">xweekday</span><span class="p">*</span><span class="n">bweekdayhat</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">model</span> <span class="p">{</span>
<span class="p">//</span> <span class="n">Priors</span>
<span class="n">b0</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">sd</span><span class="p">(</span><span class="n">y</span><span class="p">));</span>
<span class="n">bweekday</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">sd</span><span class="p">(</span><span class="n">y</span><span class="p">));</span>
<span class="p">//</span> <span class="n">Likelihood</span>
<span class="n">y</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">generated</span> <span class="n">quantities</span> <span class="p">{</span>
<span class="n">vector</span><span class="p">[</span><span class="n">N</span><span class="p">]</span> <span class="n">yhat</span><span class="p">;</span>
<span class="n">yhat</span> <span class="p">=</span> <span class="n">b0</span> <span class="p">+</span> <span class="n">xweekday</span> <span class="p">*</span> <span class="n">bweekdayhat</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
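<p>The baseline trick above is the same thing as dropping one dummy column from the design matrix, which restores full column rank and makes the intercept identifiable. A quick check of that claim (a Python sketch, assuming the same one-hot weekday layout as before):</p>

```python
import numpy as np

days = np.arange(28) % 7
dummies = np.eye(7)[days]  # 28 x 7 one-hot weekday matrix

# Full design with the intercept plus all 7 dummies: rank deficient.
X_full = np.column_stack([np.ones(28), dummies])
# Reference coding, as in the Stan model above: pin weekday 1 to zero
# by dropping its column while keeping the intercept.
X_ref = np.column_stack([np.ones(28), dummies[:, 1:]])

print(np.linalg.matrix_rank(X_full), X_full.shape[1])  # prints 7 8
print(np.linalg.matrix_rank(X_ref), X_ref.shape[1])    # prints 7 7
```

<p>With full column rank there is no longer a ridge of equivalent solutions for the sampler to wander along, which is why the posteriors become unimodal.</p>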
<p>Now let’s have a look at this model instead. A quick look into these parameters shows that we have nice clean unimodal posteriors due to our prior beliefs being applied to the analysis. The same table as shown previously is now repeated below with the results for the new estimation appended on the rightmost side. For clarification we name the new columns Estimate and SD.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">Estimate</th>
<th style="text-align: right">Est.Error</th>
<th style="text-align: right">Estimate</th>
<th style="text-align: right">Std. Error</th>
<th style="text-align: right">Estimate</th>
<th style="text-align: right">SD</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Intercept</td>
<td style="text-align: right">75,539</td>
<td style="text-align: right">450,271</td>
<td style="text-align: right">8,866</td>
<td style="text-align: right">83</td>
<td style="text-align: right">7,615</td>
<td style="text-align: right">89</td>
</tr>
<tr>
<td style="text-align: left">WDay1</td>
<td style="text-align: right">-67,911</td>
<td style="text-align: right">450,271</td>
<td style="text-align: right">-1,231</td>
<td style="text-align: right">117</td>
<td style="text-align: right">0</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td style="text-align: left">WDay2</td>
<td style="text-align: right">-68,249</td>
<td style="text-align: right">450,270</td>
<td style="text-align: right">-1,571</td>
<td style="text-align: right">117</td>
<td style="text-align: right">-324</td>
<td style="text-align: right">122</td>
</tr>
<tr>
<td style="text-align: left">WDay3</td>
<td style="text-align: right">-68,560</td>
<td style="text-align: right">450,269</td>
<td style="text-align: right">-1,882</td>
<td style="text-align: right">117</td>
<td style="text-align: right">-624</td>
<td style="text-align: right">118</td>
</tr>
<tr>
<td style="text-align: left">WDay4</td>
<td style="text-align: right">-69,072</td>
<td style="text-align: right">450,270</td>
<td style="text-align: right">-2,396</td>
<td style="text-align: right">117</td>
<td style="text-align: right">-1,141</td>
<td style="text-align: right">123</td>
</tr>
<tr>
<td style="text-align: left">WDay5</td>
<td style="text-align: right">-69,754</td>
<td style="text-align: right">450,270</td>
<td style="text-align: right">-3,076</td>
<td style="text-align: right">117</td>
<td style="text-align: right">-1,819</td>
<td style="text-align: right">119</td>
</tr>
<tr>
<td style="text-align: left">WDay6</td>
<td style="text-align: right">-69,723</td>
<td style="text-align: right">450,270</td>
<td style="text-align: right">-3,045</td>
<td style="text-align: right">117</td>
<td style="text-align: right">-1,788</td>
<td style="text-align: right">124</td>
</tr>
<tr>
<td style="text-align: left">WDay7</td>
<td style="text-align: right">-66,678</td>
<td style="text-align: right">450,270</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">NA</td>
<td style="text-align: right">1,249</td>
<td style="text-align: right">122</td>
</tr>
</tbody>
</table>
<p>As you can see these estimates are quite different and, to the naked eye, they make more sense given what we know about the data set and what we can expect from intra-weekly effects. We can check these estimates further by inspecting the posteriors. Note here “bweekdayhat[1]”, which is a delta distribution at 0. This serves as our baseline for the intra-week effect that we’re capturing. The x-axis in the plot shows the estimated <script type="math/tex">\beta_i</script>’s and the y-axis for each parameter is the posterior probability density.</p>
<p><img src="/images/figure/intervalplots-1.png" alt="plot of chunk intervalplots" /></p>
<p>So from a model estimation standpoint we should be pretty happy now. But how does this new estimation compare to the others? Below I will repeat the model performance table from earlier and extend it with our new “Bayes2” estimation.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">Bayes</th>
<th style="text-align: right">Freq</th>
<th style="text-align: right">Bayes2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">MAE</td>
<td style="text-align: right">803.19</td>
<td style="text-align: right">803.63</td>
<td style="text-align: right">803.21</td>
</tr>
<tr>
<td style="text-align: left">MASE</td>
<td style="text-align: right">0.79</td>
<td style="text-align: right">0.79</td>
<td style="text-align: right">0.79</td>
</tr>
<tr>
<td style="text-align: left">MAPE</td>
<td style="text-align: right">0.12</td>
<td style="text-align: right">0.12</td>
<td style="text-align: right">0.12</td>
</tr>
<tr>
<td style="text-align: left">RMSE</td>
<td style="text-align: right">1117.51</td>
<td style="text-align: right">1117.01</td>
<td style="text-align: right">1117.05</td>
</tr>
<tr>
<td style="text-align: left">NRMSE</td>
<td style="text-align: right">0.10</td>
<td style="text-align: right">0.10</td>
<td style="text-align: right">0.10</td>
</tr>
<tr>
<td style="text-align: left">CVRMSE</td>
<td style="text-align: right">0.16</td>
<td style="text-align: right">0.16</td>
<td style="text-align: right">0.16</td>
</tr>
<tr>
<td style="text-align: left">R2</td>
<td style="text-align: right">0.45</td>
<td style="text-align: right">0.45</td>
<td style="text-align: right">0.45</td>
</tr>
</tbody>
</table>
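<p>For reference, these are all standard error metrics that are easy to compute yourself. Below is a minimal Python sketch (my own helper, not code from the original analysis; definitions of NRMSE and MASE vary in the literature, so I use range-normalisation and the in-sample naive lag-1 forecast respectively) showing how numbers like those in the table are produced from observed and predicted values.</p>

```python
import numpy as np

def forecast_metrics(y, yhat):
    """Compute the error metrics reported in the comparison table.

    y, yhat: 1-D arrays of observed and predicted values.
    """
    err = y - yhat
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err) / np.abs(y))      # assumes y != 0
    rmse = np.sqrt(np.mean(err ** 2))
    nrmse = rmse / (y.max() - y.min())           # normalised by the range
    cvrmse = rmse / y.mean()                     # coefficient of variation
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    # MASE scales MAE by the in-sample naive (lag-1) forecast error
    mase = mae / np.mean(np.abs(np.diff(y)))
    return {"MAE": mae, "MASE": mase, "MAPE": mape,
            "RMSE": rmse, "NRMSE": nrmse, "CVRMSE": cvrmse, "R2": r2}

# Tiny usage example with made-up numbers
m = forecast_metrics(np.array([10.0, 12.0, 9.0, 11.0]),
                     np.array([9.5, 12.5, 9.0, 10.0]))
```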
<p>It’s evident that our new way of estimating the parameters not only yields a more satisfying modeling approach but also provides us with a more actionable model, without any reduction in performance. I’d call that a win-win. Basically this means that our data scientist can go back with confidence and approach the manager again with robust findings and knowledge about the space of plausible parameters!</p>
<h1 id="summary-and-finishing-remarks">Summary and finishing remarks</h1>
<p>Today we looked at Bayesian analysis applied to a real world problem. We saw the dangers of applying the maximum likelihood method blindly. Moreover, we saw that the Bayesian formalism forces you to make your assumptions explicit. If you don’t, it will show you all possibilities that the mathematical model supports given the data set. This is important to remember and it is NOT a problem with the Bayesian analysis; it is a feature! So if I can leave you with some recommendations and guidelines when dealing with models, I would say this:</p>
<ul>
<li>There’s nothing wrong with experimenting with ML methods for speedy development of prototype models, but whenever you are going to quantify your trust in your model you have to, and I mean <strong>have</strong> to, sample it and treat it in a proper probabilistic, i.e., Bayesian, formalism.</li>
<li>Always make your assumptions and beliefs explicit in your final model. This will help not only you but fellow modelers who might use your model moving forward.</li>
<li>Learn to understand the difference between Maximum Likelihood and Sampling the posterior probability distribution of your parameters. It might be hard at first but it will be worth it in the end.</li>
<li>Accept that there is no such thing as an analysis without assumptions! When you’re doing linear regression using Maximum Likelihood you are effectively assuming that any parameter value between minus infinity and plus infinity is equally likely, and that is nonsense my friend.</li>
</ul>
<p>Happy inferencing!</p>Dr. Michael GreenToday we will run through an important concept in statistical learning theory and modeling in general. It may come as no surprise that my point is as usual “age quod agis”. This is a lifelong strive for me to convey that message to fellow scientists and business people alike. Anyway, back to the topic. We will have a look at why the Bayesian treatment of models is fundamentally important to everyone and not only a select few mathematically inclined experts. The model we will use for this post is a time series model describing Milk sales over time. The model specification is which is a standard linear model. The is the observed Milk sales units at time and the is the indicator variable for weekday at time . As per usual serves as our intercept. A small sample of the data set looks like thisBuilding and testing a simple deep learning object detection application2017-07-15T00:00:00+00:002017-07-15T00:00:00+00:00/2017/07/15/A-simple-object-detection-app<p>Deep learning is hot currently. Really hot. The reason for this is that there’s
more data available than ever in the space of perception. By perception I mean
tasks such as object recognition in images, natural language processing, speech
detection etc. Basically anything where we generate copious amounts of data
every day. Many companies are putting this data to good use. Google, Facebook,
Nvidia, Amazon etc. are all heavy in this space since they have access to most
of these data. We as users happily give it to them through all our social media
posts and online storage utilization. In any case I wanted to give you all a
flavor of what you can do with a relatively small convolutional neural network
trained to detect many different objects in an image. Specifically we will use a
network architecture known as MobileNet which is meant to run on smaller
devices. Luckily for us, pre-trained models are available for download, so we can
test one out. That’s exactly what we’ll do today.</p>
<h2 id="the-data-foundation">The data foundation</h2>
<p>The model has been trained on the <a href="http://mscoco.org/">COCO</a> dataset (Common
Objects in Context). As the name suggests, this dataset contains a lot of
images of objects we see quite often in everyday life. Specifically it
consists of 300k images covering 90 different object categories, such as
<ul>
<li>Fruit</li>
<li>Vehicles</li>
<li>People</li>
<li>Etc.</li>
</ul>
<h2 id="the-model">The model</h2>
<p>The model we are using is the “Single Shot Multibox Detector (SSD) with
MobileNet” located
<a href="http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017.tar.gz">here</a>
and takes a little while to download. It’s stored using Google’s Protocol
Buffers format. I know what you’re thinking: Why oh why invent yet another
structured storage format? In Google’s defense this one is pretty cool.
Basically it’s a language-neutral, platform-neutral, extensible mechanism for
serializing structured data – think XML, but smaller, faster, and simpler. You
define how you want your data to be structured once, then you can use special
generated source code to easily write and read your structured data to and from
a variety of data streams and using a variety of languages.</p>
<h2 id="convolutional-neural-networks">Convolutional neural networks</h2>
<p><img src="/images/figure/objdetect/Screen-Shot-2015-11-07-at-7.26.20-AM.png" alt="Convolutional neural networks" /></p>
<p>Before we have a look at the results let’s give a quick introduction to what convolutional neural networks really are and why they are more successful at image analysis than normal multi-layered perceptrons. The whole idea behind using convolutional neural networks is that we need the network to be translation invariant. This just means that we need to be able to recognize an object no matter where in the image it resides. One way to achieve this is to swipe a patch reacting to certain patterns over the image. Think of it as a filter that lights up when it detects something. Of course we don’t know in advance what we are looking for, and therefore these filters are learned during training. In general in deep learning the initial layers learn lower level features while the later layers capture more elaborate ones. This is pretty cool as we can save the early layers of a trained network and reuse them in other models.</p>
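<p>To make the filter intuition concrete, here is a toy sketch in NumPy (illustrative only; this has nothing to do with the actual MobileNet implementation). A hand-picked 2x2 vertical-edge filter is swiped over a tiny image and “lights up” exactly where the pattern occurs, no matter where it sits. In a real network the filter values would be learned during training rather than chosen by hand.</p>

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a filter over the image and record its response at every
    position ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector responds wherever the image jumps from
# dark (0) to bright (1) going left to right
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1, 1],
                        [-1, 1]], dtype=float)
response = conv2d_valid(image, edge_filter)
# The middle column of the response map lights up: the edge sits there
```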
<h2 id="results">Results</h2>
<p>The video below was shot from my cell phone while the modeling team was working. As you can see the model does indeed identify some objects quite successfully, but it also fails to detect many of them. There are several reasons for this; one of them is that this model is optimized for speed rather than accuracy.</p>
<p><img src="/images/figure/objdetect/bw7modelers.gif" alt="continuous object detection" /></p>
<p>In this short post I showed you how to utilize a previously trained convolutional neural network to detect objects in a pre-recorded video. However, this can be extended to live object detection from a webcam or a surveillance camera. If you’re interested in doing this yourself have a look at Tensorflow’s example <a href="https://github.com/tensorflow/models/blob/master/object_detection/object_detection_tutorial.ipynb">here</a>.</p>
<p>Happy hacking!</p>Dr. Michael GreenAbout identifiability and granularity2017-05-06T00:00:00+00:002017-05-06T00:00:00+00:00/2017/05/06/About-identifiability-and-granularity<p>In time series modeling you typically run into issues concerning complexity
versus utility. What I mean by that is that there may be questions you need the
answer to but are afraid of the model complexity that comes along with it. This
fear of complexity is something that relates to identifiability and the curse of
dimensionality. Fortunately for us probabilistic programming can handle these
things neatly. In this post we’re going to look at a problem where we have a
choice between a granular model and an aggregated one. We need to use a proper
probabilistic model that we will sample in order to get the posterior
information we are looking for.</p>
<h2 id="the-generating-model">The generating model</h2>
<p>In order to do this exercise we need to know what we’re doing and as such we
will generate the data we need by simulating a stochastic process. I’m not a big
fan of this since simulated data will always be, well simulated, and as such not
very realistic. Data in our real world is not random, people. This is worth
remembering, but as the clients I work with on a daily basis are not inclined to
share their precious data, and academic data sets are pointless since they are
almost exclusively too nice to represent any real challenge, I resort to
simulated data. It’s enough to make my point. So without further ado I give you
the generating model.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, 7)\\
\mu_t &= x_t + 7 z_t\\
x_t &\sim N(3, 1)\\
z_t &\sim N(1, 1)
\end{align} %]]></script>
<p>which is basically a linear model with gaussian covariates and gaussian noise. That represents the ground
truth. The time series generated looks like this</p>
<p><img src="/images/figure/problemplot12-1.png" alt="plot of chunk problemplot12" /></p>
<p>where time is on the x axis and the response variable on the y axis. The first
few lines of the generated data are presented below.</p>
<table>
<thead>
<tr>
<th style="text-align: right">t</th>
<th style="text-align: right">y</th>
<th style="text-align: right">x</th>
<th style="text-align: right">z</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">0</td>
<td style="text-align: right">20.411003</td>
<td style="text-align: right">2.314330</td>
<td style="text-align: right">1.0381077</td>
</tr>
<tr>
<td style="text-align: right">1</td>
<td style="text-align: right">22.174020</td>
<td style="text-align: right">2.512780</td>
<td style="text-align: right">1.5292838</td>
</tr>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: right">-5.035160</td>
<td style="text-align: right">2.048367</td>
<td style="text-align: right">-0.1099282</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: right">1.580412</td>
<td style="text-align: right">1.627389</td>
<td style="text-align: right">1.2106257</td>
</tr>
<tr>
<td style="text-align: right">4</td>
<td style="text-align: right">-5.391217</td>
<td style="text-align: right">4.924959</td>
<td style="text-align: right">-0.4488093</td>
</tr>
<tr>
<td style="text-align: right">5</td>
<td style="text-align: right">-1.360732</td>
<td style="text-align: right">3.237641</td>
<td style="text-align: right">-0.1645335</td>
</tr>
</tbody>
</table>
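<p>If you want to play along at home, the generating process above is only a few lines of NumPy (the seed and series length are my own choices, not from the original post):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1000  # length of the simulated series (arbitrary choice)

# The generating model: x_t ~ N(3, 1), z_t ~ N(1, 1),
# y_t ~ N(x_t + 7 * z_t, 7)
x = rng.normal(3.0, 1.0, size=T)
z = rng.normal(1.0, 1.0, size=T)
y = rng.normal(x + 7.0 * z, 7.0)

# Sanity check: E[y] = E[x] + 7 * E[z] = 3 + 7 = 10
print(round(y.mean(), 2))
```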
<p>So it’s apparent that we have three variables in this data set; the response
variable <script type="math/tex">y</script>, and the covariates <script type="math/tex">x</script> and <script type="math/tex">z</script> (<script type="math/tex">t</script> is just an indicator
of a fake time). So the real model is just a linear model of the two variables.
Now say that instead we want to go about solving this problem and we have two
individuals arguing about the best solution. Let’s call them Mr. Granularity and
Mr. Aggregation. Now Mr. Granularity is a fickle bastard as he always wants to
split things into more fine-grained buckets. Mr. Aggregation on the other hand
is more kissable by nature. By that I’m referring to the Occam’s razor version of
kissable, meaning “Keep It Simple Sir” (KISS).</p>
<p>This means that Mr. Granularity wants to estimate a parameter for each of the two variables while Mr. Aggregation wants to estimate one parameter for the sum of <script type="math/tex">x</script> and <script type="math/tex">z</script>.</p>
<h2 id="mr-granularitys-solution">Mr. Granularity’s solution</h2>
<p>So let’s start out with the more complex solution. Mathematically Mr. Granularity defines the probabilistic model like this</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, \sigma)\\
\mu_t &=\beta_x x_t + \beta_z z_t + \beta_0\\
\beta_x &\sim N(0, 5)\\
\beta_z &\sim N(0, 5)\\
\beta_0 &\sim N(0, 5)\\
\sigma &\sim U(0.01, \infty)
\end{align} %]]></script>
<p>which is implemented in Stan code below. There’s nothing funky or noteworthy going on here. Just a simple linear model.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="n">int</span> <span class="n">N</span><span class="p">;</span>
<span class="k">real</span> <span class="n">x</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">z</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">y</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">parameters</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">b0</span><span class="p">;</span>
<span class="k">real</span> <span class="n">bx</span><span class="p">;</span>
<span class="k">real</span> <span class="n">bz</span><span class="p">;</span>
<span class="k">real</span><span class="p"><</span><span class="n">lower</span><span class="p">=</span><span class="m">0</span><span class="p">></span> <span class="n">sigma</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">model</span> <span class="p">{</span>
<span class="n">b0</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">);</span>
<span class="n">bx</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">);</span>
<span class="n">bz</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">);</span>
<span class="n">for</span><span class="p">(</span><span class="n">n</span> <span class="k">in</span> <span class="m">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span>
<span class="n">y</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">bx</span><span class="p">*</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]+</span><span class="n">bz</span><span class="p">*</span><span class="n">z</span><span class="p">[</span><span class="n">n</span><span class="p">]+</span><span class="n">b0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">generated</span> <span class="n">quantities</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">y_pred</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="n">for</span> <span class="p">(</span><span class="n">n</span> <span class="k">in</span> <span class="m">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span>
<span class="n">y_pred</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="p">=</span> <span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]*</span><span class="n">bx</span><span class="p">+</span><span class="n">z</span><span class="p">[</span><span class="n">n</span><span class="p">]*</span><span class="n">bz</span><span class="p">+</span><span class="n">b0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="mr-aggregations-solution">Mr. Aggregation’s solution</h2>
<p>So remember that Mr. Aggregation was concerned about over-fitting and didn’t
want to split things up into the most granular pieces. As such, in his solution,
we will add the two variables <script type="math/tex">x</script> and <script type="math/tex">z</script> together and quantify them as if
they were one. The resulting model is given below followed by the implementation
in Stan.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, \sigma)\\
\mu_t &=\beta_r (x_t + z_t) + \beta_0\\
\beta_r &\sim N(0, 5)\\
\beta_0 &\sim N(0, 5)\\
\sigma &\sim U(0.01, \infty)
\end{align} %]]></script>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="n">int</span> <span class="n">N</span><span class="p">;</span>
<span class="k">real</span> <span class="n">x</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">z</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="k">real</span> <span class="n">y</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">parameters</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">b0</span><span class="p">;</span>
<span class="k">real</span> <span class="n">br</span><span class="p">;</span>
<span class="k">real</span><span class="p"><</span><span class="n">lower</span><span class="p">=</span><span class="m">0</span><span class="p">></span> <span class="n">sigma</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">model</span> <span class="p">{</span>
<span class="n">b0</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">);</span>
<span class="n">br</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">);</span>
<span class="n">for</span><span class="p">(</span><span class="n">n</span> <span class="k">in</span> <span class="m">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span>
<span class="n">y</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="p">~</span> <span class="n">normal</span><span class="p">(</span><span class="n">br</span><span class="p">*(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]+</span><span class="n">z</span><span class="p">[</span><span class="n">n</span><span class="p">])+</span><span class="n">b0</span><span class="p">,</span> <span class="n">sigma</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">generated</span> <span class="n">quantities</span> <span class="p">{</span>
<span class="k">real</span> <span class="n">y_pred</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
<span class="n">for</span> <span class="p">(</span><span class="n">n</span> <span class="k">in</span> <span class="m">1</span><span class="p">:</span><span class="n">N</span><span class="p">)</span>
<span class="n">y_pred</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="p">=</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]+</span><span class="n">z</span><span class="p">[</span><span class="n">n</span><span class="p">])*</span><span class="n">br</span><span class="p">+</span><span class="n">b0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="analysis">Analysis</h2>
<p>Now let’s have a look at the different solutions and what we end up with. This
problem was intentionally made noisy to confuse even the granular approach as much as
possible. We’ll start by inspecting the posteriors for the parameters of
interest. They’re shown below in these caterpillar plots where the parameters
are on the y-axis and the posterior density is given on the x-axis.</p>
<p><img src="/images/figure/distributions-1.png" alt="plot of chunk distributions" /></p>
<p>It is clear that the only direct comparison we can make is the intercept <script type="math/tex">b_0</script> from both models. Now if you remember, the generating function doesn’t contain an intercept. It’s <script type="math/tex">0</script>. Visually inspecting the graphs above will show you that something bad is happening to both models. Let’s put some numbers on this, shall we? The tables below will illuminate the situation.</p>
<h3 id="parameter-distributions---granular-model">Parameter distributions - Granular model</h3>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">mean</th>
<th style="text-align: right">2.5%</th>
<th style="text-align: right">25%</th>
<th style="text-align: right">50%</th>
<th style="text-align: right">75%</th>
<th style="text-align: right">97.5%</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">b0</td>
<td style="text-align: right">-0.88</td>
<td style="text-align: right">-4.86</td>
<td style="text-align: right">-2.21</td>
<td style="text-align: right">-0.89</td>
<td style="text-align: right">0.49</td>
<td style="text-align: right">3.14</td>
</tr>
<tr>
<td style="text-align: left">bx</td>
<td style="text-align: right">0.88</td>
<td style="text-align: right">-0.41</td>
<td style="text-align: right">0.44</td>
<td style="text-align: right">0.87</td>
<td style="text-align: right">1.32</td>
<td style="text-align: right">2.15</td>
</tr>
<tr>
<td style="text-align: left">bz</td>
<td style="text-align: right">7.21</td>
<td style="text-align: right">5.89</td>
<td style="text-align: right">6.77</td>
<td style="text-align: right">7.23</td>
<td style="text-align: right">7.66</td>
<td style="text-align: right">8.49</td>
</tr>
</tbody>
</table>
<p>Mr. Granularity has indeed identified a <em>possible</em> intercept with the current model. The mean value of the posterior is -0.88, but as you can see 33% of the probability mass lies above <script type="math/tex">0</script>, so the model is far from certain that the intercept is real. The model is considerably more confident that <script type="math/tex">\beta_x</script> and <script type="math/tex">\beta_z</script> are real, given that 91% and 100% of their masses respectively lie above <script type="math/tex">0</script>. The absolute errors of the model’s estimates are -0.12 and 0.21 for <script type="math/tex">\beta_x</script> and <script type="math/tex">\beta_z</script> respectively.</p>
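<p>The probability masses quoted here are simply the fraction of posterior draws that fall above zero. As a sketch, using synthetic draws that mimic the reported <script type="math/tex">b_0</script> posterior (the spread is my own guess, chosen to roughly reproduce the quoted 33%; these are not the actual Stan samples):</p>

```python
import numpy as np

def prob_positive(samples):
    """Fraction of posterior draws that lie above zero."""
    return float(np.mean(np.asarray(samples) > 0.0))

# Synthetic stand-in for the b0 posterior: mean -0.88, sd 2.0 (assumed)
rng = np.random.default_rng(1)
b0_draws = rng.normal(-0.88, 2.0, size=4000)
p = prob_positive(b0_draws)  # roughly a third of the mass is above 0
```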
<h3 id="parameter-distributions---aggregated-model">Parameter distributions - Aggregated model</h3>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: right">mean</th>
<th style="text-align: right">2.5%</th>
<th style="text-align: right">25%</th>
<th style="text-align: right">50%</th>
<th style="text-align: right">75%</th>
<th style="text-align: right">97.5%</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">b0</td>
<td style="text-align: right">-6.11</td>
<td style="text-align: right">-10.19</td>
<td style="text-align: right">-7.47</td>
<td style="text-align: right">-6.14</td>
<td style="text-align: right">-4.77</td>
<td style="text-align: right">-1.99</td>
</tr>
<tr>
<td style="text-align: left">br</td>
<td style="text-align: right">3.89</td>
<td style="text-align: right">2.90</td>
<td style="text-align: right">3.56</td>
<td style="text-align: right">3.89</td>
<td style="text-align: right">4.22</td>
<td style="text-align: right">4.89</td>
</tr>
</tbody>
</table>
<p>Mr. Aggregation has also identified a <em>possible</em> intercept with the current model. The mean value of the posterior is -6.11 and, as you can see, 0% of the probability mass lies above <script type="math/tex">0</script>, so the model is very confident that there is a (negative) intercept. It expresses the same certainty about <script type="math/tex">\beta_r</script> being real, given that 100% of its mass lies above <script type="math/tex">0</script>. The absolute errors of the model’s estimate are 2.89 and -3.11 if you consider the distance from the true <script type="math/tex">\beta_x</script> and <script type="math/tex">\beta_z</script> respectively.</p>
<h2 id="comparing-the-solutions">Comparing the solutions</h2>
<p>The table below quantifies the differences between the estimated parameters and the parameters of the generating function. The top row are the true parameter values from the generating function and the row names are the different estimated parameters in Mr. A’s and Mr. G’s model respectively.</p>
<table>
<thead>
<tr>
<th> </th>
<th>b0</th>
<th>bx</th>
<th>bz</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mr. A b0</td>
<td>6.11</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Mr. A br</td>
<td> </td>
<td>2.89</td>
<td>3.11</td>
</tr>
<tr>
<td>Mr. G b0</td>
<td>0.88</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Mr. G bx</td>
<td> </td>
<td>0.12</td>
<td> </td>
</tr>
<tr>
<td>Mr. G bz</td>
<td> </td>
<td> </td>
<td>0.21</td>
</tr>
</tbody>
</table>
<p>As is apparent from the table, Mr. Aggregation’s model is 289% off with respect to the true <script type="math/tex">\beta_x</script> coefficient, and 44% off with respect to the true <script type="math/tex">\beta_z</script> coefficient. That’s not very impressive, and it actually leads to the wrong conclusions when trying to discern the dynamics of <script type="math/tex">x</script> and <script type="math/tex">z</script> on <script type="math/tex">y</script>.</p>
<p>The corresponding analysis for the granular model gives us better results. Mr. Granularity’s model is 12% off with respect to the true <script type="math/tex">\beta_x</script> coefficient, and 3% off with respect to the true <script type="math/tex">\beta_z</script> coefficient. This seems a lot better. But still, if we have a granular model, why are we so off on the intercept? Well if you remember the generating function from before it looked like this</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, 7)\\
\mu_t &= x_t + 7 z_t\\
x_t &\sim N(3, 1)\\
z_t &\sim N(1, 1)
\end{align} %]]></script>
<p>which is statistically equivalent with the following formulation</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim N(\mu_t, 7)\\
\mu_t &= x_t + 7 z_t + 3\\
x_t &\sim N(0, 1)\\
z_t &\sim N(1, 1)\\
\end{align} %]]></script>
<p>which in turn would make the <script type="math/tex">x_t</script> variable nothing but noise. This can indeed
be confirmed if you simulate many times. This is one of the core problems behind
many models: identifiability. It’s a tough thing, and the very reason why maximum
likelihood cannot be used in general. You need to sample!</p>
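<p>The equivalence between the two formulations can be checked directly: shifting <script type="math/tex">x_t</script> by its mean and moving the 3 into an intercept leaves <script type="math/tex">\mu_t</script> untouched, which is exactly why the data alone cannot tell “<script type="math/tex">x</script> has mean 3” apart from “there is an intercept of 3”. A quick numerical check in NumPy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Original parameterisation: x ~ N(3, 1), mu = x + 7 z
x = rng.normal(3.0, 1.0, size=n)
z = rng.normal(1.0, 1.0, size=n)
mu_a = x + 7.0 * z

# Equivalent parameterisation: x' ~ N(0, 1), mu = x' + 7 z + 3
x_prime = x - 3.0  # same draws, shifted, so x' ~ N(0, 1)
mu_b = x_prime + 7.0 * z + 3.0

# The two formulations give identical mu for every observation
max_diff = float(np.max(np.abs(mu_a - mu_b)))
```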
<h2 id="conclusion">Conclusion</h2>
<p>I’ve shown you today the dangers of aggregating information into a single unit.
There is a version of the strategy shown here which brings the best of both
worlds: hierarchical pooling. This methodology pulls estimates based on data
with low information content towards the mean of the other, more highly
informative ones. The degree of pooling can be readily expressed as a prior
belief about how much the different subparts should be connected. As such, don’t
throw information away. If you believe variables belong together, express that belief
as a prior. Don’t restrict your model to the same biases as you have! In
summary:</p>
<ul>
<li>Always add all the granularity you need to solve the problem</li>
<li>Don’t be afraid of complexity; it’s part of life</li>
<li>Always sample the posteriors when you have complex models</li>
<li>Embrace the uncertainty that your model shows</li>
<li>Be aware that the uncertainty quantified is the <strong>model’s</strong> uncertainty</li>
</ul>
<p>Happy Inferencing!</p>Dr. Michael GreenA gentle introduction to reinforcement learning or what to do when you don’t know what to do2017-05-01T00:00:00+00:002017-05-01T00:00:00+00:00/2017/05/01/A-gentle-introduction-to-reinforcement-learning-or-what-to-do-when-you-dont-know-what-to-do<p>Today we’re going to have a look at an interesting set of learning algorithms
which does not require you to know the truth while you learn. As such this is a
mix of unsupervised and supervised learning. The supervised part comes from the
fact that you look in the rear view mirror after the actions have been taken and
then adapt yourself based on how well you did. This is surprisingly powerful as
it can learn whatever the knowledge representation allows it to. One caveat
though is that it is excruciatingly sloooooow. This naturally stems from the
fact that there is no concept of a right solution. Neither when you are making
decisions nor when you are evaluating them. All you can say is that “Hey, that
wasn’t so bad given what I tried before” but you cannot say that it was the best
thing to do. This puts a dampener on the learning rate. The gain is that we can
learn just about anything given that we can observe the consequence of our
actions in the environment we operate in.</p>
<p><img src="/images/figure/reinforcement.png" alt="plot of the reinforcement learning loop" /></p>
<p>As illustrated above, reinforcement learning can be thought of as an agent
acting in an environment and receiving rewards as a consequence of those
actions. This is in principle a Markov Decision Process (MDP) which basically
captures just about anything you might want to learn in an environment. Formally
the MDP consists of</p>
<ul>
<li>A set of states <script type="math/tex">[s_1, s_2, ..., s_M]</script></li>
<li>A set of actions <script type="math/tex">[a_1, a_2, ..., a_N]</script></li>
<li>A set of rewards <script type="math/tex">[r_1, r_2, ..., r_L]</script></li>
<li>A set of transition probabilities <script type="math/tex">[p_{11}, p_{12}, ..., p_{1M}, p_{21}, p_{22}, ..., p_{2M}, ..., p_{MM}]</script>, where <script type="math/tex">p_{ij}</script> is the probability of moving from state <script type="math/tex">s_i</script> to state <script type="math/tex">s_j</script></li>
</ul>
<p>which looks surprisingly simple but is really all we need. The mission is to
learn the best transition probabilities, i.e., the ones that maximize the expected
total future reward. Thus to move on we need to introduce a little mathematical
notation. First off we need a reward function <script type="math/tex">R(s_t, a_t)</script> which gives us the reward
<script type="math/tex">r_t</script> that comes from taking action <script type="math/tex">a_t</script> in state <script type="math/tex">s_t</script> at time <script type="math/tex">t</script>. We
also need a transition function <script type="math/tex">S(s_t, a_t)</script> which will give us the next
state <script type="math/tex">s_{t+1}</script>. The actions <script type="math/tex">a_t</script> are generated by the agent by following
one or several policies. A policy function <script type="math/tex">P(s_t)</script> therefore generates an
action <script type="math/tex">a_t</script> which will, to its knowledge, give the maximum reward in the
future.</p>
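<p>To make these ingredients concrete, here is a minimal sketch of a made-up two-state MDP with a fixed (not yet learned) policy. Every name and number in it is invented for illustration; it only mirrors the structure of the <script type="math/tex">R</script>, <script type="math/tex">S</script> and <script type="math/tex">P</script> functions above.</p>

```python
import numpy as np

# A made-up two-state MDP: all states, actions and probabilities here are
# invented for illustration only.
def R(s, a):
    # Reward function R(s_t, a_t): +1 for every step we are still "standing".
    return 1.0 if s == "standing" else 0.0

def S(s, a, rng):
    # Transition function S(s_t, a_t): once fallen we stay fallen,
    # otherwise we survive the step with probability 0.9.
    if s == "fallen":
        return "fallen"
    return "standing" if rng.random() < 0.9 else "fallen"

def P(s):
    # A fixed policy function P(s_t) mapping states to actions.
    return "left" if s == "standing" else "right"

rng = np.random.default_rng(0)
s, total_reward = "standing", 0.0
for t in range(20):
    a = P(s)
    total_reward += R(s, a)
    s = S(s, a, rng)
print(total_reward)
```

<p>Reinforcement learning is then the business of replacing the hard-coded <code>P</code> with something the agent improves from the rewards it collects.</p>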
<h2 id="the-problem-we-will-solve---cart-pole">The problem we will solve - Cart Pole</h2>
<p>We will utilize an environment from the <a href="https://gym.openai.com/">OpenAI Gym</a> called the <a href="https://gym.openai.com/envs/CartPole-v0">Cart pole</a> problem. The task is basically learning how to balance a pole by controlling a cart. The environment gives us a new state every time we act in it. This state consists of four observables corresponding to position and movements. This problem has been illustrated before by <a href="https://gist.github.com/awjuliani/86ae316a231bceb96a3e2ab3ac8e646a">Arthur Juliani</a> using <a href="https://www.tensorflow.org/">TensorFlow</a>. Before showing you the implementation we’ll have a look at how a trained agent performs below.</p>
<p><img src="/images/figure/gymcartpolesolved.gif" alt="plot of a working solution" /></p>
<p>As you can see it performs quite well and actually manages to balance the pole by controlling the cart in real time. You might think: “Hey, that sounds easy, I’ll just generate random actions and they should cancel out.” Well, put that thought to rest. Below you can see an illustration of that approach failing.</p>
<p><img src="/images/figure/gymcartpolenotsolved.gif" alt="plot of a failing random policy" /></p>
<p>So to the problem at hand. How can we model this? We need to make an agent that learns a policy that maximizes the future reward right? Right, so at any given time our policy can choose one of two possible actions namely</p>
<ol>
<li>move left</li>
<li>move right</li>
</ol>
<p>which should sound familiar to you if you’ve done any modeling before. This is basically a Bernoulli model where the probability distribution looks like this <script type="math/tex">P(y;p)=p^y(1-p)^{1-y}</script>. Once we know this the task is to model <script type="math/tex">p</script> as a function of the current state <script type="math/tex">s_t</script>. This can be done with a linear model wrapped in a sigmoid like this</p>
<script type="math/tex; mode=display">p_t=P(s_t; \omega)=\frac{1}{1+\exp(-\omega s_t)}</script>
<p>where <script type="math/tex">\omega</script> are the four parameters that will basically control which way we want to move. These four parameters make up the policy. With these two pieces we can set up a likelihood function that can drive our learning.</p>
<script type="math/tex; mode=display">L(\omega, s, y)=\prod_{t=1}^T p_t^{y_t}(1-p_t)^{1-y_t}</script>
<p>where <script type="math/tex">p_t</script> is defined above. This likelihood we want to maximize and in order to do that we will turn it around and instead minimize the negative log likelihood</p>
<script type="math/tex; mode=display">l(\omega)=-\ln L(\omega, s, y)=-\sum_{t=1}^T \left(y_t \ln p_t + (1-y_t) \ln (1-p_t) \right)</script>
<p>which can be solved for our simple model by setting</p>
<script type="math/tex; mode=display">\frac{\partial l(\omega)}{\partial \omega}=0</script>
<p>and doing the math. However, we want to make this general enough to support more complex policies. As such we will employ gradient descent updates to our parameters <script type="math/tex">\omega</script>.</p>
<script type="math/tex; mode=display">\omega^{new}=\omega^{old}-\eta\frac{\partial l(\omega)}{\partial \omega}</script>
<p>where <script type="math/tex">\eta</script> is the learning rate. This can also be considered to change over time dynamically but for now let’s keep it plain old vanilla. This is it for the theory. Now let’s get to the implementation!</p>
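<p>Before we do, one remark on the gradient itself, since we will need it in the code. For our sigmoid policy the gradient has a closed form: applying the chain rule with <script type="math/tex">\frac{\partial p_t}{\partial \omega}=p_t(1-p_t)s_t</script> to the negative log likelihood gives</p>

<script type="math/tex; mode=display">\frac{\partial l(\omega)}{\partial \omega}=\sum_{t=1}^T (p_t-y_t)s_t</script>

<p>which is exactly what the <code>dloss</code> function in the implementation computes, since <script type="math/tex">(1-y_t)p_t-y_t(1-p_t)=p_t-y_t</script>.</p>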
<h2 id="implementation">Implementation</h2>
<p>As the OpenAI Gym is mostly available in Python we’ve chosen to go with that language. This is by no means my preferred language for data science, and I could give you 10 solid arguments as to why it shouldn’t be yours either, but since this post is about machine learning and not data science I won’t expand my thoughts on that. In any case, Python is great for machine learning, which is what we are looking at today. So let’s go ahead and import the Python 3 libraries we’re going to need.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">gym</span>
</code></pre></div></div>
<p>After this let’s look at initiating our environment and setting some variables and placeholders we are going to need.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="o">.</span><span class="n">make</span><span class="p">(</span><span class="s">'CartPole-v0'</span><span class="p">)</span>
<span class="c"># Configuration</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">max_episodes</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">episodes</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">reward_sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">render</span> <span class="o">=</span> <span class="bp">False</span>
<span class="c"># Define place holders for the problem</span>
<span class="n">p</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">dreward</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>
<span class="n">ys</span><span class="p">,</span> <span class="n">ps</span><span class="p">,</span> <span class="n">actions</span><span class="p">,</span> <span class="n">rewards</span><span class="p">,</span> <span class="n">drewards</span><span class="p">,</span> <span class="n">gradients</span> <span class="o">=</span> <span class="p">[],[],[],[],[],[]</span>
<span class="n">states</span> <span class="o">=</span> <span class="n">state</span>
</code></pre></div></div>
<p>Other than this we’re going to use some functions that need to be defined. I’m sure multiple machine learning frameworks have implemented them already but it’s pretty easy to do and quite instructive so why not just do it. ;)</p>
<h2 id="the-python-functions-youre-going-to-need">The python functions you’re going to need</h2>
<p>As we’re implementing this in Python 3, and it’s not always obvious what is Python 3 and what is Python 2, I’m sharing the function definitions I created since they are compliant with the Python 3 libraries, especially Numpy, which is an integral part of numerical computation in Python. Most of these functions are easily implemented and understood. Make sure you read through them and grasp what they’re all about.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">discount_rewards</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mi">1</span><span class="o">-</span><span class="mf">0.99</span><span class="p">):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="p">)):</span>
        <span class="n">df</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">v</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">gamma</span><span class="p">)</span><span class="o">**</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="n">t</span><span class="p">:]))</span>  <span class="c"># np.npv was removed from NumPy; this is the equivalent explicit sum</span>
<span class="k">return</span> <span class="n">df</span>
<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">dsigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">a</span><span class="o">=</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">a</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">a</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">decide</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">vdot</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">loglikelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
<span class="k">return</span> <span class="n">y</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">p</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">p</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">weighted_loglikelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">dr</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="n">y</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">p</span><span class="p">)</span><span class="o">+</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">p</span><span class="p">))</span><span class="o">*</span><span class="n">dr</span>
<span class="k">def</span> <span class="nf">loss</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">dr</span><span class="p">):</span>
<span class="k">return</span> <span class="o">-</span><span class="n">weighted_loglikelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">dr</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">dloss</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">dr</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">dr</span><span class="o">*</span><span class="p">(</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="o">*</span><span class="n">p</span> <span class="o">-</span> <span class="n">y</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">p</span><span class="p">))),</span> <span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">),</span><span class="mi">1</span><span class="p">])</span><span class="o">*</span><span class="n">x</span>
</code></pre></div></div>
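<p>Since hand-rolled gradients are a classic source of silent bugs, a quick sanity check is worthwhile. The sketch below, with its own self-contained copies of the functions and a single made-up observation, compares <code>dloss</code> against a finite-difference approximation of <code>loss</code>:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(y, p, dr):
    # Reward-weighted negative log likelihood for one observation.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)) * dr

def dloss(y, p, dr, x):
    # Analytic gradient w.r.t. the weights: dr * (p - y) * x.
    return dr * (p - y) * x

# A single made-up observation: state x, weights w, label y, reward weight dr.
x = np.array([0.5, -1.0, 0.3, 2.0])
w = np.array([0.1, 0.2, -0.3, 0.4])
y, dr, eps = 1, 1.7, 1e-6

analytic = dloss(y, sigmoid(np.dot(w, x)), dr, x)
numeric = np.zeros_like(w)
for i in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    # Central finite difference of the loss along weight i.
    numeric[i] = (loss(y, sigmoid(np.dot(wp, x)), dr)
                  - loss(y, sigmoid(np.dot(wm, x)), dr)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be vanishingly small
```

<p>If the two disagree by more than numerical noise, something is wrong with either the math or the code.</p>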
<p>Armed with these functions we’re ready to do the main learning loop, which is where the logic of the agent and the training takes place. This will be the heaviest part to run through so take your time.</p>
<h2 id="the-learning-loop">The learning loop</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="n">episodes</span> <span class="o"><</span> <span class="n">max_episodes</span><span class="p">:</span>
<span class="k">if</span> <span class="n">reward_sum</span> <span class="o">></span> <span class="mi">190</span> <span class="ow">or</span> <span class="n">render</span><span class="o">==</span><span class="bp">True</span><span class="p">:</span>
<span class="n">env</span><span class="o">.</span><span class="n">render</span><span class="p">()</span>
<span class="n">render</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">decide</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">action</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">p</span> <span class="o">></span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">()</span> <span class="k">else</span> <span class="mi">0</span>
<span class="n">state</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">reward_sum</span> <span class="o">+=</span> <span class="n">reward</span>
<span class="c"># Add to place holders</span>
<span class="n">ps</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="n">actions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">ys</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="n">rewards</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">reward</span><span class="p">)</span>
<span class="c"># Check if the episode is over and calculate gradients</span>
<span class="k">if</span> <span class="n">done</span><span class="p">:</span>
<span class="n">episodes</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="n">drewards</span> <span class="o">=</span> <span class="n">discount_rewards</span><span class="p">(</span><span class="n">rewards</span><span class="p">)</span>
<span class="n">drewards</span> <span class="o">-=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">drewards</span><span class="p">)</span>
<span class="n">drewards</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">drewards</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">gradients</span><span class="p">)</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
<span class="n">gradients</span> <span class="o">=</span> <span class="n">dloss</span><span class="p">(</span><span class="n">ys</span><span class="p">,</span> <span class="n">ps</span><span class="p">,</span> <span class="n">drewards</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">gradients</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">gradients</span><span class="p">,</span> <span class="n">dloss</span><span class="p">(</span><span class="n">ys</span><span class="p">,</span> <span class="n">ps</span><span class="p">,</span> <span class="n">drewards</span><span class="p">,</span> <span class="n">states</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)))</span>
<span class="k">if</span> <span class="n">episodes</span> <span class="o">%</span> <span class="n">batch_size</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">params</span> <span class="o">-</span> <span class="n">learning_rate</span><span class="o">*</span><span class="n">gradients</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">gradients</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Average reward for episode"</span><span class="p">,</span> <span class="n">reward_sum</span><span class="o">/</span><span class="n">batch_size</span><span class="p">)</span>
<span class="k">if</span> <span class="n">reward_sum</span><span class="o">/</span><span class="n">batch_size</span> <span class="o">>=</span> <span class="mi">200</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Problem solved!"</span><span class="p">)</span>
<span class="n">reward_sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c"># Reset all</span>
<span class="n">state</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">reset</span><span class="p">()</span>
<span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">dreward</span><span class="p">,</span> <span class="n">g</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>
<span class="n">ys</span><span class="p">,</span> <span class="n">ps</span><span class="p">,</span> <span class="n">actions</span><span class="p">,</span> <span class="n">rewards</span><span class="p">,</span> <span class="n">drewards</span> <span class="o">=</span> <span class="p">[],[],[],[],[]</span>
<span class="n">states</span> <span class="o">=</span> <span class="n">state</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">states</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">states</span><span class="p">,</span> <span class="n">state</span><span class="p">))</span>
<span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
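<p>To see what the reward discounting and normalization inside the loop are actually doing, here is a standalone sketch on a made-up five-step episode. It uses the standard recursive formulation of discounting with <script type="math/tex">\gamma=0.99</script>, which is equivalent to the <code>discount_rewards</code> function above up to the parameterization of the discount factor.</p>

```python
import numpy as np

def discounted_future_rewards(r, gamma=0.99):
    # df[t] = sum_k gamma^k * r[t+k], computed with a backward recursion.
    df = np.zeros(len(r))
    running = 0.0
    for t in reversed(range(len(r))):
        running = r[t] + gamma * running
        df[t] = running
    return df

rewards = [1.0] * 5                # CartPole hands out +1 per surviving step
d = discounted_future_rewards(rewards)
d = d - np.mean(d)                 # centre, as in the training loop...
d = d / np.std(d)                  # ...then scale to unit variance
print(d)                           # early steps weighted up, late steps down
```

<p>After normalization the early actions of an episode get positive weight and the late ones negative weight, which is what nudges the policy toward behavior that keeps the episode alive for longer.</p>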
<p>Phew! There it was, and it wasn’t so bad was it? We now have a fully working reinforcement learning agent that learns the CartPole problem by policy gradient learning. Now, for those of you who know me, you know I’m always preaching about considering all possible solutions that are consistent with your data. So maybe there is more than one solution to the CartPole problem? Indeed there is. The next section will show you a distribution of these solutions across the four parameters.</p>
<h2 id="multiple-solutions">Multiple solutions</h2>
<p>So we have solved the CartPole problem using our learning agent, and if you run it multiple times you will see that it converges to different solutions. We can create a distribution over all of these different solutions, which will inform us about the solution space of all possible models supported by our parameterization. The plot is given below, where the x axis shows the parameter values and the y axis the probability density.</p>
<p><img src="/images/figure/solutiondistribution.png" alt="plot of all possible solutions" /></p>
<p>You can see that <script type="math/tex">X_0</script> and <script type="math/tex">X_1</script> should be around <script type="math/tex">0</script> while <script type="math/tex">X_2</script> and <script type="math/tex">X_3</script> should be around <script type="math/tex">1</script>. But several other solutions exist as illustrated. So naturally this uncertainty about exactly what the parameters should be could be taken into account by a learning agent.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have implemented a reinforcement learning agent that acts in an environment with the purpose of maximizing the future reward. We have also discounted that future reward in the code but not covered it in the math. It’s straightforward though. The concept of being able to learn from your own mistakes is quite cool and represents a learning paradigm which is neither supervised nor unsupervised but rather a combination of both. Another appealing thing about this methodology is that it is very similar to how biological creatures learn from interacting with their environment. Today we solved the CartPole but the methodology can be used to attack far more interesting problems.</p>
<p>I hope you had fun reading this and learned something.</p>
<p>Happy inferencing!</p>Dr. Michael GreenA few thoughts on apparent bimodality for regression problems!2017-04-09T00:00:00+00:002017-04-09T00:00:00+00:00/2017/04/09/A-few-thoughts-on-apparent-bimodality-for-regression-problems<p>Did you ever run into a scenario where your data shows two distinct
relationships but you’re trying to solve for it with one regression line? This
happens to me a lot. So I thought about having some fun with it instead of
dreading it and the nasty consequences that may arise from this behaviour. Below
you’ll see a plot featuring two variables, <script type="math/tex">x</script>, and <script type="math/tex">y</script> where we are tasked
with figuring out how the value of <script type="math/tex">y</script> depends on <script type="math/tex">x</script>.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mydf</span><span class="o"><-</span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">30</span><span class="p">,</span><span class="m">0.2</span><span class="p">),</span><span class="w">
</span><span class="n">z</span><span class="o">=</span><span class="n">ifelse</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">></span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">x</span><span class="o">*</span><span class="n">ifelse</span><span class="p">(</span><span class="n">z</span><span class="o"><</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="o">+</span><span class="n">rnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">mydf</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
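<p>For Python readers, the same generating process can be sketched with numpy. This is a hypothetical translation of the R snippet above, not the code that produced the plots:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# x on a grid from 0 to 30 in steps of 0.2
x = np.linspace(0, 30, 151)
# z is a latent state in {1, 2}, chosen with equal probability per point
z = np.where(rng.uniform(size=x.shape) > 0.5, 1, 2)
# y has slope 1 in state 1 and slope 3 in state 2, plus N(0, 5) noise
y = x * np.where(z < 2, 1, 3) + rng.normal(0, 5, size=x.shape)
```

<p>Plotting <code>y</code> against <code>x</code> gives the same fan of two noisy lines as the figure below.</p>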
<p><img src="/images/figure/problemplot-1.png" alt="plot of chunk problemplot" /></p>
<p>Naturally, what comes to most people’s minds is that we need to model <script type="math/tex">y_t=\omega
f(x_t)+\epsilon</script> where <script type="math/tex">f</script> and <script type="math/tex">\omega</script> are currently unknown. The most
straightforward solution to this is to assume that we are in a linear regime and
consequently that <script type="math/tex">f(x)=I(x)=x</script> where <script type="math/tex">I</script> is the identity function. The
equation then quickly becomes <script type="math/tex">y_t=\omega x_t+\epsilon</script> at which time data
scientists usually rejoice and apply linear regression. So let’s do just that
shall we.</p>
<p><img src="/images/figure/unnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<p>Most of us would agree that the solution with the linear model to the left is
not a very nice scenario. We’re always off in terms of knowing the real
expectation value. Conceptually this is not very difficult though; we humans do
this kind of pattern matching all the time. If I show you another solution, the
one to the right, then what would you say? Hopefully you would recognize it as
something you would approve of. The problem is that a linear model
cannot capture this; you need a transformation function to accomplish it.</p>
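<p>To see the failure concretely, here is a small numpy sketch in the spirit of the R simulation at the top of the post (slopes 1 and 3, noise sd 5; the exact numbers are illustrative, not the post’s data): a single least-squares line simply splits the difference between the two regimes.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Two-regime data, mirroring the R simulation: slope 1 when z < 2, slope 3 otherwise.
n = 500
x = rng.uniform(0, 50, n)
z = rng.integers(1, 4, n)                      # illustrative state variable
y = np.where(z < 2, 1.0, 3.0) * x + rng.normal(0, 5, n)

# One straight line through both regimes, via ordinary least squares.
A = np.column_stack([x, np.ones(n)])
beta_hat, alpha_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # a single slope near 2.3: it fits neither regime
```

<p>The fitted slope is a weighted average of the two true slopes, which is exactly the “always off” behavior described above.</p>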
<p>But wait! We’re all Bayesians here, aren’t we? So maybe we can capture this
behavior by just letting our model support two modes for the slope parameter? As
such we would never really know which slope cluster would be chosen at any
given time, and naturally the expectation would end up between the two lines,
where the posterior probability is zero. Let’s have a look at what the following
model does when exposed to this data.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim \mathcal N(\mu_t, \sigma)\\
\mu_t &=\beta x_t + \alpha\\
\beta &\sim \mathcal C(0, 10)\\
\alpha &\sim \mathcal N(0, 1)\\
\sigma &\sim \mathcal U(0.01, \infty)
\end{align} %]]></script>
<p>Below you can see the plotted simulated regression lines from the model. Not
great, is it? Not only did our assumption of bimodality fall through, but we’re
indeed no better off than before. Why? Well, in this case the mathematical
formulation of the problem was just plain wrong. Depending on multimodality to
cover up for your model specification sins is just bad practice.</p>
<p><img src="/images/figure/prediction plot-1.png" alt="plot of chunk prediction plot" /></p>
<p>Ok, so if the previous model was badly specified then what should we do to fix it? In principle we want the following behavior <script type="math/tex">y_t=x_t(\beta+\omega z_t)+\alpha</script> where <script type="math/tex">z_t</script> is a binary state variable indicating whether the current <script type="math/tex">x_t</script> has the first or the second response type. The full model we then might want to consider looks like this.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim \mathcal N(\mu_t, \sigma)\\
\mu_t &=x_t(\beta+\omega z_t)+\alpha\\
\omega &\sim \mathcal N(0, 1)\\
z_t &\sim \mathcal{Bin}(1, 0.5)\\
\beta &\sim \mathcal C(0, 10)\\
\alpha &\sim \mathcal N(0, 1)\\
\sigma &\sim \mathcal U(0.01, \infty)
\end{align} %]]></script>
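<p>A side note before moving on: samplers like Stan cannot draw the discrete <script type="math/tex">z_t</script> directly, but with the fifty-fifty prior above it can be summed out of the likelihood analytically. Here is a hypothetical numpy sketch of that marginal likelihood (illustrative parameter values, not fitted results):</p>

```python
import numpy as np

def mixture_loglik(y, x, beta, omega, alpha, sigma):
    """Log-likelihood with the binary state z_t summed out under its
    Bin(1, 0.5) prior: p(y_t) = 0.5*N(y_t | x_t*beta + alpha, sigma)
    + 0.5*N(y_t | x_t*(beta + omega) + alpha, sigma)."""
    def log_norm(mu):
        return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)
    lp_z0 = np.log(0.5) + log_norm(x * beta + alpha)
    lp_z1 = np.log(0.5) + log_norm(x * (beta + omega) + alpha)
    return np.logaddexp(lp_z0, lp_z1).sum()

# Simulate data matching the model: beta=1, omega=2, alpha=0, sigma=5.
rng = np.random.default_rng(1)
x = rng.uniform(0, 50, 400)
z = rng.integers(0, 2, 400)
y = x * (1.0 + 2.0 * z) + rng.normal(0, 5, 400)

ll_mixture = mixture_loglik(y, x, beta=1.0, omega=2.0, alpha=0.0, sigma=5.0)
ll_single = mixture_loglik(y, x, beta=2.0, omega=0.0, alpha=0.0, sigma=5.0)
print(ll_mixture > ll_single)  # True: the marginalized two-slope model wins easily
```

<p>Each observation contributes a log-sum-exp over the two states, so the discrete variable never has to be sampled.</p>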
<p>This would allow the state to be modeled as a latent variable in time. This is very useful for a variety of problems where we know something to be true but lack observed data to quantify it. However, modeling discrete latent variables can be computationally demanding if all you are really looking for is an extra dimension. We can of course design around this: instead of viewing <script type="math/tex">z_t</script> as a latent state variable, we can precode the state by unsupervised hierarchical clustering. The code in R would look like this.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="o"><-</span><span class="n">dist</span><span class="p">(</span><span class="n">mydf</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"y"</span><span class="p">,</span><span class="w"> </span><span class="s2">"x"</span><span class="p">)])</span><span class="w">
</span><span class="n">hc</span><span class="o"><-</span><span class="n">hclust</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="w">
</span><span class="n">mydf</span><span class="o"><-</span><span class="n">mutate</span><span class="p">(</span><span class="n">mydf</span><span class="p">,</span><span class="w"> </span><span class="n">zz</span><span class="o">=</span><span class="n">cutree</span><span class="p">(</span><span class="n">hc</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p>which encodes the clustered state in a variable called <script type="math/tex">zz</script>. Consequently it would produce a hierarchical cluster like the one below.</p>
<p><img src="/images/figure/clusterplot-1.png" alt="plot of chunk clusterplot" /></p>
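<p>For readers working in Python rather than R, the same pre-clustering step could be sketched with scipy (a hypothetical equivalent, not from the post; R’s hclust defaults to complete linkage, which scipy also supports):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Illustrative two-regime data in the spirit of the post's simulation.
rng = np.random.default_rng(7)
x = rng.uniform(0, 50, 300)
z = rng.integers(0, 2, 300)
y = x * np.where(z == 0, 1.0, 3.0) + rng.normal(0, 5, 300)

# dist() + hclust() + cutree(hc, 2) in R, rebuilt with scipy:
d = pdist(np.column_stack([y, x]))            # pairwise distances on (y, x)
hc = linkage(d, method="complete")            # hclust's default linkage
zz = fcluster(hc, t=2, criterion="maxclust")  # cut the tree into two clusters
print(np.unique(zz))  # cluster labels 1 and 2
```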
<p>This leaves us in a position where we can treat <script type="math/tex">z_t</script> as observed data even though we sort of clustered it. The revised math is given below.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y_t &\sim \mathcal N(\mu_t, \sigma)\\
\mu_t &=x_t(\beta+\omega z_t)+\alpha\\
\omega &\sim \mathcal N(0, 1)\\
\beta &\sim \mathcal C(0, 10)\\
\alpha &\sim \mathcal N(0, 1)\\
\sigma &\sim \mathcal U(0.01, \infty)
\end{align} %]]></script>
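<p>As a quick sanity check (my own, outside the Bayesian machinery) that this likelihood is identifiable once <script type="math/tex">z_t</script> is treated as observed: the mean <script type="math/tex">\mu_t=x_t(\beta+\omega z_t)+\alpha</script> is linear in the parameters, so ordinary least squares on the implied design matrix recovers both slopes.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 50, n)
z = rng.integers(0, 2, n)                            # the pre-clustered state
y = x * (1.0 + 2.0 * z) + 0.5 + rng.normal(0, 5, n)  # beta=1, omega=2, alpha=0.5

# mu_t = beta*x_t + omega*(x_t*z_t) + alpha is linear in (beta, omega, alpha).
A = np.column_stack([x, x * z, np.ones(n)])
beta_hat, omega_hat, alpha_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(round(beta_hat, 1), round(omega_hat, 1))  # recovers roughly 1.0 and 2.0
```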
<p>Comparing the results from our first model with the current one we can see that we’re obviously doing better. The clustering works pretty well. The graph to the left is the first model and the one to the right is the revised model with an updated likelihood.</p>
<p><img src="/images/figure/1vs2-1.png" alt="plot of chunk 1vs2" /></p>
<p>As always, it is instructive to look at the posteriors of the parameters of our second model. They are depicted below. You can see that the “increase in slope” parameter <script type="math/tex">\omega</script> clearly captures the new behavior we wished to model.</p>
<p><img src="/images/figure/model results-1.png" alt="plot of chunk model results" /></p>
<h1 id="conclusion">Conclusion</h1>
<p>This post has been about not becoming blind to the mathematical restrictions we impose on the total model by sticking to a too simplistic representation. Also, in this case the Bayesian formalism does not save us with its bimodal capabilities, since the model was misspecified.</p>
<ul>
<li>Think about all aspects of your model before you push the inference button</li>
<li>Be aware that something that might appear to be a clear-cut case for multimodality may actually be a pathological problem in your model</li>
<li>Also, be aware that sometimes multimodality <em>is</em> expected and totally OK</li>
</ul>
<p>Happy inferencing!</p>Dr. Michael GreenDid you ever run into a scenario where your data shows two distinct relationships but you’re trying to solve for it with one regression line? This happens to me a lot. So I thought about having some fun with it instead of dreading it and the nasty consequences that may arise from this behavior. Below you’ll see a plot featuring two variables, , and where we are tasked with figuring out how the value of depends on .A first look at Edward!2017-04-01T00:00:00+00:002017-04-01T00:00:00+00:00/2017/04/01/A-first-look-at-Edward<p>There’s a new kid on the inference block called “Edward” who is full of potential and promises of merging probabilistic programming, computational graphs, and inference! There’s also talk of a 35-times speed-up compared to our good old reliable fellow “Stan”. Today I will run some comparisons for problems that currently interest me, namely time series with structural hyperparameters.</p>
<p>To start things off and make sure we have all our ducks in a row for running Edward, we need to install it using the Python package installer pip, which is available in most Linux distros. I will use pip3 here because I use python3 instead of python2. It shouldn’t matter which one you choose though. So go ahead and install Edward.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>pip3 install edward
</code></pre></div></div>
<p>If you ran this in the console you should now have a working version of Edward installed in your Python environment. So far so good. Just to make sure it works, we will run a small Bayesian neural network with Gaussian priors for the weights. This is the standard example from Edward’s web page. It uses a variational inference approach to turn the sampling problem into an optimization problem by approximating the target posterior with a multivariate Gaussian. This approach works OK for quite a few problems. However, it is a tad hyped as a general-purpose inference engine and should not be considered a replacement for a real sampler. In any case, try to run the code below and check it out. For this toy dataset it works fine. ;)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">absolute_import</span>
<span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">division</span>
<span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">import</span> <span class="nn">edward</span> <span class="k">as</span> <span class="n">ed</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">edward.models</span> <span class="kn">import</span> <span class="n">Normal</span>
<span class="k">def</span> <span class="nf">build_toy_dataset</span><span class="p">(</span><span class="n">N</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">noise_std</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="n">N</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">noise_std</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>
<span class="k">def</span> <span class="nf">neural_network</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">W_0</span><span class="p">,</span> <span class="n">W_1</span><span class="p">,</span> <span class="n">b_0</span><span class="p">,</span> <span class="n">b_1</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">W_0</span><span class="p">)</span> <span class="o">+</span> <span class="n">b_0</span><span class="p">)</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">W_1</span><span class="p">)</span> <span class="o">+</span> <span class="n">b_1</span>
    <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">ed</span><span class="o">.</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">50</span> <span class="c"># number of data points</span>
<span class="n">D</span> <span class="o">=</span> <span class="mi">1</span> <span class="c"># number of features</span>
<span class="c"># DATA</span>
<span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">build_toy_dataset</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
<span class="c"># MODEL</span>
<span class="n">W_0</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="n">D</span><span class="p">,</span> <span class="mi">2</span><span class="p">]),</span> <span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="n">D</span><span class="p">,</span> <span class="mi">2</span><span class="p">]))</span>
<span class="n">W_1</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]),</span> <span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="n">b_0</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="n">b_1</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x_train</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">neural_network</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">W_0</span><span class="p">,</span> <span class="n">W_1</span><span class="p">,</span> <span class="n">b_0</span><span class="p">,</span> <span class="n">b_1</span><span class="p">),</span>
<span class="n">sigma</span><span class="o">=</span><span class="mf">0.1</span> <span class="o">*</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">N</span><span class="p">))</span>
<span class="c"># INFERENCE</span>
<span class="n">qW_0</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="n">D</span><span class="p">,</span> <span class="mi">2</span><span class="p">])),</span>
<span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="n">D</span><span class="p">,</span> <span class="mi">2</span><span class="p">]))))</span>
<span class="n">qW_1</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">])),</span>
<span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))))</span>
<span class="n">qb_0</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">2</span><span class="p">])),</span>
<span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">2</span><span class="p">]))))</span>
<span class="n">qb_1</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">1</span><span class="p">])),</span>
<span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">1</span><span class="p">]))))</span>
<span class="n">inference</span> <span class="o">=</span> <span class="n">ed</span><span class="o">.</span><span class="n">KLqp</span><span class="p">({</span><span class="n">W_0</span><span class="p">:</span> <span class="n">qW_0</span><span class="p">,</span> <span class="n">b_0</span><span class="p">:</span> <span class="n">qb_0</span><span class="p">,</span>
<span class="n">W_1</span><span class="p">:</span> <span class="n">qW_1</span><span class="p">,</span> <span class="n">b_1</span><span class="p">:</span> <span class="n">qb_1</span><span class="p">},</span> <span class="n">data</span><span class="o">=</span><span class="p">{</span><span class="n">y</span><span class="p">:</span> <span class="n">y_train</span><span class="p">})</span>
<span class="c"># Sample functions from variational model to visualize fits.</span>
<span class="n">rs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="n">inputs</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">mus</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">mus</span> <span class="o">+=</span> <span class="p">[</span><span class="n">neural_network</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">qW_0</span><span class="o">.</span><span class="n">sample</span><span class="p">(),</span> <span class="n">qW_1</span><span class="o">.</span><span class="n">sample</span><span class="p">(),</span>
            <span class="n">qb_0</span><span class="o">.</span><span class="n">sample</span><span class="p">(),</span> <span class="n">qb_1</span><span class="o">.</span><span class="n">sample</span><span class="p">())]</span>
<span class="n">mus</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">mus</span><span class="p">)</span>
<span class="n">sess</span> <span class="o">=</span> <span class="n">ed</span><span class="o">.</span><span class="n">get_session</span><span class="p">()</span>
<span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">global_variables_initializer</span><span class="p">()</span>
<span class="n">init</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="c"># Prior samples</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">mus</span><span class="o">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">mydf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">outputs</span><span class="p">)</span>
<span class="n">mydf</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"mydfprior.csv"</span><span class="p">)</span>
<span class="c"># Inference</span>
<span class="n">inference</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">n_iter</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">n_samples</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="c"># Posterior samples</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">mus</span><span class="o">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">mydf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">outputs</span><span class="p">)</span>
<span class="n">mydf</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"mydfpost.csv"</span><span class="p">)</span>
</code></pre></div></div>
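<p>One reason for the “tad hyped” caveat above deserves a demonstration. KLqp minimizes KL(q‖p), which is mode-seeking: faced with a multimodal posterior, the single Gaussian q locks onto one mode instead of covering them all. A small standalone numpy illustration of this (my own, not Edward code), brute-forcing the best Gaussian against a bimodal target:</p>

```python
import numpy as np

# A bimodal "posterior": an equal mixture of N(-3, 0.5) and N(3, 0.5).
grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]

def norm_pdf(t, mu, s):
    return np.exp(-(t - mu) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

p = 0.5 * norm_pdf(grid, -3.0, 0.5) + 0.5 * norm_pdf(grid, 3.0, 0.5)

def kl_q_p(mu, s):
    """KL(q || p) for a Gaussian q, by simple numerical integration."""
    q = norm_pdf(grid, mu, s)
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

# Brute-force the variational optimization over (mu, sigma).
best_kl, best_mu, best_sigma = min(
    (kl_q_p(m, s), m, s)
    for m in np.linspace(-5, 5, 101)
    for s in np.linspace(0.2, 3.0, 29)
)
print(best_mu, best_sigma)  # q sits on one mode (mu near +/-3), not between them
```

<p>The optimal q matches one component almost exactly and ignores the other entirely; a sampler would instead visit both modes.</p>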
<h2 id="checking-the-priors">Checking the priors</h2>
<p>If you ran the code you will now have two files called mydfprior.csv and mydfpost.csv which contain, surprise surprise, your prior and posterior curves based on the samples. As always, we plot out the consequences of our priors and check what happens. In the plot below you can see that our priors for the Bayesian neural network do not really produce curves that resemble what we’re looking for. Not to worry, my friends; we don’t actually need them to. Look through the graph and make sure you understand why the plot looks the way it does.</p>
<p><img src="/images/figure/Prior plot-1.png" alt="plot of chunk Prior plot" /></p>
<h2 id="checking-the-posteriors">Checking the posteriors</h2>
<blockquote>
<p>The problem with the world is not that people know too little. It’s that they know so many things that just ain’t so.
– Mark Twain</p>
</blockquote>
<p>So the priors hopefully make sense to you now. How about our posterior? Well, as you can see below, this plot makes a lot more sense given the data we’re trying to explain. However, you can clearly see some uncertainty in there as well. This is key, since no matter which model we choose there will always be elements of uncertainty involved. The great thing about science is that we don’t have to pretend to know everything. We’re perfectly comfortable admitting our ignorance. The probabilistic framework allows us to quantify that ignorance! If you think that sounds like a bad idea I suggest you take a look at the quote above.</p>
<p><img src="/images/figure/Post plot-1.png" alt="plot of chunk Post plot" /></p>
<h1 id="a-more-real-world-problem">A more real world problem</h1>
<h2 id="a-first-look-at-the-data">A first look at the data</h2>
<p>There are few real-world problems as pressing as that of global warming. Whenever I’m talking about global warming I feel there are two responses I get, basically binomially distributed: the majority of people quickly get it, while the rest are oblivious to the facts presented to them. In reality there can be little to no doubt that humans are causing the greenhouse effect. The plot below shows the evolution of the temperature anomaly over time, from the 1850s until present time.</p>
<p><img src="/images/figure/globwarmdata-1.png" alt="plot of chunk globwarmdata" /></p>
<p>As you can see, there is a clear trend showing that the world is getting warmer. So global warming is indeed happening, and it’s causing some very measurable, real problems for us. There are lobbyist organizations around the world who wish to tell you that this is not really caused by our increased CO2 emissions, since they have, well, ulterior motives. If we plot the number of passengers per year in the same plot as the temperature deviance, it’s also quite apparent that the two might be at least weakly related. However, really proving that requires more work than I’m going to do in this post. This post is about Edward and error correction models, not global warming.</p>
<p><img src="/images/figure/jointplot-1.png" alt="plot of chunk jointplot" /></p>
<h2 id="specifying-a-model">Specifying a model</h2>
<p>As you might have noticed, these data seem like they suffer from measurement errors; especially the temperature, but also the passengers. So when we cannot trust the data to be point measurements, what do we do? Well, we create an error correction model. Since we are working with probability we can express any kind of measurement as a probabilistic process. Below you’ll find code for the error correction model for the deviance in global temperature expressed in Edward. Mathematically the error correction model is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
X_t & \sim \mathcal N(\mu_{X,t}, \sigma_X)\\
\mu_{X,t} &\sim \mathcal N(0, 0.5)\\
\sigma_X &= 0.1
\end{align} %]]></script>
<p>where you can see that we fixed the noise so that we inform the model of the scale of the errors that we believe we will observe. If this is set too high then naturally nothing will emerge since the error is much larger than the signal. Be aware of these things in general when you express your likelihood functions!</p>
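<p>The effect of that fixed noise scale is easy to make concrete with the conjugate normal-normal update (my own arithmetic, using the prior <script type="math/tex">\mathcal N(0, 0.5)</script> from above): the posterior mean of <script type="math/tex">\mu_{X,t}</script> is the observation shrunk by a weight that collapses once the assumed noise dwarfs the prior scale.</p>

```python
def posterior_mean(x_obs, prior_sd=0.5, noise_sd=0.1):
    """Conjugate normal-normal update for a single observation:
    mu_X ~ N(0, prior_sd), x_obs ~ N(mu_X, noise_sd).
    The posterior mean is x_obs weighted by prior_sd^2/(prior_sd^2 + noise_sd^2)."""
    w = prior_sd**2 / (prior_sd**2 + noise_sd**2)  # weight placed on the data
    return w * x_obs

x_obs = 1.0
print(posterior_mean(x_obs, noise_sd=0.1))  # ~0.96: the data dominates
print(posterior_mean(x_obs, noise_sd=5.0))  # ~0.01: the signal drowns in assumed noise
```

<p>With the post’s setting of 0.1 the data drives the estimate; crank the assumed error up to 5 and the posterior mean collapses back onto the prior, which is exactly the “nothing will emerge” warning above.</p>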
<p>Check out the math above and make sure you understand the code below to see how Edward materializes this model. It’s slightly different from Stan but you should be able to recognize most of the model setup. Do not worry too much about all the bookkeeping for extracting and merging the priors and posteriors, especially the last part where I export the distributions. I do this because the plots I will show you soon will be done in R. Not because they cannot be done in Python, but because doing them in Python makes me want to kill myself.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">edward</span> <span class="k">as</span> <span class="n">ed</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">ggplot</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">edward.models</span> <span class="kn">import</span> <span class="n">Normal</span><span class="p">,</span> <span class="n">Bernoulli</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">mydf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"../humanglobwarm.csv"</span><span class="p">)</span>
<span class="n">N</span> <span class="o">=</span> <span class="n">mydf</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">mydf</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="c"># Define the model</span>
<span class="n">x_mu</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">N</span><span class="p">),</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">0.5</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([]))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">x_mu</span><span class="p">,</span> <span class="n">sigma</span><span class="o">=</span><span class="mf">0.1</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([]))</span>
<span class="c"># VI placeholder</span>
<span class="n">qx_mu</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">mu</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">N</span><span class="p">)),</span> <span class="n">sigma</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="mf">0.5</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">N</span><span class="p">))))</span>
<span class="c"># Set up data and the inference method to Kullback Leibler</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">mydf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"Tempdev"</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">([</span><span class="n">N</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">sess</span> <span class="o">=</span> <span class="n">ed</span><span class="o">.</span><span class="n">get_session</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">x_train</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]}</span>
<span class="n">inference</span> <span class="o">=</span> <span class="n">ed</span><span class="o">.</span><span class="n">KLqp</span><span class="p">({</span><span class="n">x_mu</span><span class="p">:</span> <span class="n">qx_mu</span><span class="p">},</span> <span class="n">data</span><span class="p">)</span>
<span class="c"># Set up for samples from models</span>
<span class="n">mus</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">mus</span> <span class="o">+=</span> <span class="p">[</span><span class="n">qx_mu</span><span class="o">.</span><span class="n">sample</span><span class="p">()]</span>
<span class="n">mus</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">mus</span><span class="p">)</span>
<span class="c"># Inference: More controlled way of inference running</span>
<span class="n">inference</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">n_print</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="mi">600</span><span class="p">)</span>
<span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">global_variables_initializer</span><span class="p">()</span>
<span class="n">init</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="c"># Prior samples</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">mus</span><span class="o">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">priordf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">outputs</span><span class="p">)</span>
<span class="n">priordf</span><span class="p">[</span><span class="s">'Sample'</span><span class="p">]</span><span class="o">=</span><span class="p">[</span><span class="s">"Sample"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))]</span>
<span class="n">priordf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">priordf</span><span class="p">,</span> <span class="n">id_vars</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">)</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">priordf</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s">"value"</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"variable"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">))</span> <span class="o">+</span> <span class="n">geom_line</span><span class="p">()</span>
<span class="n">priordf</span><span class="p">[</span><span class="s">'Type'</span><span class="p">]</span><span class="o">=</span><span class="s">'Prior'</span>
<span class="c"># Run Inference</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">inference</span><span class="o">.</span><span class="n">n_iter</span><span class="p">):</span>
    <span class="n">info_dict</span> <span class="o">=</span> <span class="n">inference</span><span class="o">.</span><span class="n">update</span><span class="p">()</span>
    <span class="n">inference</span><span class="o">.</span><span class="n">print_progress</span><span class="p">(</span><span class="n">info_dict</span><span class="p">)</span>
<span class="n">inference</span><span class="o">.</span><span class="n">finalize</span><span class="p">()</span>
<span class="c"># Posterior samples</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">mus</span><span class="o">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">postdf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">outputs</span><span class="p">)</span>
<span class="n">postdf</span><span class="p">[</span><span class="s">'Sample'</span><span class="p">]</span><span class="o">=</span><span class="p">[</span><span class="s">"Sample"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))]</span>
<span class="n">postdf</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">postdf</span><span class="p">,</span> <span class="n">id_vars</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">)</span>
<span class="n">ggplot</span><span class="p">(</span><span class="n">postdf</span><span class="p">,</span> <span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s">"value"</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"variable"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">))</span> <span class="o">+</span> <span class="n">geom_line</span><span class="p">()</span>
<span class="n">postdf</span><span class="p">[</span><span class="s">'Type'</span><span class="p">]</span><span class="o">=</span><span class="s">'Posterior'</span>
<span class="c"># One glorious data frame for export</span>
<span class="n">tmpdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">priordf</span><span class="p">,</span> <span class="n">postdf</span><span class="p">])</span>
<span class="n">tmpdf</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"errorcorrsamplesdf.csv"</span><span class="p">)</span>
</code></pre></div></div>
<p>Did you get through the code? Good, then let’s have a look at our dear posteriors and priors and data! We will start off by looking at 10 samples from the priors and posteriors. The x-axis in the plot below represents the years 1993 to 2015, where 0 is 1993 and 22 is 2015.</p>
<p><img src="/images/figure/errorcorrmodel-1.png" alt="plot of chunk errorcorrmodel" /></p>
<p>We may want to look into the average effect of the error correction model and compare it to the data we observed. As you can see here, we extracted a new mean for every observed point, so it’s not surprising that it’s consistent with the data. However, do you have any observations regarding the uncertainty of the likelihood? I’m sure you do, so play around with it and check what happens. Remember that the sigma for the latent variable quantifies the uncertainty about the location of the mean but says nothing about the spread the likelihood will support.</p>
<p><img src="/images/figure/errorcorrmodelpostvsdata-1.png" alt="plot of chunk errorcorrmodelpostvsdata" /></p>
<p>Every data point in the graph above is marked with a red dot. The black lines are sampled from the posterior. These lines should be viewed as alternative realizations of the real unknown (latent) deviance in temperature. This construct allows us to avoid putting undue confidence in the data we measured. In fact, many data sources that are considered absolute are inherently noisy.</p>
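To make that last point concrete, here is a small framework-free sketch of how the latent mean’s uncertainty and the likelihood’s fixed spread combine in the posterior predictive. The numbers are made up for illustration; the 0.05 spread stands in for a fitted q(mu_x), and 0.1 is the likelihood sigma from the model above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior draws of the latent mean mu_x for one year;
# the 0.05 spread stands in for the fitted variational posterior.
mu_samples = rng.normal(loc=0.6, scale=0.05, size=10_000)

# The posterior predictive for x adds the likelihood's own spread
# (sigma_x = 0.1 in the model) on top of the uncertainty in the mean.
x_pred = mu_samples + rng.normal(loc=0.0, scale=0.1, size=mu_samples.shape)

# For independent Gaussians the variances add, so the predictive sd
# is roughly sqrt(0.05**2 + 0.1**2), clearly wider than 0.05.
```

So a small sigma on the latent mean does not imply a narrow predictive distribution for the data itself.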
<h2 id="putting-it-in-a-regression-formulation">Putting it in a regression formulation</h2>
<p>We can of course formulate this as a regression problem where the likelihood is
set up on the error correction itself. This means that when we get noisy
measurements it’s the latent random variable that’s regressed, so neither we nor
the model have to take the data at face value. Remember,
uncertainty is not a bad thing as long as it can be quantified. Mathematically
it looks like this</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
X_t & \sim \mathcal N(\mu_{X,t}, \sigma_X)\\
\mu_{X,t} &\sim \mathcal N(\beta Z_t+\alpha, 0.5)\\
\beta & \sim \mathcal N(0, 1)\\
\alpha & \sim \mathcal N(0, 0.1)\\
\sigma_X &= 0.1
\end{align} %]]></script>
<p>where <script type="math/tex">X_t</script> and <script type="math/tex">Z_t</script> are the deviance in temperature and the number of
passengers transported by airplane in year <script type="math/tex">t</script>, respectively. The other
variables quantify dynamics and uncertainty. The priors are specified quite
broadly to capture a wide spectrum of possibly consistent models. This model can
also be sampled using Edward, but I’ll leave that as an exercise for you to
solve.</p>
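To get you started on that exercise without tying you to a particular Edward version, here is a plain NumPy sketch of drawing temperature curves from the priors above. The standardized passenger series Z is made up for illustration; everything else mirrors the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized passenger numbers Z_t for the 23 years 1993-2015.
Z = np.linspace(-1.5, 1.5, 23)

# Draw from the priors specified above.
n_draws = 5
beta = rng.normal(0.0, 1.0, size=n_draws)    # beta  ~ N(0, 1)
alpha = rng.normal(0.0, 0.1, size=n_draws)   # alpha ~ N(0, 0.1)

# mu_{X,t} ~ N(beta * Z_t + alpha, 0.5): one curve per prior draw
mu = rng.normal(beta[:, None] * Z[None, :] + alpha[:, None], 0.5)

# X_t ~ N(mu_{X,t}, 0.1): prior-predictive temperature deviance curves
X = rng.normal(mu, 0.1)
```

Plotting the rows of X gives you the kind of prior sample spaghetti shown earlier, which is a good sanity check before running any inference.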
<h1 id="conclusions-from-my-simulations">Conclusions from my simulations</h1>
<p>So I played around with Edward, and it took a bit of time to get used to the
Edward way of doing things, mainly because it’s so tightly connected to
TensorFlow. In general I like the idea of Edward and the flexibility in modeling
that it allows for. That being said, I think the language is still a bit
young. The variational inference approach (the only one I used in
this post) works OK, but when your priors are not in the vicinity of the final
solution it rarely finds the correct posterior. There also seem to be
significant problems with varying scales of covariates. As such, you should
probably always work with normalized data when using Edward. To summarize the
recommendations from this post:</p>
<ul>
<li>Always normalize your data when working with Edward</li>
<li>Make sure your priors are in the vicinity of the final solution, i.e., for
simple models you can use a maximum likelihood estimate as a starting point</li>
<li>Never base your final inference on a variational algorithm; Instead always run
a full sampler to verify and obtain the true posterior</li>
<li>Edward is cool and I will keep following it but for now I will stay with Stan
for my probabilistic programming needs</li>
</ul>
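For the first recommendation, a minimal Edward-agnostic standardization sketch in plain NumPy:

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit (population) variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Columns on wildly different scales, as with raw covariates.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
Z = standardize(X)  # every column now has mean 0 and sd 1
```

Remember to apply the same centering and scaling when you transform predictions back to the original units.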
<p>Happy inferencing!</p>Dr. Michael GreenOn the equivalence of Bayesian priors and Ridge regression2017-01-18T00:00:00+00:002017-01-18T00:00:00+00:00/2017/01/18/On-the-equivalence-of-Bayesian-priors-and-Ridge-regression<p>Today I’m going to take you through a comparison of a Bayesian formalism for
regression and Ridge regression, which is a penalized version of
OLS. The rationale I have for doing so is that many times in my career I’ve come
across “frequentists” who claim that parameters can be controlled via a process
called shrinkage, regularization, weight decay, or weight elimination, depending
on whether you’re using GLMs, SVMs or neural networks. This statement is in
principle correct, if misguided: the regularization can be seen to arise as a
consequence of a probabilistic formulation. I would go so far as to say that
there is no such thing as frequentist statistics; there are only those who
refuse to add prior information to their model! Before we get started I would
like to warn you that this post is going to get a tad mathematical. If that
scares you, you might consider skipping the majority of this post and going
directly to the summary. Now, let’s go!</p>
<h2 id="a-probabilistic-formulation">A probabilistic formulation</h2>
<p>Any regression problem can be expressed as an implementation of a probabilistic
formulation. For instance, what we typically have at hand is a dependent
variable <script type="math/tex">y</script>, a matrix <script type="math/tex">X</script> of covariates and a parameter vector <script type="math/tex">\beta</script>.
The dependent variable consists of data we would like to learn something about
or be able to explain. As such, we wish to model its dynamics via the <script type="math/tex">\beta</script>
through <script type="math/tex">X</script>. The joint probability distribution for these three ingredients is
given simply as <script type="math/tex">p(y, X, \beta)</script>. This is the most general form of
representing a regression problem probabilistically. However, it’s not very
useful, so in order to make it a bit more tangible let’s decompose this joint
probability like this.</p>
<script type="math/tex; mode=display">p(y, X, \beta)=p(\beta\vert y, X)p(y, X)=p(\beta\vert y, X)p(y)p(X)</script>
<p>In this view it is clear that we want to learn something about <script type="math/tex">\beta</script>, since
those are the unknowns. The other parts we have observed data on. So we would like
to say something clever about <script type="math/tex">p(\beta\vert y, X)</script>. How do we go about doing
that? Well, for starters we need to realize that <script type="math/tex">p(y, X, \beta)</script> can actually
be written as</p>
<script type="math/tex; mode=display">p(y, X, \beta)=p(y\vert \beta, X)p(\beta, X)=p(y\vert \beta, X)p(\beta)p(X)</script>
<p>which means that</p>
<script type="math/tex; mode=display">p(\beta\vert y, X)p(y)p(X)=p(y\vert \beta, X)p(\beta)p(X)</script>
<p>and therefore</p>
<script type="math/tex; mode=display">p(\beta\vert y, X)=\frac{p(y\vert \beta, X)p(\beta)}{p(y)}</script>
<p>which is just a derivation of Bayes’ rule. Now we actually have something a bit more useful at our hands which is ready to be interpreted and implemented. What do I mean by implemented? Seems like an odd thing to say about probability distributions, right? As weird as it may seem, we actually haven’t given the probability distributions a concise mathematical representation. This is of course necessary for any kind of inference. So let’s get to it. The first term I would like to describe is the likelihood, i.e., <script type="math/tex">p(y\vert \beta, X)</script>, which describes the likelihood of observing the data given the covariate matrix <script type="math/tex">X</script> and a set of parameters <script type="math/tex">\beta</script>. For simplicity let’s say this probability distribution is Gaussian, thus taking the following form: <script type="math/tex">p(y\vert \beta, X)=\mathcal{N}(y-\beta X; 0, \sigma)</script>. This corresponds to setting up a measurement model <script type="math/tex">y_t = \beta x_t + \epsilon</script> where <script type="math/tex">\epsilon\sim\mathcal{N}(0, \sigma)</script>.</p>
<p>The second term in the numerator on the right hand side is our prior
<script type="math/tex">p(\beta)</script>, which we will also consider Gaussian. Thus, we will set
<script type="math/tex">p(\beta)=\mathcal{N}(0, \alpha I)</script>, indicating that the parameters are
independent from each other and most likely centered around <script type="math/tex">0</script> with a known
standard deviation of <script type="math/tex">\alpha</script>. The last term is the denominator <script type="math/tex">p(y)</script>,
which in this setting functions as the evidence. This is also the normalizing
constant that makes sure that we can interpret the right hand side
probabilistically.</p>
<p>That’s it! We now have the pieces we need to push the inference button. For
more complicated models this is often done by utilizing Markov Chain Monte Carlo
methods to sample the distributions. If we are not interested in the
distribution but only point estimates for the parameters, we can turn
this into an optimization problem instead by realizing that</p>
<script type="math/tex; mode=display">p(\beta\vert y, X)=\frac{p(y\vert \beta, X)p(\beta)}{p(y)}\propto p(y\vert \beta, X)p(\beta)</script>
<p>since <script type="math/tex">p(y)</script> just functions as a normalizing constant and doesn’t change the
location of the <script type="math/tex">\beta</script> that would yield the maximum probability. Thus we can
set up the optimization problem as</p>
<script type="math/tex; mode=display">\mathcal{L}(\beta)=\prod_{t=1}^T \mathcal{N}(y_t-\beta x_t; 0, \sigma)\mathcal{N}(\beta; 0, \alpha I)</script>
<p>and maximize this function. Normally when we solve optimization problems it’s easier and nicer to turn it into a minimization problem instead of a maximization problem. This is easily done by minimizing</p>
<script type="math/tex; mode=display">-\ln \mathcal{L}(\beta)=-\sum_{t=1}^T \ln \mathcal{N}(y_t-\beta x_t; 0, \sigma)- \ln\mathcal{N}(\beta; 0, \alpha I)</script>
<p>as opposed to the equation before. For the sake of clarity let’s assume from now on that we only have one independent variable and only one parameter <script type="math/tex">\beta</script>. Since we know that</p>
<script type="math/tex; mode=display">\mathcal{N}(x;\mu, \sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)</script>
<p>we can easily unfold the logarithms to reveal</p>
<script type="math/tex; mode=display">-\ln \mathcal{L}(\beta)=-\sum_{t=1}^T\left( -C_1-\frac{\left(y_t-\beta x_t- 0\right)^2}{2\sigma^2}\right) - C_2 + \frac{(\beta-0)^2}{2\alpha^2}</script>
<p>which can be more nicely written as</p>
<script type="math/tex; mode=display">-\ln \mathcal{L}(\beta)=\sum_{t=1}^T\frac{\left(y_t-\beta x_t\right)^2}{2\sigma^2} + \frac{\beta^2}{2\alpha^2} + C</script>
<p>where <script type="math/tex">C=TC_1 - C_2</script>. As such, putting a Gaussian prior on your <script type="math/tex">\beta</script> is
equivalent to penalizing solutions that differ from <script type="math/tex">0</script> by a factor of
<script type="math/tex">1/(2\alpha^2)</script>, i.e., 1 divided by two times the variance of the Gaussian.
Thus, the smaller the variance, the higher our prior confidence is that the
solution should be close to zero. The larger the variance, the more uncertain we
are about where the solution should end up.</p>
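This identity is easy to check numerically: the negative log posterior and the penalized sum of squares differ only by a constant in beta, so they share the same minimizer. A sketch with simulated data and arbitrarily chosen sigma and alpha:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.3, size=50)
sigma, alpha = 0.3, 1.0

def log_normal_pdf(z, sd):
    # log of N(z; 0, sd)
    return -0.5 * np.log(2 * np.pi * sd**2) - z**2 / (2 * sd**2)

def neg_log_posterior(beta):
    # -sum_t log N(y_t - beta x_t; 0, sigma) - log N(beta; 0, alpha)
    return -(log_normal_pdf(y - beta * x, sigma).sum()
             + log_normal_pdf(beta, alpha))

def penalized_sse(beta):
    # sum_t (y_t - beta x_t)^2 / (2 sigma^2) + beta^2 / (2 alpha^2)
    return ((y - beta * x) ** 2).sum() / (2 * sigma**2) \
        + beta**2 / (2 * alpha**2)

# The difference between the two objectives is the same constant C
# for every beta, so minimizing either gives the same answer.
diffs = [neg_log_posterior(b) - penalized_sse(b) for b in (-1.0, 0.0, 0.5, 2.0)]
```

Evaluating `diffs` shows the gap is identical (up to floating point noise) at every beta.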
<h2 id="ridge-regression">Ridge regression</h2>
<p>The problem of regression can be formulated differently than we did previously,
i.e., we don’t need to formulate it probabilistically. In essence, what we could
do is state that we have a system of equations whose squared residuals we would
like to minimize, like this</p>
<script type="math/tex; mode=display">\Vert X\beta-y\Vert^2</script>
<p>where the variables and parameters have the same interpretation as before. This
is basically Ordinary Least Squares (OLS) which suffers from overfitting and
sensitivity to outliers and multicollinearity. So what Ridge regression does is
to introduce a penalty term to this set of equations like this</p>
<script type="math/tex; mode=display">\Vert X\beta-y\Vert^2+\Vert \Gamma\beta\Vert^2</script>
<p>where <script type="math/tex">\Gamma</script> is typically chosen to be <script type="math/tex">\gamma I</script>. This means that all
values in the parameter vector <script type="math/tex">\beta</script> should be close to 0. Continuing along
this track we can select a dumbed down version of this equation to show what’s
going on for a simple application of one variable <script type="math/tex">x</script> and one parameter
<script type="math/tex">\beta</script>. In this case</p>
<script type="math/tex; mode=display">\Vert X\beta-y\Vert^2+\Vert \gamma I\beta\Vert^2</script>
<p>turns into</p>
<script type="math/tex; mode=display">\sum_{t=1}^T(y_t-\beta x_t)^2+\gamma^2\beta^2</script>
<p>which you may recognize from before. Not convinced? Well let’s look into the differences.</p>
<table>
<thead>
<tr>
<th>Probabilistic formulation</th>
<th>Ridge regression</th>
</tr>
</thead>
<tbody>
<tr>
<td><script type="math/tex">\sum_{t=1}^T\frac{\left(y_t-\beta x_t\right)^2}{2\sigma^2} + \frac{\beta^2}{2\alpha^2} + C</script></td>
<td><script type="math/tex">\sum_{t=1}^T(y_t-\beta x_t)^2+\gamma^2\beta^2</script></td>
</tr>
</tbody>
</table>
<p>Here it’s pretty obvious to see that they are equivalent. The constant <script type="math/tex">C</script> plays no role in the minimization of these expressions, and multiplying through by the constant <script type="math/tex">2\sigma^2</script> doesn’t move the minimizer either. Thus if we set <script type="math/tex">\lambda=\gamma^2=\sigma^2/\alpha^2</script> the equivalence is clear and we see that what we are really minimizing is</p>
<script type="math/tex; mode=display">\sum_{t=1}^T(y_t-\beta x_t)^2+\lambda\beta^2</script>
<p>which concludes my point.</p>
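For a single covariate both objectives are quadratic in beta, so each has a closed-form minimizer and the equivalence can be verified directly. A sketch with simulated data, using the penalty that absorbs the common factor of two sigma squared:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(scale=0.5, size=100)

sigma, alpha = 0.5, 0.8
lam = sigma**2 / alpha**2  # penalty that makes the two objectives match

# Ridge: argmin sum_t (y_t - beta x_t)^2 + lam * beta^2
beta_ridge = (x @ y) / (x @ x + lam)

# Bayesian MAP: argmin sum_t (y_t - beta x_t)^2 / (2 sigma^2)
#                      + beta^2 / (2 alpha^2)
# Setting the derivative to zero gives the same expression.
beta_map = (x @ y / sigma**2) / (x @ x / sigma**2 + 1 / alpha**2)
```

Both estimates agree to machine precision, which is the whole point of the derivation above.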
<h2 id="summary">Summary</h2>
<p>So I’ve just shown that ridge regression and a Bayesian formulation with
Gaussian priors on the parameters are in fact equivalent mathematically and
numerically. One big question remains; Why the hell would anyone in their right
mind use a probabilistic formulation for something as simple as penalized OLS?
The answer you are looking for here is “freedom”. What if we would have selected
a different likelihood? How about a different prior? All of this would have
changed and we would have ended up with a different problem. The benefit of the
probabilistic approach is that it is agnostic with respect to which
distributions you choose. It’s a consistent inferential framework that just
allows you the freedom to model things as you see fit. Ridge regression has
already made all the model choices for you which is convenient but hardly
universal.</p>
<p>My point is this: whatever model you decide to use and however you wish to model
it is your prerogative. Embrace this freedom and don’t let old school convenient
tools dictate your way towards solving a specific problem. Be creative, be free
and most of all: be honest to yourself.</p>
<p>Happy inferencing!</p>Dr. Michael Green