In the last post we had a simple stepping algorithm, and a gradient descent implementation, for fitting a line to a set of points with one variable and one ‘outcome’. As I mentioned though, it’s fairly straightforward to extend that to multiple variables, and even to curves, rather than just straight lines.

For this example I’ve reorganised the code slightly into a class to make life a little easier, but the main changes are just the hypothesis and learn functions. For the hypothesis, we just need to calculate our output value based on all of the parameters, so we turn our calculation into a loop:
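A minimal sketch of that loop might look like the following. It’s written in Python for brevity rather than taken from the class described above, and the names `hypothesis`, `parameters` and `row` are assumptions for illustration. The first parameter is treated as the intercept, and each later parameter is paired with one x value from the example.

```python
def hypothesis(parameters, row):
    """Predict an output for one example row.

    parameters[0] is the intercept; parameters[i + 1] is the weight
    applied to row[i]. (Naming and layout are illustrative assumptions.)
    """
    score = parameters[0]
    for i, x in enumerate(row):
        score += parameters[i + 1] * x
    return score


# Intercept 1, weights 2 and 3, example with x0 = 4 and x1 = 5:
# 1 + 2*4 + 3*5
print(hypothesis([1.0, 2.0, 3.0], [4.0, 5.0]))
```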

Similarly, we need to update our learning function. In the single-parameter case we were updating our intercept based on the score, and then multiplying the score by the example row’s x value to update our gradient (the second parameter). We’re going to do the same here, updating each parameter after the first with the score multiplied by the matching x value from the example: parameter1 uses x0, parameter2 uses x1, and so on.
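As a rough Python sketch of that update (not the original class method): the function names are hypothetical, and it assumes a batch update where the score for each example is the prediction minus the actual value, averaged over all examples before the parameters change.

```python
def hypothesis(parameters, row):
    # parameters[0] is the intercept; parameters[i + 1] pairs with row[i].
    score = parameters[0]
    for i, x in enumerate(row):
        score += parameters[i + 1] * x
    return score


def learn(parameters, data, targets, learning_rate):
    """One gradient descent step over the whole data set (assumed batch form)."""
    m = len(data)
    gradients = [0.0] * len(parameters)
    for row, target in zip(data, targets):
        error = hypothesis(parameters, row) - target
        gradients[0] += error              # intercept: just the error
        for i, x in enumerate(row):
            gradients[i + 1] += error * x  # parameter i+1: error times x_i
    # Move every parameter against its averaged gradient at the same time.
    return [p - learning_rate * g / m for p, g in zip(parameters, gradients)]


data = [[1.0], [2.0]]
targets = [2.0, 4.0]  # generated from y = 2x
print(learn([0.0, 0.0], data, targets, 0.1))
```

All the gradients are computed before any parameter is changed, so every parameter is updated from the same set of predictions.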

The rest stays pretty much the same - each step we update the parameters in the learn function, and check that our score is reducing properly. However, one of the other issues we didn’t look at very much in the previous post was what learning rate to choose, and how many iterations to limit ourselves to. These are often hard to say precisely, but it can help to plot the change in the score for different values on a graph. We can do that for some data with our class pretty easily.
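One way to gather the numbers for such a graph is sketched below in Python. The data is made up, the helper names are hypothetical, and the score is assumed to be the halved mean of the squared errors. It records the score after every iteration for a few learning rates, which can then be fed to whatever plotting tool is to hand.

```python
def hypothesis(parameters, row):
    score = parameters[0]
    for i, x in enumerate(row):
        score += parameters[i + 1] * x
    return score


def score(parameters, data, targets):
    # Assumed cost: halved mean of squared errors.
    total = 0.0
    for row, target in zip(data, targets):
        total += (hypothesis(parameters, row) - target) ** 2
    return total / (2 * len(data))


def learn(parameters, data, targets, learning_rate):
    m = len(data)
    gradients = [0.0] * len(parameters)
    for row, target in zip(data, targets):
        error = hypothesis(parameters, row) - target
        gradients[0] += error
        for i, x in enumerate(row):
            gradients[i + 1] += error * x
    return [p - learning_rate * g / m for p, g in zip(parameters, gradients)]


def score_history(data, targets, learning_rate, iterations):
    """Run gradient descent from zeroed parameters, recording each score."""
    parameters = [0.0] * (len(data[0]) + 1)
    history = []
    for _ in range(iterations):
        parameters = learn(parameters, data, targets, learning_rate)
        history.append(score(parameters, data, targets))
    return history


# Made-up, already scaled data generated from y = 2x + 1.
data = [[-0.5], [-0.25], [0.0], [0.25], [0.5]]
targets = [2.0 * row[0] + 1.0 for row in data]

for rate in (0.01, 0.1, 1.0):
    history = score_history(data, targets, rate, 50)
    print(rate, history[0], history[-1])
```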

Plotting the output gives us a graph like this:

We can see from the graph that the error rate is reducing well with 0.1, so that would be a good starting point for this data. However, if the learning rate is too high, the graph may bounce around, or not converge as well, which would be a sign to reduce it.

One other new thing we’re doing in this code is trying to standardise the data a bit by scaling it between -0.5 and 0.5. To do that, we calculate the min and max of any given feature (so e.g. the min and max of the x1 values) to give us the range of values. We also calculate the average for each feature.

For each value we then subtract the average (giving the feature a nice mean of 0) and divide by the difference between min and max, which squeezes the values into a range one unit wide - roughly -0.5 to 0.5, and exactly that when the average sits midway between min and max. This allows us to treat data at different scales similarly, which tends to be a win for the algorithm (though not always, so your mileage may vary), as in real-world problems we’ll often be dealing with data that has entries at very diverse ranges. For example, if we’re trying to estimate the value of companies based on the number of employees and number of office locations, we could well be talking about numbers in the 1000s for employees but 10s for offices. Gradient descent will still work without feature scaling, but it’ll take longer.
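The scaling step could be sketched in Python along these lines. The function name and return shape are assumptions, and the employee and office figures are invented to echo the example above.

```python
def scale_features(data):
    """Mean-normalise each feature column: (value - mean) / (max - min).

    Returns the scaled rows plus the means and ranges, which are needed
    to scale any new example the same way before predicting.
    """
    columns = list(zip(*data))
    means = [sum(col) / len(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    scaled = []
    for row in data:
        scaled.append([
            # Guard against a constant column, whose range would be zero.
            (x - mean) / (rng if rng else 1.0)
            for x, mean, rng in zip(row, means, ranges)
        ])
    return scaled, means, ranges


# Invented rows: [employees, office locations]
companies = [[1000.0, 10.0], [2000.0, 20.0], [3000.0, 30.0]]
scaled, means, ranges = scale_features(companies)
print(scaled)
```

Any example passed to the hypothesis later has to be scaled with the same means and ranges, otherwise the learned parameters won’t match the input.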

One other interesting thing we can do once we have this kind of setup is to extend it to lines (or surfaces in the > 2D cases) that are not straight, using polynomials (anything with an x² or higher power in it). All we have to do for that is update our hypothesis and learn functions again so that we calculate the hypothesis with the right powers on the variables, and update our learn function to match.

If you recall from the previous article, the learning function was updating based on the partial derivative of our ‘cost’ function (the sum of the squared errors between prediction and actual values) - we’re doing the same here. For the linear case we had to multiply the score by the value of xn, so parameter1 was updated by the score multiplied by x0, and so on. In this case we need to make sure we include the power, so if our hypothesis is y = parameter0 + parameter1 * x0² then we need to multiply the score by x0² when updating parameter1 in the learning function.
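A Python sketch of the polynomial versions follows, under the assumption that each x value carries a single exponent supplied in a hypothetical `powers` list. To use both x0 and x0² as terms, the same column would be repeated in the data with exponents 1 and 2.

```python
def hypothesis(parameters, row, powers):
    # parameters[0] is the intercept; parameters[i + 1] multiplies row[i] ** powers[i].
    score = parameters[0]
    for i, x in enumerate(row):
        score += parameters[i + 1] * x ** powers[i]
    return score


def learn(parameters, data, targets, powers, learning_rate):
    """One batch step; the gradient for each parameter uses the powered x value."""
    m = len(data)
    gradients = [0.0] * len(parameters)
    for row, target in zip(data, targets):
        error = hypothesis(parameters, row, powers) - target
        gradients[0] += error
        for i, x in enumerate(row):
            gradients[i + 1] += error * x ** powers[i]
    return [p - learning_rate * g / m for p, g in zip(parameters, gradients)]


# Made-up data from y = 1 + 3 * x0^2
data = [[1.0], [2.0]]
targets = [4.0, 13.0]
print(learn([0.0, 0.0], data, targets, [2], 0.1))
```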

We can then call that function exactly the same way. As the ml-class lectures point out, there is a quicker way of getting this for many cases (ones with reasonable numbers of features, or xn values), which is called the normal equation - but that involves some linear algebra in calculating the inverse of a matrix, something that there isn’t a library for in PHP (as far as I’m aware), and hence a bit of a pain to write. If you are interested in solving these types of problems, though, it’s worth looking into.
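For the curious, here is a small Python sketch of the normal equation, parameters = (XᵀX)⁻¹ Xᵀy. It swaps the explicit matrix inverse for Gaussian elimination on the equivalent system (XᵀX) · parameters = Xᵀy, which gives the same answer without needing a matrix library. The function name and the data are made up for illustration.

```python
def normal_equation(data, targets):
    """Solve (X^T X) * parameters = X^T y by Gauss-Jordan elimination."""
    # Prepend 1.0 to each row so the first parameter acts as the intercept.
    X = [[1.0] + list(row) for row in data]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    M = []
    for i in range(n):
        line = [sum(r[i] * r[j] for r in X) for j in range(n)]
        line.append(sum(r[i] * t for r, t in zip(X, targets)))
        M.append(line)
    for col in range(n):
        # Partial pivoting: use the row with the largest value in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; some features may be redundant")
        M[col], M[pivot] = M[pivot], M[col]
        # Clear this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Made-up data generated from y = 1 + 2*x0 + 3*x1
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
targets = [9.0, 8.0, 22.0, 18.0]
print(normal_equation(data, targets))
```

There is no learning rate or iteration count to tune here, but the cost of solving the system grows quickly with the number of features, which is why it suits problems with a reasonable number of them.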