Transcript for Wednesday 14 November (fit data)

| categories: Transcripts

Today we talked about how to fit equations to data.

# IPython log file

cd Dropbox/116

# first I faked up some noisy data
x = np.random.random_integers(0, 100, 100) / 10.0
y = 50 + 10 * x + x**2 + 5 * np.random.randn(100)

# look at it plotted as points
plot(x, y, '.')
#[Out]# [<matplotlib.lines.Line2D at 0xb0e57ec>]

# just listing the values isn't very illuminating.
x[:5]
#[Out]# array([ 7.2,  1.6,  3.5,  6.7,  4.9])
y[:5]
#[Out]# array([ 173.19489273,   73.59377412,   99.23120207,  159.96533158,
#[Out]#         118.72251928])

# Let's fit a line to the data.
# Our line equation will be of the form y = m * x + b
# where m is the slope and b is the intercept

# We can set this up as a system of N equations
# y[0] = b + m * x[0]
# y[1] = b + m * x[1]
# ...
# y[N-1] = b + m * x[N-1]

# we have tools to solve a matrix equation of the form M * V = Y
# where M is an N by 2 matrix,
# V is a 2 by 1 matrix holding the unknowns b and m,
# and Y is an N by 1 matrix of the measured y values

# To get our system of equations into this form we need a matrix that evaluates b + m * x
# for every x. We get the b term by multiplying b by a column of ones in our matrix.
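
# As a sketch of the idea (not part of the original session), the matrix equation is:
#   [ 1   x[0]   ]             [ y[0]   ]
#   [ 1   x[1]   ]    [ b ]    [ y[1]   ]
#   [ ...        ]  * [ m ]  = [ ...    ]
#   [ 1   x[N-1] ]             [ y[N-1] ]
# each row of M times V reproduces one equation b + m * x[i] = y[i]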

# That matrix is easy to construct with numpy
# I'll transpose it because I want N rows by 2 columns
#  not 2 rows of N columns
M = np.array([np.ones_like(x), x]).T
M.shape
#[Out]# (100, 2)
M[:5]
#[Out]# array([[ 1. ,  7.2],
#[Out]#        [ 1. ,  1.6],
#[Out]#        [ 1. ,  3.5],
#[Out]#        [ 1. ,  6.7],
#[Out]#        [ 1. ,  4.9]])

# numpy includes a function that produces the least-squares solution to a system of
# linear equations, even an overdetermined one like ours (100 equations, 2 unknowns).
# It is careful to maintain the numerical accuracy of the result.

R = np.linalg.lstsq(M, y)

# the result is a tuple: the coefficients, the sum of squared residuals,
# the rank of M, and the singular values of M. The first part is what we want.
R
#[Out]# (array([ 33.36284698,  19.86799503]),
#[Out]#  array([ 6930.03699749]),
#[Out]#  2,
#[Out]#  array([ 60.43050388,   4.5764834 ]))

# extract the first part of the solution and call it fit
fit = R[0]
fit
#[Out]# array([ 33.36284698,  19.86799503])
# it is saying the intercept is about 33.4 and the slope is about 19.9
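
# as a cross-check (a sketch, not run in class): np.polyfit fits the same line,
# returning the coefficients highest degree first, so roughly [19.9, 33.4]
pfit = np.polyfit(x, y, 1)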

# now we can get the estimated y value for each x using this equation
ye = np.dot(M, fit)
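
# equivalently (a sketch, not part of the session): np.dot(M, fit) just evaluates
# b + m * x for every x
ye_alt = fit[0] + fit[1] * x  # same values as ye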

# we plotted it and it looks pretty good, though the data curves away from the line at the ends
plot(x, ye, 'r.')
#[Out]# [<matplotlib.lines.Line2D at 0xb0f78ac>]
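
# the fit was drawn as points; to draw it as a connected line instead we would
# sort by x first (a sketch, not run in class)
order = np.argsort(x)
plot(x[order], ye[order], 'r-')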

# let's look at the error
residual = y - ye
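
# as a check (a sketch, not run in class): the sum of squared residuals should match
# the second value lstsq returned, about 6930
np.sum(residual**2)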

# in a new figure
figure(2)
#[Out]# <matplotlib.figure.Figure at 0xb0f7d2c>
plot(x, residual, '.')
#[Out]# [<matplotlib.lines.Line2D at 0xab5860c>]

# we could see that the residual is clearly curved, not random. Perhaps this function is quadratic

# fitting a quadratic is just as easy: we just add a column of x squared to the matrix
M2 = np.array([np.ones_like(x), x, x**2]).T
M2.shape
#[Out]# (100, 3)

# get the quadratic fit
R2 = np.linalg.lstsq(M2, y)
R2
#[Out]# (array([ 50.66106284,  10.2079896 ,   0.94856269]),
#[Out]#  array([ 2199.3046329]),
#[Out]#  3,
#[Out]#  array([ 464.38897231,   15.38634848,    2.80652423]))
figure(1)
#[Out]# <matplotlib.figure.Figure at 0xb06dd2c>
fit2 = R2[0]

# now we see it is estimating an intercept of 50.7,
#   a slope of 10.2,
#   and a quadratic coefficient of 0.95,
#   which are close to the true values of 50, 10, and 1 used to generate the data.
fit2
#[Out]# array([ 50.66106284,  10.2079896 ,   0.94856269])

# we can get the new estimate
ye2 = np.dot(M2, fit2)

# and plot it to see that it fits better
plot(x, ye2, 'g.')
#[Out]# [<matplotlib.lines.Line2D at 0xb0f13cc>]

# we can also look at the residual
residual2 = y - ye2
figure(2)
#[Out]# <matplotlib.figure.Figure at 0xb0f7d2c>

# which now looks like noise instead of having obvious structure
plot(x, residual2, 'g.')
#[Out]# [<matplotlib.lines.Line2D at 0xab58eec>]
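
# as a check (a sketch, not run in class): the sum of squared residuals dropped
# from about 6930 for the line to about 2199 for the quadratic
np.sum(residual2**2)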

# note that the linear fit's coefficients are not simply the first two coefficients
# of the quadratic fit; adding the x**2 column changes all of them
fit
#[Out]# array([ 33.36284698,  19.86799503])
fit2
#[Out]# array([ 50.66106284,  10.2079896 ,   0.94856269])

# we could fit other types of equations by transforming the data using logarithms.
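
# for example (a sketch, not run in class): to fit y = A * exp(k * x) we could take
# logarithms, log(y) = log(A) + k * x, which is linear in the unknowns log(A) and k,
# and solve it with the same machinery
Mexp = np.array([np.ones_like(x), x]).T
logA, k = np.linalg.lstsq(Mexp, np.log(y))[0]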