Scientific Computing Using Python - PHYS:4905 - Fall 2018

Lectures #17 (10/25), #18 (10/30) - Prof. Kaaret

Homework #12 is at the bottom of the page.


Lecture #17

Fitting more complicated curves to data

Linear curve fitting as described in lectures 14-17 is extremely useful and has many advantages, such as the fact that it is always possible to find the best fit of the model parameters for a given data set.  However, many models are more complex than a linear relation, or than a relation like an exponential that can be transformed to a linear relation.  It is possible to extend the linear curve fitting routines to polynomials, but polynomials still don't cover the entire range of possible functions.

To fit an arbitrary function to a set of data, we use a technique called non-linear least squares fitting.  The basic inputs are the same as for linear least squares fitting.  In particular, we take a set of data points, with each data point specified by a value of the independent variable x_i, a measured value y_i, and an uncertainty σ_i on the measured value.
We also use chi-squared as our metric of goodness-of-fit and attempt to minimize the value of chi-squared to find the best fit.

\chi^2 = \sum_i \left[ \frac{y_i - y(x_i)}{\sigma_i} \right]^2
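For concreteness, here is a minimal sketch of how chi-squared could be computed in Python; the data arrays and the simple straight-line model used here are made up purely for illustration.

import numpy as np

def chi_squared(y, y_model, sigma):
    # sum of squared, uncertainty-weighted residuals
    return np.sum(((y - y_model) / sigma)**2)

# made-up example data and a simple linear model y(x) = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
sigma = np.array([0.2, 0.2, 0.3, 0.3])
print(chi_squared(y, 2*x + 1, sigma))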
However, the technique to find the lowest chi-squared is very different.  In linear least squares, we calculated a few sums and then combined those sums to find the best estimates of the linear parameters and also the uncertainty in those parameters.  The process is linear (in terms of coding) and always produces the guaranteed best estimate of the model parameters.

In non-linear least squares, we make a guess at the parameters and then iteratively adjust the parameters to decrease the chi-squared until we find that we can't get any lower.  The process can be thought of as hiking around a mountainous landscape trying to find the lowest point.  Indeed, if we have a model with two parameters (a,b), then we could calculate chi-squared at each point (a,b) and make a topographical map where chi-squared plays the role of elevation and (a,b) plays the role of (longitude, latitude).  The 3D plot below gives an example.  Finding the best fit parameters in non-linear least squares is then equivalent to finding the lowest point in the map.  This idea, but with the map flipped over so one tries to find the peak rather than the valley, is called the "fitness landscape" in evolutionary biology.  That term has carried over to many other fields.



This idea of searching around a landscape for the lowest point is very useful in thinking about non-linear least squares, and it is helpful to keep this geometric picture in mind when thinking about algorithms.  The key part of our "algorithm" for non-linear least squares is to "iteratively adjust the parameters to decrease the chi-squared".  In the landscape analogy, this "algorithm" is basically "go downhill".  There are lots of different ways to find the direction in which to go downhill fastest, and we'll discuss a few of them.

The landscape picture also highlights some issues with non-linear least squares.  The example landscape shown above is quite simple and we will likely reach the minimum of χ2 regardless of where we start our search.  But how about the landscape shown below?


This landscape has several local minima in addition to one global minimum.  Our algorithm to "always go downhill" can get us stuck in a local minimum, preventing us from ever reaching the global minimum.  Also, the particular local minimum that we find will depend on where we start our search.  This is an important and generic problem in non-linear least squares curve fitting.  The only surefire way to know that you have reached the global minimum is to try every possible set of the parameters.  This is sometimes feasible if you have two or a few parameters, but becomes infeasible for models with a large number of parameters.  Often you have to check your fits by using several different starting points, or do searches where you vary a particular parameter across some range of values, even if the first few values actually increase the chi-squared.
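One brute-force safeguard is to repeat the fit from several different starting points and keep whichever result gives the lowest chi-squared.  A minimal sketch, assuming a generic model function and placeholder data arrays, might look like this:

import numpy as np
from scipy.optimize import curve_fit

def fit_from_many_starts(model, x, y, sigma, starting_points):
    """Run curve_fit from several initial guesses and keep the best result."""
    best = None
    for p0 in starting_points:
        try:
            popt, pcov = curve_fit(model, x, y, p0=p0, sigma=sigma)
        except RuntimeError:
            continue  # this start failed to converge; try the next one
        chi2_val = np.sum(((y - model(x, *popt)) / sigma)**2)
        if best is None or chi2_val < best[0]:
            best = (chi2_val, popt, pcov)
    return best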

An application to X-ray polarimetry

Polarization necessarily reveals information about the geometry of a system because it is a vector quantity.  In astrophysics, measurement of X-ray polarization has the potential to reveal the geometry of systems, like the accretion disks around black holes or the magnetic fields of neutron stars, that cannot be resolved with imaging.  The University of Iowa has contributed to studies of several satellites designed to measure astrophysical polarization. 

During one such study, we measured the polarization of an X-ray source to be used for calibration of the X-ray polarimeter on-orbit, the Modulated X-ray Source (MXS) engineering test unit number 3 (MXS #3) built by NASA's Goddard Space Flight Center (GSFC).  The MXS produces X-rays by accelerating electrons to kilovolt energies and flinging them at a titanium target.  Titanium (Ti) has a characteristic X-ray line, the Ti K-alpha line at 4.51 keV, produced when electrons in its inner shell are excited.  Since we were intending to use the MXS to calibrate a polarimeter, we needed to measure the polarization of the X-rays it produces.

At Iowa, we constructed a "Bragg X-ray polarimeter" consisting of a silicon crystal used to reflect X-rays and an X-ray detector.  X-rays generally don't reflect well because their wavelengths, on the order of Angstroms, are too close to the typical atomic spacing in solid materials.  Bragg reflection offers a means to efficiently reflect X-rays within a narrow energy range.  You can read more about it at https://en.wikipedia.org/wiki/Bragg%27s_law.  Bragg reflection is useful for polarimetry because the Bragg reflector acts as a polarization analyzer if the reflection angle is close to 90 degrees.  By this, we mean that an X-ray with its electric field vector perpendicular to the plane of reflection will be efficiently reflected, while an X-ray with its electric field vector in the plane of reflection will not be reflected.


We can build a polarimeter using a Bragg reflection crystal and an X-ray detector, as shown in the figure above.  We must rotate the Bragg reflector and the detector around the axis defined by the incoming X-rays.  By measuring the count rate versus rotation angle, we can determine the polarization of the X-ray beam.  If the X-rays are highly polarized, then the detector will record a high count rate when the plane of reflection is perpendicular to the X-ray electric field vector and a near zero count rate when the polarimeter is rotated by 90 degrees.  The plot of count rate versus rotation angle is called a rotation curve, as shown in the figure below.  Note that polarization is symmetric under a rotation by 180 degrees, so we include only rotation angles between 0 and 180 degrees.


 We describe the modulation curve using an equation of the form

S(\phi) = A + B \cos^2(\phi - \phi_0)
The polarization position angle ϕ0 is the angle at which the maximum intensity is recorded, A describes the unpolarized component of the intensity, and B describes the polarized intensity. The modulation amplitude is

a = \frac{S_{max} - S_{min}}{S_{max} + S_{min}} = \frac{B}{2A + B}
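The second equality follows directly from the form of S(ϕ): the cos² term runs from 1 (at ϕ = ϕ0) down to 0 (at ϕ = ϕ0 ± 90°), so

S_{max} = A + B, \qquad S_{min} = A, \qquad a = \frac{(A+B) - A}{(A+B) + A} = \frac{B}{2A + B}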
In Bragg polarimeters, the modulation amplitude is a good estimate of the polarization fraction of the X-ray beam.

Polarized Python

We'll now figure out how to fit the data in the figure above to the equation for the modulation curve using Python.  The data are available in the file mxs_polarization.txt.  The file contains a few comment lines at the beginning that begin with # and then has rows with three values corresponding to the rotation angle, measured count rate, and uncertainty on the count rate.  X-ray generation is a Poisson process like radioactive decay, and the uncertainties were calculated using Poisson statistics.
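One simple way to read such a file is np.loadtxt, which skips lines beginning with # by default; the column order assumed below matches the description above, and the variable names are just placeholders.

import numpy as np

# read rotation angle (degrees), count rate, and uncertainty on the count rate
phi, rate, rate_err = np.loadtxt('mxs_polarization.txt', unpack=True)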

The program plotrot.py reads in the data, plots it, fits a constant, and then fits a modulation curve using the form given above and plots that.  Much of the code should be familiar by now.  The new stuff is the non-linear least squares fitting routine and the supporting code.  We use the curve_fit routine in the scipy optimize library.  This routine takes as input the model function to fit (a 'callable'), the array of independent variable values, the array of measured values, an initial guess for the model parameters (p0), and the uncertainties on the measured values (sigma).
There are a bunch of other parameters for curve_fit.  Most of these are set to reasonable default values.  We'll discuss how to change some of them in the next lecture.

To use curve_fit, we need to define a 'callable function'.  In our case, we want to use the modulation curve function as described above.  Defining the function in a form compatible with curve_fit (independent variable first, followed by the model parameters given individually), we get

# define function used to fit modulation curve
# (numpy is assumed to have been imported as np at the top of plotrot.py)
def func_cos2t(phi, a, b, phi0):
  print('In func_cos2t a, b, phi0 =', a, b, phi0)
  return a + b*np.cos((phi-phi0)*np.pi/180)**2

The print statement lets us look into the workings of curve_fit.  You'll probably want to comment it out once you're done with the exercises below.

Load plotrot.py into your computer and have a look at it.  The only uncommented line that sets the variable pinit should be

pinit = [1.0, 1.0, 1.0] # the default initial guess


If you don't supply curve_fit with an initial guess, it will try setting all the parameters to 1.  Remember to put mxs_polarization.txt into the same directory and then try running plotrot.py.  You should get a screenful of lines from the print statement in func_cos2t, a plot with the data and the best fitted model, and a printout of the best fitted parameters.
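For reference, the heart of such a fit might look like the sketch below; the data array names come from the reading sketch above, and the exact call inside plotrot.py may differ.

import numpy as np
from scipy.optimize import curve_fit

# popt holds the best-fit values of (a, b, phi0); pcov is their covariance matrix
popt, pcov = curve_fit(func_cos2t, phi, rate, p0=pinit, sigma=rate_err)
perr = np.sqrt(np.diag(pcov))   # rough 1-sigma uncertainties on the parameters
print('best-fit parameters:', popt)
print('parameter uncertainties:', perr)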

We can get some insight into how the Levenberg–Marquardt algorithm works by looking at the output from the print statement in func_cos2t.  One way to "iteratively adjust the parameters to decrease the chi-squared" is a method called "gradient descent".  The method is just what it sounds like.  Given that you are at some position (a,b), you find the direction in which χ2 decreases fastest.  This is the direction of the negative of the gradient of χ2.  This is what the first few calls to func_cos2t are used to do.  The first call to func_cos2t is, of course, with the initial parameters.  It is a mystery why it gets called three times with the same set of parameters.  Then func_cos2t is called with a small delta to each parameter in turn.  (The deltas are equal to the current value of the parameter multiplied by the parameter diff_step, which can be set when calling curve_fit, but usually the default value is good enough unless your chi-squared landscape has flat parts.)  By looking at Δχ²/Δa, Δχ²/Δb, and Δχ²/Δϕ0, one can estimate the gradient.  curve_fit takes a step in that direction with a magnitude equal to its best guess of where the minimum will be.  It then makes four calls to func_cos2t, one at the new search point and three offset by small deltas to each parameter in turn.  Again, curve_fit is calculating a numerical approximation to the gradient.  It then takes another step and repeats the process.
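To make the idea of the "small deltas" concrete, here is a rough sketch of a finite-difference estimate of the gradient of chi-squared; this illustrates the principle and is not the actual internal code of curve_fit, and the model and data arrays are placeholders.

import numpy as np

def chi2(params, model, x, y, sigma):
    return np.sum(((y - model(x, *params)) / sigma)**2)

def chi2_gradient(params, model, x, y, sigma, diff_step=1e-8):
    """Estimate the gradient of chi-squared by finite differences."""
    params = np.asarray(params, dtype=float)
    base = chi2(params, model, x, y, sigma)
    grad = np.zeros_like(params)
    for i in range(len(params)):
        # offset each parameter in turn by a small delta
        delta = diff_step * params[i] if params[i] != 0 else diff_step
        shifted = params.copy()
        shifted[i] += delta
        grad[i] = (chi2(shifted, model, x, y, sigma) - base) / delta
    return grad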

The curve_fit routine stops guessing when one of three conditions occurs: the chi-squared value (or "cost function") has a fractional change between steps of less than the parameter ftol, the independent variables have a fractional change less than xtol, or the gradient changes by less than gtol.  All of these parameters have a default value of 1E-8 and can be set as optional parameters when you call curve_fit.  In looking at the output, you might note that the last two search points are very close together and that curve_fit needed to do only 8 evaluations of the gradient to get there.
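These tolerances can be passed to curve_fit, which hands them on to the underlying least-squares solver; a sketch, using the same placeholder data names as above:

from scipy.optimize import curve_fit

# tighter convergence tolerances, passed through to the underlying solver
popt, pcov = curve_fit(func_cos2t, phi, rate, p0=pinit, sigma=rate_err,
                       ftol=1e-10, xtol=1e-10, gtol=1e-10)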

Note that we have not discussed how the Levenberg–Marquardt algorithm decides how far each step should be.  It does this by making a linear approximation to the chi-squared landscape at each search point and employing a "damping factor" to adjust the length of the step.  Marquardt's paper introducing the technique or the Wikipedia page on the Levenberg–Marquardt algorithm are good references.

For the remainder of the class, you should try different initial search points.  You can uncomment the different lines that set pinit in the code and write your own.  Why do different starting points produce different best fitted parameters?  Given the symmetries of the modulation curve function, what is true about the different best fitted parameters?

Also, the modulation is calculated from the best fit values for A and B, but it doesn't include an estimate of the error.  Work out the error on the modulation using propagation of errors and modify the print statement to also print out the error on the modulation.
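As a reminder, the general propagation-of-errors formula for a quantity f(A, B) computed from two fitted parameters is

\sigma_f^2 = \left(\frac{\partial f}{\partial A}\right)^2 \sigma_A^2 + \left(\frac{\partial f}{\partial B}\right)^2 \sigma_B^2 + 2\,\frac{\partial f}{\partial A}\,\frac{\partial f}{\partial B}\,\mathrm{cov}(A,B)

The variances and the covariance of A and B can be read off the covariance matrix pcov returned by curve_fit; if A and B are uncorrelated, the last term vanishes.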

Lecture #18

Uncertainty in Fitting

Let's look at the parameter error estimates.  With two or more parameters, there may be correlations between the fitted parameters.  The plot below (from Numerical Recipes by Press, Teukolsky, Vetterling, and Flannery) shows many repeated measurements of a pair of values.  The values are partially correlated, as shown by the bias arrow.  To derive 1-sigma error bars, we wish to draw intervals that contain 68% of the measurements.  There are a few different ways we might interpret this.  We could consider measurements of a1 alone.  In that case, we would draw upper and lower bounds that include 68% of the measurements of a1 and get the 68% confidence interval marked with a vertical arrow.  We could do the same for a0.  Or, we could consider the points together and draw an ellipse that contains 68% of the points.  That is marked as the joint confidence region.  Note that the maximum value of a1 inside the 68% joint confidence region is larger than the upper end of the 68% confidence interval on a1 alone.




The same considerations apply when we are fitting model parameters to a set of data.  We may wish to examine the fitted parameters individually or in groups.  This is a choice of the "parameters of interest" and is somewhat open to interpretation.  Some people feel strongly that all of the fitted parameters are parameters of interest, while others are happy to choose one or a small number of parameters, since they don't care about the others.

To figure out how to define the error range of the fitted parameters, let's go back to the chi-squared landscape.  The best fitted parameters are the ones at the minimum in chi-squared.  It seems sensible that how accurately one can determine the location of the minimum depends on the shape of the landscape around the minimum.  In particular, we define the errors on the fitted parameters using the change in chi-squared relative to the minimum, or Δχ².  Looking at the figure below, if we want to find a 1-sigma error on b, then we start at the minimum and walk in the direction of increasing b until the chi-squared increases by 1.  This gives us the upper bound on b.  To find the lower bound, we go back to the minimum and walk in the direction of decreasing b until the chi-squared increases by 1.  This gives us our upper and lower bounds on b for the case of one parameter of interest.
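A rough sketch of that walk for a single parameter of interest, holding the other parameters fixed at their best-fit values, might look as follows; the model, data arrays, and step size are placeholders.

import numpy as np

def chi2(params, model, x, y, sigma):
    return np.sum(((y - model(x, *params)) / sigma)**2)

def one_sigma_upper(best_params, index, model, x, y, sigma, step=1e-3):
    """Walk parameter number `index` upward from the best fit, holding the
    other parameters fixed, until chi-squared rises by 1 above its minimum."""
    params = np.array(best_params, dtype=float)
    chi2_min = chi2(params, model, x, y, sigma)
    while chi2(params, model, x, y, sigma) < chi2_min + 1.0:
        params[index] += step
    return params[index]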



If we consider both a and b to be of interest, then we need to cover a 2D grid in a and b around the minimum in chi-squared and define a contour where the chi-squared increases by a certain value over the minimum.  The "certain value" depends on your desired confidence level (I recommend power poses to increase your confidence level before talks) and your number of parameters of interest.  The table below lists Δχ² values for a number of different p-values/numbers of sigma for different numbers of parameters of interest.  In general the Δχ² contour could have any shape.  In practice, they tend to look elliptical.  If the major axis of the ellipse is not parallel to one of the parameters, then the parameters are correlated.  If the contour is circular, then the parameters are not correlated.  (If the major axis of the ellipse is along one of the parameters, then you can make the ellipse into a circle by re-scaling one of the parameters.)



Confidence level        Δχ² for number of parameters of interest
σ       p               1       2       3       4
1       68.3%           1.00    2.30    3.53    4.72
1.65    90%             2.71    4.61    6.25    7.78
2       95.4%           4.00    6.17    8.02    9.70
3       99.73%          9.00    11.8    14.2    16.3


Error ellipses in Python

Now let's try this out in Python.  Load the code plotrot2.py, which is the director's cut of plotrot.py with added scenes for plotting error ellipses.  Search for "find errors on parameters B and phi0" which is the start of the new stuff.  The new code finds the joint confidence interval for B and phi0.  We first generate arrays containing trial values for B and phi0 and then create and fill an array with the change in chi-squared relative to the minimum value for each pair of trial values.  Note that you may have to adjust the range of the trial values if you use this code to do other fitting tasks.  Picking the range of values is usually a trial and error process.  You pick a range of values and then check if your delta chi-squared contour of interest lies within the range.  Also, you may want to use a coarser grid when adjusting your range of values and a finer grid in finding your final chi-squared contour.
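A minimal sketch of building such a grid is shown below; it assumes popt, func_cos2t, and the data arrays from the earlier sketches, holds A fixed at its best-fit value for simplicity, and uses arbitrary placeholder ranges and grid sizes, so the details will differ from plotrot2.py.

import numpy as np

def chi2(params, model, x, y, sigma):
    return np.sum(((y - model(x, *params)) / sigma)**2)

# trial grids around the best-fit values popt = [A, B, phi0];
# the ranges and grid sizes below are arbitrary placeholders
b_trial = np.linspace(0.5*popt[1], 1.5*popt[1], 100)
phi0_trial = np.linspace(popt[2] - 20.0, popt[2] + 20.0, 100)

chi2_min = chi2(popt, func_cos2t, phi, rate, rate_err)
dchi2 = np.zeros((len(phi0_trial), len(b_trial)))
for i, p in enumerate(phi0_trial):
    for j, b in enumerate(b_trial):
        # hold A at its best-fit value, vary B and phi0
        dchi2[i, j] = chi2([popt[0], b, p], func_cos2t, phi, rate, rate_err) - chi2_min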

The next block of code generates a contour plot using the arrays of trial values and delta chi-squared.  The contour levels correspond to 1, 2 and 3 sigma for two parameters of interest.
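With the grid in hand, drawing the contours is a one-liner in matplotlib; the Δχ² levels 2.30, 6.17, and 11.8 are the 1, 2, and 3 sigma values for two parameters of interest from the table above (again a sketch, not the exact code in plotrot2.py).

import matplotlib.pyplot as plt

# 1, 2, and 3 sigma contours for two parameters of interest
plt.contour(b_trial, phi0_trial, dchi2, levels=[2.30, 6.17, 11.8])
plt.xlabel('B')
plt.ylabel('phi0 (degrees)')
plt.show()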

The following block of code finds the edges of the error ellipse in the contour plot.  Note that it is currently set to give 1-sigma or 68% confidence intervals.  The block also prints the best fitted value along with the upper and lower error bars for each parameter of interest.  The final block plots those bounds on the contour plot.  Note that if the bounds plotted as dotted lines don't line up with the contours, your grid is probably too coarse.

After running the code, look at the error ranges calculated from the chi-squared search and compare them with the estimates from curve_fit.  Note that a direct search of the chi-squared landscape is always a better way to estimate the uncertainties on your fitted parameters.

Doing things in a linear fashion

It is possible to describe the modulation curve as the sum of a constant, a sine function, and a cosine function,

S(\phi) = I + Q \, \cos(2\phi) + U \, \sin(2\phi)
This relates the modulation curve to the "Stokes  parameters" of the X-ray beam.  For more details see the chapter on X-Ray Polarimetry in The WSPC Handbook of Astronomical Instrumentation, David Burrows (ed.), published by the World Scientific Publishing Company.

This transforms our non-linear problem into a linear one.  Also, it provides a nice example of a linear space.  We can define a linear operator such that

L \,\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = 1, \qquad L \,\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \cos(2\phi), \qquad \text{and} \qquad L \,\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \sin(2\phi).


The linear space ℝ^3, defined as points (I, Q, U) acted on by L, then includes all possible modulation functions described by our function S(ϕ) with the parameters I, Q, and U above.  Writing a program that allows you to input (I, Q, U) and then plot the resulting function may help to conceptualize linear spaces defined on functions.
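A short sketch of such a program, with arbitrary example values of (I, Q, U):

import numpy as np
import matplotlib.pyplot as plt

def stokes_curve(phi_deg, I, Q, U):
    """Modulation curve S(phi) = I + Q*cos(2*phi) + U*sin(2*phi), phi in degrees."""
    phi = np.radians(phi_deg)
    return I + Q*np.cos(2*phi) + U*np.sin(2*phi)

phi_deg = np.linspace(0, 180, 361)
plt.plot(phi_deg, stokes_curve(phi_deg, 1.0, 0.3, 0.2))  # try different (I, Q, U)
plt.xlabel('Rotation angle (degrees)')
plt.ylabel('S')
plt.show()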

Homework #12

Write a program that fits the modulation curve S(ϕ) with the parameters I, Q, and U as defined above to the polarization data from the MXS.  You may modify plotrot2.py or use a linear fitting routine.  Calculate the modulation amplitude and its error in terms of I, Q, and U.  Your code should print out the best fitted modulation amplitude and error and also the best fitted polarization angle and error in a nicely formatted fashion.  It should also plot the polarization data with the best fitted modulation curve and the joint confidence interval for Q and U.

If you are ambitious, you can also try plotting the joint confidence interval for the modulation amplitude and polarization angle.