Lecture #04

Scientific Computing Using Python - PHYS:4905

Lecture #04 - Prof. Kaaret

Importing, exporting, and plotting data

Code snippets

When you start using libraries to perform tasks, a large part of the effort is reading about the library to figure out what function calls you need to make in order to get it to do what you want. After you do this once, it is a good idea to save that code so that you can copy it, rather than having to figure out everything again. It is even easier, if you let someone else figure out how to interface with the library and then copy their snippets. The web site stackexchange.com is a useful resource as are general web searches. This lecture presents several code snippets that you may find useful.

Importing data

Often, particularly if you are an experimentalist, you'll want to process data produced by some instrument. To do so, you'll need to get those data into Python. That requires code that can read the data files and correctly interpret their format.

Reading text files - A surprising amount of data comes in the form of text files. Text files have the advantage that humans can read them in a text editor, like the one in Spyder, but the disadvantage that they tend to take more disk space than binary format files. Most text data files are organized with each line in the file containing a related set of data. For example, if you have a probe submerged in a river taking repeated measurements, you might have on one line: the time of the measurement, the temperature of the water, and the flow speed of the water. The other lines would have the same format, but, of course, different data.

To read a text data file into Python, you need to know the format. One common format is to have values separated by commas. These are called comma separated values files or csv files and are sometimes given the extension .csv. Here is an example of a csv file containing orbit data for the HaloSat CubeSat, halosat_orbit.csv. The first column is time which is given in TAI seconds since 1 January 2000. International Atomic Time (TAI, from the French name temps atomique international) doesn't include leap seconds. The other three columns give the position of the spacecraft in a reference frame centered on the Earth and rotating with the Earth (called ECEF). The z-axis extends through true north, which is basically the average direction of the Earth's rotational axis, and the x-axis intersects the sphere of the earth at 0° latitude (the equator) and 0° longitude (prime meridian in Greenwich).

The following program will read the file into Python and put the data into four numpy arrays called time, posx, posy, and posz.

# read in halosat_orbit.csv
import numpy as np
f = open('halosat_orbit.csv') # open the text data file
t1, t2, t3, t4 = [], [], [], [] # temporary list to hold data
for line in f : # read each line in the file in turn
print(line)
s1, s2, s3, s4 = line.split(',') # split the input line at the commas
t1.append(float(s1)) # convert the string to a float and add to the list
t2.append(float(s2))
t3.append(float(s3))
t4.append(float(s4))
f.close() # close the text data file
time = np.array(t1) # make numpy arrays from the lists
posx = np.array(t2)
posy = np.array(t3)
posz = np.array(t4)

The second line opens the file with the given name and creates an object that points to the file. The third line makes empty lists to store the data. Why do we use lists rather than numpy arrays at this point?

The forth line sets up a loop that reads each line from the file, one at a time. In the body of the loop, we first print out the line just read in. You might want to comment this out after you know the code works. Then we split the line into smaller strings at the commas; this gives us individual strings for each of the numbers without any commas. Then we convert each string into a float and append it at the end of the appropriate list. After the loop is done, we close the file, which is very important because we won't be able to open the file again until it is closed. Then we convert the lists into numpy arrays.

Run the program to check if it works. Note that it doesn't print anything out, so after running look at the variables in the Variable explorer pane. Check if time has the same length as the position variables.

People often have to read in text data file, so NumPy has a routine to do this called loadtxt. The code above could be shortened to

(time, posx, posy, posz) = np.loadtxt('halosat_orbit.csv', delimiter=',', unpack=True)

The loadtxt routine reads in a text file and then splits each line into pieces, just like our code above. The data file name is the only required argument. Using delimiter=',' tells loadtxt to divide each line using commas. The default output of loadtxt is a 2D array. Using unpack=True makes loadtxt transpose its output so you can write each column to a variable using a tuple on the left hand side of the equal sign. The routine has lots of options, as shown at https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html. loadtxt allows comment lines in the file. By default any line that starts with # is a comment, but you can change the character if you like. Also, you can skip rows at the beginning of the file; this is useful if your file starts with a block of text describing the contents. loadtxt does the job in most cases, but sometimes the data are formatted in a way that it can't handle. In that case, you can write a custom routine starting with the code in the first example.

Reading Flexible Image Transport System (FITS) files - A disadvantage of using text files for data is that they tend to be large. For example "175.5928192138672," is 18 bytes while a float in Python uses 8 bytes. Also, it is difficult to preserve the structure of data in standard format text files like csv. There are many, many data formats that store numbers in binary formats. Examples include mp3 and wav for audo data, mp4 for video data, and Hierarchical Data Format for particle physics data (currently HDF5). Astrophysicists like to use the Flexible Image Transport System of FITS. FITS is a standard file format used for everything from images from optical telescopes to spacecraft orbit data to X-ray event lists. FITS can store a great variety of data structures and preserve those structures. FITS is also self commenting. Just as commenting code is important, comment data may be even more important. If you forget the conditions under which some data were obtained, those data often become useless.

This file halosat_orbit.fits contains orbit data for the HaloSat CubeSat in FITS format. Let's run the program below to read the data into Python.

# read data from a FITS file
# import libraries
import numpy as np
import astropy.io.fits as pyfits

fitsfile = 'halosat_orbit.fits' # name of FITS file
f = pyfits.open(fitsfile) # read in the FITS file
h = f[1].header # copy the FITS header
d = f[1].data # put all of the data into the variable d
f.close() # close the FITS file

# move the data from d to numpy arrays
time = array(d['TIME']) # TAI Time
pos = array(d['POSITION']) # position as a 3-component vector in ECEF [km]

The program uses the AstroPy library that includes a bunch of astronomically inclined functions and ones to read FITS files. The name of the file is set in the variable fitsfile. The function open in the astropy.io.fits library, which we have decided to refer to with the name pyfits, is similar to the open function built into Python, but reads FITS files. It returns an object that points to the file, similar to the standard open function. However, FITS files can have multiple extensions, each of which can contain different types of data. The data of interest in our file is in extension 1, so we access it using f[1]. After you run the code, try typing f[1].header at the command line. This will print out the header for extension 1 containing a description of the data organized as a bunch of keywords with values and descriptions. NAXIS2 tells you how many data points you have. The TTYPE1, TFORM1, and TUNIT1 keywords give you a description, format, and units for the TAI time data. You get at the data in extension 1 using f[1].data. We put that data into the variable d and then close the file. The last two lines move data from d into numpy arrays. Note that position is a three component vector. Using FITS, we can preserve the structure of the original data (as a vector rather than 3 floats).

Exporting data

You may also want to save the results of calculations that you have done with Python or you may want to use Python to acquire data from instruments and save those to files.

Writing text files - So far, we have printed to the console in order to get results out. If you have a large set of results, you may want to print to a file instead. The open function can also be used to create and write to files. Let's try the program below.

# write squares from 1 to 9 to a csv file
import numpy as np
f = open('squares.csv', 'w') # open the file for writing
f.write('# integer, square of integer\n')
for x in np.arange(1, 10) :
f.write(str(x)+', '+str(x**2)+'\n')
f.close()

In the third line, adding a second parameter of 'w' to the open statement makes open create a new file and write to it rather than read from it. If the file already exists, it will be overwritten, so be careful about your files. In the fourth line, we use the object f associated with the file, sometimes called a handle, to write a comment line to the file. The comment starts with a # and ends with a \n. Back slashes are used in Python (and C and C++) to indicate special characters. \n is a newline. In contrast with the print function, we have to explicitly add newlines when using the write function. We then loop over the integers from 1 to 9 and write a line to the file with each new integer and its square, separated by a comma and ending the line with \n. To finish, we close the file. That makes sure that everything gets recorded. If you forget to close the file, then sometime the last few writes can get lost.

Writing FITS files - Writing the same data to a FITS file is slightly more complicated. The piece of the code that wrote the HaloSat FITS file you used above is at write_fits.py. It will make errors if you run it because the rest of the code that filled the data arrays is not included. Many of the lines are used to make the file self documenting. For example, it is standard when writing FITS files that one includes a MISSION keyword that specifies the mission or the name of the ground-based telescope or instrument that produced the data. Also, one typically records when the file was created and the time range over which the data were collected. This particular code snippet stores individual NumPy arrays as columns in the file. The units for each column and a short description of the data are included in the file. The documentation for writing FITS files along with examples is available at http://docs.astropy.org/en/stable/io/fits/.

Note that I haven't discussed NumPy's built in procedures to store data. I do not recommend using them since they introduce yet another data format that is not particularly self documenting. If you want your data to have a long life, use a self-documenting file format like FITS or a text file format with comments.

Plotting

There are no functions to make plots built into Python. However, there is a very extensive plotting package called Matplotlib that is very widely used in scientific applications of Python. The best way to learn Matplotlib is to look at examples. Let's start with the
program below. Copy it into Spyder, save it with an appropriate file name, and run it.

# plot y = x^2 - 9x for x = 0 to 10
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 10.0, 100) # generate 100 evenly spaced points from 0 to 10
y = -x**2 + 9*x

plt.ion() # make plotting interactive
plt.figure('Quadratic') # make a plot window called Quadratic
plt.clf() # clear the plot window
plt.xlabel('X') # label the X-axis
plt.ylabel('Y') # labdel the Y-axis
plt.title('Quadratic y = -x^2 + 9x') # give the plot a title
#plt.xscale('log') # scale is linear by default
#plt.yscale('log')
plt.plot(x, y, '-k', label='quadratic') # plot the quadratic as a solid, black line
plt.plot(x, 0*x, ':r', label='zero') # plot y=0 as a dotted red line
plt.legend(loc='upper left') # add a legend to explain the curves
plt.show() # display the plot

To begin, we import the pyplot part of the matplotlib library and we call it plt for convenience. Then, we fill the x with our chosen plot range and then compute y according to our chosen equation. All of the lines starting with plt are calls to matplotlib. There are several different ways to access matplotlib and pyplot provides a particularly easy one. The individual calls are documented. The basic idea with pyplot is that we create a plot window and then the following commands modify the plot in that window. Because Python is designed to conveniently handle whole lists and arrays (via numpy), it usually works better to create a set of lists or numpy arrays with your whole set of data and plot them with a single command. This is different from many other programming languages where one plots each point individually with a call to the plotting library inside of a loop. Keep this in mind when doing the homework assignment.

Making Spyder use external plot windows and saving plots for use elsewhere - By default, Sypder will put plots in the console pane. A great feature of matplotlib is that you can interact with the plots and this is lost if the plot is in the console. To make Spyder give each plot its own window, do Tools/Preferences then click on 'Ipython console' and then on the Graphics tab and change 'Graphics backend' to Automatic. Then click Apply. Now run your the plotting program again and it should produce a new window for the plot. If you move your mouse cursor inside the plot, you'll see the coordinates of the cursor in the plot variables. You can use the magnifying glass button to zoom in on a region of the plot and the home buttom to get back to the whole plot. You can save the plot with the floppy disk button (have any of you seen an actual floppy disk?). You can choose a variety of file formats. I use png (Portable Network Graphics) for saving plots for presentations and eps (encapsulated postscript) for saving plots for publications. The textbook suggests svg (scalable vector graphics) for saving plots in a format that you can then edit if you like.

Matplotlib resources - Matplotlib is extremely capable. If you have an idea of a plot that you would like to make, go to https://matplotlib.org/gallery.html and see if there is an image similar to what you want. You can then click on image to find the code that produced it.

Assignment

The remainder of today's class will be devoted to working through chapter 4 of the textbook where all of these topics are covered. As you read the chapter, enter the commands in the purple highlights into Spyder and see what you get. When you are done with the chapter, start on homework #4.

HW #4 is due at the beginning of the next class. The random walk assignment is a reasonably challenging programming task and additional help is available if you ask.