2. Starting out: data files and first plots

[status: mostly-complete-needs-polishing-and-proofreading]

2.1. Motivation, prerequisites, plan

Data plotting and data visualization is the key

2.2. Very first data plots with gnuplot

Our first goal is to become comfortable with data files and with plotting. We first get the students to renew their acquaintance with creating files with an editor and make a file with some hand-crafted data.

Use your favorite editor (possibly emacs for those who have taken my previous course, but vi or gedit should also work) to open a file called simpledata.dat

Enter two columns of simple data into this file. For example:

1970   16.7
1980   17.2
1990   17.5
2000   18.2
2010   19.8
2020   20.1
2030   22.7
2060   28

Then save it, and enter gnuplot to plot this data:

$ sudo apt-get install gnuplot-x11
$ gnuplot
gnuplot> plot 'simpledata.dat'
../_images/simpledata.svg

Figure 2.2.1 First example of data plot.

then have the students plot the data with slightly different options in gnuplot:

$ gnuplot
gnuplot> plot 'simpledata.dat' with lines
gnuplot> set xlabel 'this is the "x" axis'
gnuplot> set ylabel 'this is the "y" axis'
gnuplot> plot 'simpledata.dat' with linespoints

2.3. Plotting functions with gnuplot

More examples of using gnuplot. We don’t assume knowledge of trigonometry from younger students, so we tell the story as “this is the sin function, which you will learn about some day; it plots these waves.”

On the other hand the polynomial \(y = x^3 - 3x\) should be within their reach: I might call on the class to tell me “what’s \(-10^3\) and \(-2^3\), \(2^3\), and \(10^3\)” - this establishes that the plot goes down on the left hand side, and up on the right hand side. The interesting play in the middle can be narrated by showing that \((1/2)^3\) is smaller than \(3 * (1/2)\), so that the negative term dominates briefly (see Figure 2.3.1)..

gnuplot> plot sin(x)
gnuplot> plot x*x*x - 3*x
gnuplot> plot x**3 - 3*x
## note that this last one did show a dip in the middle,
## but zooming in on the range from -3 to 3 shows the
## interesting features in the middle of the plot.  This
## plot has two flat points (one local maximum and one
## local minimum), rather than one saddle point.
gnuplot> plot [-3:3] x**3 - 3*x

If I am getting a good mathematical vibe from the classroom I will briefly step into calculus territory and ask students to predict the local max and min for \(y = x^3 - 3x\). I will then quickly show them that

\[\frac{d(x^3 - 3x)}{dx} = 3 x^2 - 3\]

and using the quadratic formula

\[x = \frac{-b \pm \sqrt{b^2 - 4ac}} { 2a }\]

we get \(\pm 6 / 6 = \pm 1\), which is exactly where the local max and min are located (see Figure 2.3.1).

You can now set the grid in the plot and see if this calculation matches the plot:

Listing 2.3.1 Instructions to plot a simple function and its derivative, in this case the function \(f(x) = x^3 - 3 x\) and its derivative \(df(x)/dx = 3 x^2 - 3\).
##CAPTION: Simple line plot.
set grid
## while we're at it also show the derivative:
plot [-3:3] x**3 - 3*x, 3*x**2 - 3

The two plots, superimposed in Figure 2.3.1, show that where the derivative function (\(3x^2-3\)) is zero, the original function (\(x^3-3x\)) has its flat point. This can be presented, especially to older kids, in a rapid way that conveys “you don’t have to understand this, but if you have heard about slopes then note that the second curve shows the slope of the first one…” For the younger kids we can emphasize that the figure shows two functions and looks intriguing.

../_images/simplefunction.svg

Figure 2.3.1 A simple plot of the function \(y = x^3 - 3x\) and its derivative \(y = 3x^2 - 3\).

Visualizing functions like this is cool and it can be useful during the exploratory phase of a research project, but it seldom comes up in the bulk of the work and in the final scientific write-up. We will now move on to tasks which come up quite frequently.

2.4. Reading and writing files, in brief

First let us make sure we know a couple of shell commands to look at a file. Here I usually will take a portion of the board and write a boxed inset “cheat sheet” with some useful shell commands. [4]

Since we already have a file called simpledata.dat which we created earlier, let us look at three shell commands that give us a quick glance at what’s in that file: head, tail and less.

$ head simpledata.dat
$ tail simpledata.dat
$ less simpledata.dat
## (when using less be sure to quit with 'q')

These are simple ways to peek at a file, and will work with any text file. You should always remember these commands.

Next we will look at how to read a file in a Python program. This is a crucial pattern and we will use it a lot. Type in the program reader.py shown in Listing 2.4.1 and run it to see what happens.

Listing 2.4.1 reader.py – Reads a simple set of 2-column data
#! /usr/bin/env python3

"""show a simple paradigm for reading a file with two columns of
data"""

def main():
    fname = 'simpledata.dat'    # the file we wrote out by hand
    dataset = read_file(fname)
    print('I just read file %s with %d lines' % (fname, len(dataset)))
    print('I will now print the first 10 lines')
    N = min(10, len(dataset))
    for i in range(N):
        print(dataset[i])

def read_file(fname):
    dataset = []
    f = open(fname, 'r')
    for line in f.readlines():
        words = line.split()
        x = float(words[0])
        y = float(words[1])
        dataset.append((x, y))
    f.close()
    return dataset

main()

Remember that after writing and saving the program you do the following to make it executable and then run it:

$ chmod +x reader.py
$ ./reader.py

Finally let us see how to write files to disk. We will extend to do an easy manipulation of the file simpledata.dat and then write it back out to a new file simpledata.dat.sums. This new program will be called simple-writer.py, so we need to copy it first:

$ cp reader.py simple-writer.py

and edit simple-writer.py to look like Listing 2.4.2.

Listing 2.4.2 simple-writer.py – Reads a simple set of 2-column data and sums the second column and then writes out line with (x y sumy).
#! /usr/bin/env python3

"""show a simple paradigm for writing a file after reading it and
adding some content to it"""

def main():
    fname = 'simpledata.dat'    # the file we wrote out by hand
    dataset = read_file(fname)
    print('I just read file %s with %d lines' % (fname, len(dataset)))
    print('I will now print the first 10 lines')
    for i in range(10):
        print(dataset[i])
    print('I will now modify the data')
    summed_data = append_sums(dataset)
    write_file(fname + '.sums', summed_data)
    print('I wrote the modified data to %s' % (fname + '.sums'))

def read_file(fname):
    dataset = []
    f = open(fname, 'r')
    for line in f.readlines():
        words = line.split()
        x = float(words[0])
        y = float(words[1])
        dataset.append((x, y))
    return dataset

def append_sums(dataset):
    sum_y = 0
    summed_data = []
    for pair in dataset:
        sum_y = sum_y + pair[1]
        triplet = (pair[0], pair[1], sum_y)
        summed_data.append(triplet)
    return summed_data

def write_file(fname, dataset):
    f = open(fname, 'w')
    for triplet in dataset:
        f.write('%g   %g   %g\n' % triplet)
    f.close()

main()

At this point we are comfortable with basic programming patterns for reading and writing data. This is very important: this kind of file manipulation is one of the big steps in becoming a confident programmer. This is a good time to make the kids familiar with the terminology “input/output” (I/O).

2.5. Generating our own data to plot

Now we look at an example of writing a python program to generate some data which we will then plot. Initially this just feels like silly data: it calculates the \(sin\) function for many opints and prints out the values. There is a reason to start with this very simple model: it will allow us (in the more advanced course on Fourier analysis) to give clear cut examples of deeper data anlysis. We will not be lazy: in the more advanced courses we will look at real signals instead of the toy ones.

Start with the python3 command line and let us see the few lines of code that generate \(sin\) wave data:

Listing 2.5.1 simplewave.py – simple \(sin\) wave generator. You can run it with python3 simplewave.py
#! /usr/bin/env python3

"""Generate a simpole sine wave"""

import math

for i in range(20):
    x = i/10.0
    signal = math.sin(x)
    print('%g    %g' % (x, signal))
0    0
0.1    0.0998334
0.2    0.198669
0.3    0.29552
0.4    0.389418
0.5    0.479426
0.6    0.564642
0.7    0.644218
0.8    0.717356
0.9    0.783327
1    0.841471
1.1    0.891207
1.2    0.932039
1.3    0.963558
1.4    0.98545
1.5    0.997495
1.6    0.999574
1.7    0.991665
1.8    0.973848
1.9    0.9463

Comments on this:

  • This just prints values to the terminal, so we limited it to some 20 values of \(x\). When we write it to a file for plotting we will do much more.

  • The classic for loop generates a sequence of integers, which does not do well at all for plotting: we want to space our \(x\) values much more tightly to get a nice plot, so we divide the integer i by 10.0 to get finely spaced values of \(x\).

Let us now put this into a file called simplewave-write.py

Listing 2.5.2 simplewave-write.py – simple \(sin\) wave generator, writes output to a file called simplewave-write.dat
#! /usr/bin/env python3

import math

def main():
    out_fname = 'simplewave-write.dat'
    f = open(out_fname, 'w')
    for i in range(200):
        x = i/10.0
        signal = math.sin(x)
        f.write('%g    %g\n' % (x, signal))
    f.close()
    print('finished writing file %s' % out_fname)

main()

We run the program and do a quick scan of its output with:

$ python3 simplewave-write.py
$ ls -l simplewave-write.dat
$ head simplewave-write.dat
$ tail simplewave-write.dat

and when we have seen the first and last few lines of the output file we realize we can plot it (after going back to our gnuplot window) with:

Listing 2.5.3 Instructions to plot the output of simplewave-write.py
##CAPTION: Plot the simple sin wave from a file.
set grid
plot 'simplewave-write.dat' using 1:2 with linespoints
../_images/simplewave-write.svg

Figure 2.5.1 Plot of simplewave-write.dat which was output by our simple wave generator.

Describing what this program does is rather straightforward, and it can be compared to the gnuplot instruction plot sin(x).

At this point, at the blackboard, I will suggest to the students a couple of edits they “should have already tried on their own”:

  • See what happens if we don’t divide \(i/100.0\): try both x = i (for this use range(7)) and some intermediate value x = i/4.0 (for this use range(28)). Note the loss of resolution.

  • Use python’s sys.argv to take the file name as a command line argument.

We then show a cleaner version of this program which adds some comments, makes clear what the \(sin\) period is, and uses some robust proramming paradigms such as using command line arguments.

Listing 2.5.4 simplewave-cleaner.py – More elaborate version of simplewave-write.py
#! /usr/bin/env python3

"""
Generate samples of a simple sin() wave and save them to a
file.  The file has two columns of data separated by white 
space.  To run the program and plot its output try:
$ python3 generate-sin-cleaner.py 'sin_output.dat'
$ gnuplot
gnuplot> plot 'sin_output.dat' using 1:2 with lines
"""

import math
import sys

def main():
    out_fname = 'simplewave-write.dat'  # default output filename
    if len(sys.argv) == 2:
        out_fname = sys.argv[1]
    elif len(sys.argv) > 2:
        print('error: program takes at most one argument')
        sys.exit(1)

    f = open(out_fname, 'w')
    for i in range(700):
        x = i/100.0
        signal = math.sin(x)
        f.write('%g    %g\n' % (x, signal))
    f.close()
    print('finished writing file %s' % out_fname)

main()

Note that the comments also tell you how to run a program and visualize the output. This use of comments to explain behavior is an important detail, even for small programs.

In the more advanced course on Fourier analysis we will revisit this simple program and make it generate more complex waveforms.

2.6. The broad landscape of plotting software

Now that we have seen some examples, let us talk broadly about plotting software, since you will soon be bombarded with people telling you about their favorites.

By now we know well that in the free software world there is often a dizzying variety of tools to do any task, with fans advocating each approach. Gnuplot is full-featured, stable, actively maintained and ubiquitous so I have chosen it, there are several other valid choices.

There are at least three main categories of plotting tools:

  1. The “just a plotting program” kind,

  2. The “plotting program with some data analysis that grew into a full programming language”,

  3. The “plotting library for a well-established programming language”.

Gnuplot is clearly one of the first, R and Octave the second, Python with Matplotlib and Cern’s Root are examples of the third.

Often it comes down to where a particular scientist did her early research work: a boss will tell you to “use this tool because it’s what I use”. I recommend forming a broad knowlege of scientific tools so that you can use the most appropriate tool instead of the tool that makes your boss comfortable. You will often find that astrophysicists often use Python with matplotlib, particle physicists use Cern’s Root, biologists use R or Python, social scientists who do much statistical work use R. You shoud always know the tool your community uses (if it’s a free software tool), as well as some others which might be more appropriate.

There are many proprietary plotting packages. I advocate against the use of proprietary software, and it is certainly unacceptable to do science with proprietary tools, but I will mention a couple of packages so that when you come across users you will be able to categorize and compare them and offer an effective free software approach to the same problem.

CricketGraph, often used in the 1990s, was a light-weight plotting program, later supplanted by KaleidaGraph. I would recommend gnuplot as a good way of doing what those programs did.

Matlab and IDL are simple plotting and data analysis programs that grew out of control and added an ad-hoc programming language to the package. The languages were never meant for large software applications, but are often used to write very large programs. It is interesting to note that these programs always start with the stated intention of not requiring a scientist to know how to program, but they end up channeling scientists into using an ill-designed language for large programs.

Matlab and IDL programs can be written in a much cleaner way using Python for the programming parts, and the matplotlib plotting library for graphics.

There is a final outlier in the proprietary data analysis world, which is the “using a spreadsheet to do data analysis” approach, often with the proprietary Excel spreadsheet. There is no saving grace to this approach: apart from technical concerns with the validity of the numerical subroutines, there is also the complete lack of reproducibility of a person moving a mouse around over cells. One of the most embarrassing cases of incorrect analysis was in a much-cited Economics paper about debt rations in European countries [RR10]. The analysis was done with an Excel spreadsheet, and some readers concluded that the authors selected the wrong range of data with a mouse movement. There is no reproducibility when mouse movements are part of the data analysis. The economics article was disgraced because of the faulty analysis as well as other problems with their methodology.

2.7. Data formats

In Section 2.2 and Section 2.4 we saw the simplest examples of data files: columns of numbers separated by white space. These are the simplest to work with, and if your files are smaller than about 10 megabytes you should always treat yourself to that simplicity. This format is often called “whitespace separated ascii” or names similar to that.

Often you will find that the columns of data are separated by commas. This format is called “comma separated values” (csv) and the files often end with the .csv extension. The format has been around almost half a century. It has some advantages over the whitespace separated columns and is used by almost all spreadsheet programs as an import/export format. In gnuplot you handle this with the instruction set datafile separator comma as we see in Section 8.1.

Sometimes files are in a variety of binary formats, which you cannot read directly. We will not deal with these at this time, since we are not yet working with very big files, but later on we will show how to convert mp3 files to an ascii format which is easily read by our programs and by gnuplot.

2.8. Simple surface plots

So far we have looked at line plots. Let us now look at another type of plot: the surface plot. A surface plot comes from a function of two variables: \(z = f(x, y)\). In these plots the value of the function is plotted as the height over the x, y plane. Here is an example:

Listing 2.8.1 Instructions to plot a simple surface, in this case the function \(f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}\)
##CAPTION: Simple surface plot.
set grid
set pm3d
set xlabel 'x'
set ylabel 'y'
set samples 50
set isosamples 50
splot exp((-x*x-1.7*y*y)/10.0) with pm3d
../_images/simplesurfacefunction.svg

Figure 2.8.1 The surface plot of \(f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}\)

The plot in Figure 2.8.1 is visually gratifying, and even more so because when you generate it interactively with gnuplot you can grab it with the left mouse button and rotate it to get more insight into what the surface looks like from different angles.

Another way of showing the same information is a heat map. Figure 2.8.2 shows the same function, but this time the value of the function is represented with color instead of height.

Listing 2.8.2 Instructions to plot a heat map, in this case the function \(f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}\)
##CAPTION: Heat map plot.
set size ratio -1
set view map
set samples 50
set isosamples 50
set xlabel 'x'
set ylabel 'y'
splot exp((-x*x-1.7*y*y)/50.0) with pm3d
../_images/heatmap.svg

Figure 2.8.2 The heat map for of \(f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}\)

2.9. Topics we have covered

  • data files

  • plots

  • gnuplot

  • reading features from simple plots

  • simple surface plots