Script 3.2 of the online course "An introduction to agent-based modeling with Python" by Claudius Gräbner

For the course homepage and links to the accompanying videos (in German) see: http://claudius-graebner.com/introabmdt.html

For the English version (currently without videos) see: http://claudius-graebner.com/introabmen.html

Last updated: July 18, 2019

Import and export of data

Importing data into Python and exporting it happens all the time, for example:

  • You download statistics that you want to explore visually or analyze quantitatively
  • You want to save the results of your ABM for further analysis
  • You want to load the results of a previously run ABM to analyze them quantitatively
  • You validate the results of your model against empirical data
  • ...

There are many different ways to read and write data. Here we focus on approaches that build on the pandas library. There are many reasons for this:

  • pandas provides functions to read and write basically every useful file format (e.g. Excel, csv, dta, ...)
  • pandas also provides many useful classes and functions for manipulating and analyzing data, such as functions to rename variables, reshape data frames, take subsets, and filter your data, etc.
  • pandas is almost indispensable when you want to perform more advanced data analysis, such as regressions or machine learning
  • pandas provides you with some very time- and space-efficient methods for storing your data
  • pandas interacts well with matplotlib, the standard plotting library in Python, which we will explore below

All these reasons, and probably some more, should convince you that learning a bit of pandas is a good investment, irrespective of the purpose you want to use Python for.

As with numpy, there is a general convention on how to import pandas:

In [1]:
import pandas as pd

The basic data structure of Pandas: DataFrames

In numpy, arrays are of central importance to basically everything the library does. In pandas, the class DataFrame plays this role. You can best think of data frames as tables that contain information about variables and observations.

When you import a dataset into pandas, e.g. from an Excel sheet that you have downloaded from Eurostat, it will be of type DataFrame.

But before we actually import a real data set, we will work through an example of how to create data frames on our own (which is what we will do later when building ABMs).

Creating data frames from scratch

We will cover two ways of creating a DataFrame, both of which occur when working with ABMs:

  1. Creating a DataFrame from a dictionary
  2. Creating a DataFrame from a numpy.ndarray

After considering these examples we quickly discuss the arguments that can be passed to DataFrame.

Creating data frames from dictionaries

Now consider the following dictionary:

In [2]:
import numpy as np
cons_dict = {"time": list(range(10)), "consumption" : np.random.randint(10, size=10)}
cons_dict
Out[2]:
{'consumption': array([1, 8, 5, 3, 0, 0, 9, 2, 0, 9]),
 'time': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

This dictionary contains information about the level of consumption at ten different time steps. Outcomes of ABMs often have this form. To make this data amenable to further analysis we transform it into a DataFrame:

In [3]:
dat_frame_1 = pd.DataFrame(data=cons_dict)
dat_frame_1.head(5) # only print the first five rows
Out[3]:
consumption time
0 1 0
1 8 1
2 5 2
3 3 3
4 0 4

pandas automatically turns the keys of the dictionary into the column names, and the values of the dictionary into the observations. This is nice, and is in line with the tidy data approach, which you should always follow when working with data.

If you want to consider only some of the columns:

In [4]:
dat_frame_1 = pd.DataFrame(data=cons_dict, columns=['consumption'])
dat_frame_1.head() # only print the first five rows
Out[4]:
consumption
0 1
1 8
2 5
3 3
4 0

Sometimes you want to make sure that the data type of the values of the dictionary is set explicitly. This is particularly relevant when the values are codes, which may start with a leading zero. In such cases, turning them into floats or integers would discard relevant information:

In [5]:
code = "0105"
print(int(code))
105

To avoid this, you might set the data types for the observations explicitly:

In [6]:
dat_frame_1 = pd.DataFrame(data=cons_dict, dtype=float)
dat_frame_1.head() # only print the first five rows
Out[6]:
consumption time
0 1.0 0.0
1 8.0 1.0
2 5.0 2.0
3 3.0 3.0
4 0.0 4.0
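
If the values were codes as in the example above, you could instead pass dtype=str to keep them intact. A minimal sketch (the column name and its values are made up for illustration):

code_frame = pd.DataFrame({"code": ["0105", "0203"]}, dtype=str)
code_frame["code"][0]  # '0105': the leading zero survives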

In any case, the result will be a pandas.DataFrame, which can be analyzed, manipulated, and saved to external files.

Creating data frames from arrays

The second way data frames are regularly created in the context of ABMs is from numpy.ndarrays. Suppose we have the following array:

In [7]:
random_matrix = np.random.random(size=(5, 5))
random_matrix
Out[7]:
array([[0.17005374, 0.42555618, 0.68734496, 0.58590812, 0.17959497],
       [0.66817322, 0.9575838 , 0.45203858, 0.06683117, 0.10663936],
       [0.9050873 , 0.916165  , 0.84837789, 0.13685657, 0.7875758 ],
       [0.41042188, 0.13492038, 0.06439143, 0.67338557, 0.31656837],
       [0.11155004, 0.3271465 , 0.44604165, 0.55051836, 0.24824863]])

This could be, for example, an interaction matrix indicating the interaction structure among the agents. To turn it into a data frame, simply do the following:

In [8]:
dat_frame_2 = pd.DataFrame(data=random_matrix)
dat_frame_2.head()
Out[8]:
0 1 2 3 4
0 0.170054 0.425556 0.687345 0.585908 0.179595
1 0.668173 0.957584 0.452039 0.066831 0.106639
2 0.905087 0.916165 0.848378 0.136857 0.787576
3 0.410422 0.134920 0.064391 0.673386 0.316568
4 0.111550 0.327146 0.446042 0.550518 0.248249

Arguments of the DataFrame function

Before moving forward we quickly go through the arguments of the function that creates a DataFrame. According to the official documentation, DataFrame takes the following arguments:

  1. data
  2. index
  3. columns
  4. dtype
  5. copy

We already used the data argument: this is the underlying data to be transformed, e.g. a dictionary or a np.ndarray as in the examples above.

The index argument can be used to name the rows of the DataFrame explicitly. By default, these are just numbers starting from zero:

In [9]:
dat_frame_2 = pd.DataFrame(data=random_matrix) # standard row names
dat_frame_2.head()

dat_frame_3 = pd.DataFrame(data=random_matrix, index=["row_" + str(i) for i in range(random_matrix.shape[0])])  # shape[0] is the number of rows
dat_frame_3.head()
Out[9]:
0 1 2 3 4
row_0 0.170054 0.425556 0.687345 0.585908 0.179595
row_1 0.668173 0.957584 0.452039 0.066831 0.106639
row_2 0.905087 0.916165 0.848378 0.136857 0.787576
row_3 0.410422 0.134920 0.064391 0.673386 0.316568
row_4 0.111550 0.327146 0.446042 0.550518 0.248249

This will not be used frequently. The columns argument, on the other hand, is used frequently, and we have actually used it above. It serves to import only a subset of the columns of the original data (if column names are already present), or to add column names:

In [10]:
dat_frame_1 = pd.DataFrame(data=cons_dict, columns=['consumption']) # The dict has two columns: "time" and "consumption"
dat_frame_1.head(2) 
Out[10]:
consumption
0 1
1 8
In [11]:
dat_frame_4 = pd.DataFrame(data=random_matrix, columns=['A', 'B', 'C', "D", "E"])
dat_frame_4.head(2)
Out[11]:
A B C D E
0 0.170054 0.425556 0.687345 0.585908 0.179595
1 0.668173 0.957584 0.452039 0.066831 0.106639

The argument dtype was also used above. It is used to specify the data type of the observations:

In [13]:
dat_frame_5 = pd.DataFrame(data=random_matrix, dtype=str)
dat_frame_5.head()
Out[13]:
0 1 2 3 4
0 0.17005373846819194 0.4255561803920467 0.6873449601520477 0.5859081165391146 0.1795949703261832
1 0.6681732153985006 0.9575837958962751 0.4520385786773493 0.06683116699282576 0.10663936131877483
2 0.9050873019903296 0.9161650043739393 0.8483778912457298 0.1368565743223482 0.7875758038729648
3 0.4104218813062618 0.13492037714959026 0.0643914347821064 0.6733855686420844 0.3165683678118967
4 0.111550040657029 0.3271464981076774 0.4460416481765922 0.5505183612842459 0.24824862779547952

The final argument is copy, which takes a boolean value as an input and defaults to False. It basically specifies whether the underlying data should be copied. You do not need to worry about this argument for the moment.
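
For the curious, here is a small sketch of what copy controls. Whether copy=False actually avoids a copy depends on the input type and your pandas version, so treat this as illustrative:

arr = np.zeros((2, 2))
frame_view = pd.DataFrame(arr, copy=False)  # may share memory with arr
frame_copy = pd.DataFrame(arr, copy=True)   # always an independent copy
arr[0, 0] = 99.0
# Depending on your pandas version, frame_view may now show 99.0 in its
# first cell, while frame_copy is guaranteed to be unaffected.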

Reading and writing data

Now that we know what data frames look like, we can turn to reading data from external files.

Some remarks on the working directory

There are two ways of how you can specify the path to the file you want to import: you can give a relative path or an absolute path.

A relative path is relative to the current working directory. The working directory is the main folder of your project. Remember, a project file can be created with Spyder, and whenever you load the project file, the current working directory is set to the folder in which the project file resides. This does not need to be the folder in which your script is located. For example, suppose you use a folder structure like this:

fancy_project/
    project file
    output/  <- folder with output files such as plots and data
    code/    <- folder with all your scripts
    data/    <- folder with all your data files
    notes/   <- folder with all your notes, doc files, etc.

Then, to read a file called nice_data.csv that is located in the folder data, you provide the following path to pandas:

In [14]:
rel_path = "data/nice_data.csv"
data_frame = pd.read_csv(rel_path)
data_frame.head(5)
Out[14]:
time capital consumption
0 0 0.005715 3
1 1 0.295124 9
2 2 0.153343 9
3 3 0.773451 1
4 4 0.430681 8
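
If you are ever unsure what your current working directory is, you can check (and, if necessary, change) it with the standard library module os:

import os
print(os.getcwd())   # print the current working directory
# os.chdir("..")     # change the working directory (use with care)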

If you use absolute paths, you provide the full path, starting from the root of your file system. In my case, for the example above, this would be:

In [15]:
abs_path = "/Users/graebnerc/Dropbox/fancy_project/data/nice_data.csv"

The advantage of absolute paths is that they are unique on your computer. If, for some reason, you change your working directory during your work (e.g. with os.chdir) and you are using a relative path, the file will probably not be found. Whenever you give an absolute path, however, there is no such problem.

Yet, in almost all cases relative paths are preferable. Why? Suppose you want to collaborate with others and you share the project directory (e.g. via Dropbox or, better, GitHub). Then code using absolute paths will work only for the person who wrote it, since the location of the project folder differs across computers. Relative paths, in contrast, are the same for all collaborators, so they work for everybody sharing the same project folder:

In [16]:
abs_path = "/Users/graebnerc/Dropbox/fancy_project/data/nice_data.csv"
# This path works only on my computer
rel_path = "data/nice_data.csv"
# This path works for everybody using the same project directory

Reading and writing csv data

csv stands for 'comma-separated values' and is among the most standard and useful formats, simply because it can be read by basically every program on every operating system.

Suppose we have the following data frame:

In [17]:
cons_dict2 = {"time": list(range(50)), "consumption" : np.random.randint(10, size=50), "capital" : np.random.random(size=50)}
expl_frame = pd.DataFrame(cons_dict2)
expl_frame = expl_frame[['time', 'capital', 'consumption']]
expl_frame.head(4)
Out[17]:
time capital consumption
0 0 0.427115 7
1 1 0.855862 9
2 2 0.744055 4
3 3 0.633713 4

To save this frame to a csv file we simply use the method to_csv:

In [18]:
rel_path = "data/nice_data.csv"
expl_frame.to_csv(rel_path)

To read csv files we use the function pd.read_csv:

In [19]:
data_frame_imported = pd.read_csv(rel_path)
data_frame_imported.head(4)
Out[19]:
Unnamed: 0 time capital consumption
0 0 0 0.427115 7
1 1 1 0.855862 9
2 2 2 0.744055 4
3 3 3 0.633713 4

Note that our code above saves the row index as a proper column. This is usually not what we want. To suppress this, use:

In [20]:
expl_frame.to_csv(rel_path, index=False)
data_frame = pd.read_csv(rel_path)
data_frame.head(5)
Out[20]:
time capital consumption
0 0 0.427115 7
1 1 0.855862 9
2 2 0.744055 4
3 3 0.633713 4
4 4 0.792698 2
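
Alternatively, if a file was written with the index included, you can tell read_csv to treat the first column as the row index via the argument index_col:

data_frame = pd.read_csv(rel_path, index_col=0)  # first column becomes the row index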

For further information on the arguments of to_csv see the documentation page. Typical adjustments when reading csv data are: which sign is used to separate the columns (specified by the argument sep)? What is the decimal sign (specified by the argument decimal)? Should the first row be used as column names (specified by the argument header)? Etc.
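
For illustration, here is a sketch of how these arguments might be combined when reading a csv file exported from a program using a German locale (the file name is hypothetical):

data_frame = pd.read_csv(
    "data/locale_data.csv",  # hypothetical file
    sep=";",       # columns are separated by semicolons rather than commas
    decimal=",",   # "3,14" is parsed as the float 3.14
    header=0,      # the first row contains the column names
)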

Reading and writing HDF5 files

While csv is a good file format for many tasks, and the standard format of official statistics offices, it also has some disadvantages: you can store only a single table per file, the compression rate is low, reading and writing speeds are not great, you cannot store accompanying metadata, etc.

A very useful alternative for saving the results of an ABM is the HDF5 file format.

Consider our data frame from above:

In [21]:
expl_frame.head(4)
Out[21]:
time capital consumption
0 0 0.427115 7
1 1 0.855862 9
2 2 0.744055 4
3 3 0.633713 4

To begin, we first need to create a store instance:

In [23]:
store = pd.HDFStore('output/h5_results.h5')

We can now add data to the store, for example the frame from above, as well as two of the data frames we defined earlier:

In [24]:
store['expl_frame'] = expl_frame
store['old_frame_1'] = dat_frame_3
store['old_frame_2'] = dat_frame_5

Then, we close the store:

In [25]:
store.close()
In [26]:
store.is_open
Out[26]:
False

We can also condense the code above by using the with statement (a context manager):

In [27]:
with pd.HDFStore('output/h5_results.h5') as store:
    store['expl_frame'] = expl_frame
    store['old_frame_1'] = dat_frame_3
    store['old_frame_2'] = dat_frame_5

This also automatically closes the store.

To read such data in a new Python session, simply do the following:

In [28]:
imported_store = pd.HDFStore('output/h5_results.h5')
old_frame = imported_store['old_frame_1']
imported_store.close()
old_frame.head(2)
Out[28]:
0 1 2 3 4
row_0 0.170054 0.425556 0.687345 0.585908 0.179595
row_1 0.668173 0.957584 0.452039 0.066831 0.106639
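
A store behaves much like a dictionary, so you can also list its contents. And since HDF5 can hold metadata, one possible pattern (shown here as a sketch, not something the csv format could offer) is to attach simple attributes to a stored frame:

with pd.HDFStore('output/h5_results.h5') as store:
    print(store.keys())  # e.g. ['/expl_frame', '/old_frame_1', '/old_frame_2']
    # attach a short description as metadata to one of the stored frames:
    store.get_storer('expl_frame').attrs.description = "example ABM output"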

For more information you might check out the documentation page.

Reading and writing feathers

A very new and practical file format is feather. It has excellent compression rates, is fast to read and to write, and can be understood by both Python and R. This is the format I am using to save the results of my ABM. Note that feather support in pandas relies on an additional package (pyarrow), which may need to be installed separately.

We use again our example frame:

In [29]:
expl_frame.head(4)
Out[29]:
time capital consumption
0 0 0.427115 7
1 1 0.855862 9
2 2 0.744055 4
3 3 0.633713 4

To write it to a feather file, simply do:

In [30]:
expl_frame.to_feather('data/expl_frame.feather')

And to read it:

In [31]:
expl_frame_loaded = pd.read_feather('data/expl_frame.feather')

Check whether the two are the same:

In [32]:
expl_frame_loaded.equals(expl_frame)
Out[32]:
True

Further reading

Basically, all import functions work in a similar way. Here you can get an overview of all available functions, which allow you to import basically all relevant data into pandas.
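
To illustrate the common pattern, here are two examples with hypothetical file names (reading Excel files additionally requires a package such as openpyxl or xlrd):

excel_frame = pd.read_excel("data/some_data.xlsx", sheet_name=0)  # first sheet
stata_frame = pd.read_stata("data/some_data.dta")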

To get a concise intro to pandas that also covers data frame manipulation, analysis and visualization, you might have a look at the official - yet euphemistically named - introduction 10 minutes to Pandas. It also contains a lot of good references for further reading.
