Script 3.2 for the video lectures accompanying the online course "An Introduction to Agent-Based Modelling in Python" by Claudius Gräbner
For the course homepage see: https://claudius-graebner.com/teaching/introduction-abm-deutsch.html
Importing and exporting data is something that happens all the time when working with Python.
There are many different ways to read and write data. Here we focus on approaches that build on the pandas library. There are many reasons for this:

- pandas provides functions to read and write basically every useful data type (e.g. Excel, csv, dta, ...)
- pandas also provides many useful classes and functions for manipulating and analyzing data, such as functions to rename variables, reshape data frames, take subsets, and filter your data
- pandas is almost indispensable when you want to perform more advanced data analysis, such as regressions or machine learning
- pandas provides some very time- and space-efficient methods for storing your data
- pandas interacts well with matplotlib, the standard plotting library in Python, which we will explore below

All these, and probably some more, reasons should convince you that learning a bit of pandas is a good investment, irrespective of the purpose for which you want to use Python.
As for numpy, there is a general convention on how to import pandas:
import pandas as pd
In numpy, arrays are of central importance to basically everything the library does. In pandas, the class DataFrame plays this role. You can best think of data frames as tables that contain information about variables and observations. When you import a dataset into pandas, e.g. from an Excel sheet that you have downloaded from Eurostat, it will be of the type DataFrame.
But before we actually import a real data set, we will work through an example of how to create data frames on our own (which is what we will do later when building ABMs).
We will cover two ways of creating a DataFrame, both of which will come up when working with ABMs:

- creating a DataFrame from a dictionary
- creating a DataFrame from a numpy.ndarray

After considering these examples we quickly discuss the arguments that can be passed to DataFrame.
Now consider the following dictionary:
import numpy as np
cons_dict = {"time": list(range(10)), "consumption" : np.random.randint(10, size=10)}
cons_dict
This dictionary contains information about the level of consumption at ten different time steps.
Outcomes of ABM often have this form.
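To see how such a dictionary can arise in practice, here is a minimal sketch of a toy simulation loop; the consumption rule is made up purely for illustration:

results = {"time": [], "consumption": []}
for t in range(10):
    consumption = np.random.randint(10)  # stand-in for the model's actual consumption rule
    results["time"].append(t)
    results["consumption"].append(consumption)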
To make this data amenable to further analysis we transform it into a DataFrame:
dat_frame_1 = pd.DataFrame(data=cons_dict)
dat_frame_1.head(5) # only print the first five rows
pandas automatically turns the keys of the dictionary into the column names, and the values of the dictionary into the observations. This is nice, and in line with the tidy data approach, which you should always follow when working with data.
If you want to consider only some of the columns:
dat_frame_1 = pd.DataFrame(data=cons_dict, columns=['consumption'])
dat_frame_1.head() # only print the first five rows
Sometimes you want to make sure that the data type of the values of the dictionary is set explicitly. This is particularly relevant when the values are codes, which may start with a leading zero. In such cases, turning them into floats or integers would discard relevant information:
code = "0105"
print(int(code))  # prints 105 -- the leading zero is lost
To avoid this, you might set the data types for the observations explicitly:
dat_frame_1 = pd.DataFrame(data=cons_dict, dtype=float)
dat_frame_1.head() # only print the first five rows
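If the values really are codes, you can instead pass dtype=str to make explicit that they should stay strings (a sketch with made-up product codes):

codes_frame = pd.DataFrame(data={"product_code": ["0105", "0243", "1101"]}, dtype=str)
print(codes_frame["product_code"].iloc[0])  # '0105' -- the leading zero is preserved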
In any case, the result will be a pandas.DataFrame, which can be analyzed, manipulated, and saved to external files.
The second way data frames are regularly created in the context of ABM is from numpy.ndarrays. Suppose we have the following array:
random_matrix = np.random.random(size=(5, 5))
random_matrix
This could be, for example, an interaction matrix indicating the interaction structure among the agents. To save this as a data frame simply do the following:
dat_frame_2 = pd.DataFrame(data=random_matrix)
dat_frame_2.head()
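In this context it can be useful to give rows and columns meaningful names right away, for example hypothetical agent identifiers (the index and columns arguments used here are explained below):

agent_ids = ["agent_" + str(i) for i in range(5)]
interaction_frame = pd.DataFrame(data=random_matrix, index=agent_ids, columns=agent_ids)
interaction_frame.head(2)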
Before moving forward we quickly go through the arguments of the function that creates a DataFrame.
According to the official documentation, DataFrame takes the following arguments: data, index, columns, dtype, and copy.
We already used the data argument: this is the basic data that shall be transformed, e.g. a dictionary or a np.ndarray as used in the examples above.
The index argument can be used to name the rows of the DataFrame explicitly.
By default, these are just numbers starting from zero:
dat_frame_2 = pd.DataFrame(data=random_matrix) # standard row names
dat_frame_2.head()
dat_frame_3 = pd.DataFrame(data=random_matrix, index=["row_" + str(i) for i in range(random_matrix.shape[0])])  # explicit row names
dat_frame_3.head()
This will not be used frequently. The columns argument, on the other hand, is used frequently, and we actually did so above.
It is used to import only a subset of the columns of the original data (if column names are already present), or to add column names:
dat_frame_1 = pd.DataFrame(data=cons_dict, columns=['consumption']) # The dict has two columns: "time" and "consumption"
dat_frame_1.head(2)
dat_frame_4 = pd.DataFrame(data=random_matrix, columns=['A', 'B', 'C', "D", "E"])
dat_frame_4.head(2)
The argument dtype was also used above. It is used to specify the data type of the observations:
dat_frame_5 = pd.DataFrame(data=random_matrix, dtype=str)
dat_frame_5.head()
The final argument is copy, which takes a boolean value as an input and defaults to False.
It basically specifies whether the underlying data should be copied.
You do not need to worry about this argument for the moment.
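For the curious, here is a minimal sketch of what copy does; note that whether memory is actually shared when copy=False can depend on your pandas version:

arr = np.zeros((2, 2))
frame_view = pd.DataFrame(data=arr, copy=False)  # may share memory with arr
frame_copy = pd.DataFrame(data=arr, copy=True)  # always gets its own copy of the data
arr[0, 0] = 99.0
print(frame_view.iloc[0, 0])  # 99.0 -- the change in arr shows up here (version-dependent)
print(frame_copy.iloc[0, 0])  # 0.0 -- the copy is unaffected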
Now that we know what data frames look like, we can look at how to read data from external data files.
There are two ways of specifying the path to the file you want to import: you can give a relative path or an absolute path.
A relative path is relative to the current working directory. The working directory is the main folder of your project. Remember, a project file can be created with Spyder, and whenever you load the project file, the current working directory is set to the folder in which the project file resides. This does not need to be the folder your script is located in. For example, suppose you use an exemplary folder structure like this:
/project file
/output/ <- folder with output files such as plots and data
/code/ <- folder with all your scripts
/data/ <- folder with all your data files
/notes/ <- folder with all your notes, doc files, etc.
Then, to read a file called nice_data.csv that is located in the folder data, you provide the following path to pandas:
rel_path = "data/nice_data.csv"
data_frame = pd.read_csv(rel_path)
data_frame.head(5)
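If you are ever unsure which folder a relative path is resolved against, you can print the current working directory (a quick sanity check, not part of the original example):

import os
print(os.getcwd())  # the folder against which relative paths are resolved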
If you are using absolute paths you provide the full path, starting from the root of your file system. In my case, for the example above this would be:
abs_path = "/Users/graebnerc/Dropbox/fancy_project/data/nice_data.csv"
The advantage of absolute paths is that they are unique on your computer.
If, for some reason, you change your working directory during your work (e.g. with the function os.chdir) and you are using a relative path, the file will probably not be found.
Whenever you give an absolute path, however, there is no such problem.
Yet, in almost all cases relative paths are preferable. Why? Suppose you want to collaborate with others and you share the project directory (e.g. via Dropbox or, better, GitHub). Then code using absolute paths will only work for the person who wrote it, since the location of the project folder is different for every collaborator. Relative paths, however, are the same for all collaborators, so using them works for everybody who shares the same project folder:
abs_path = "/Users/graebnerc/Dropbox/fancy_project/data/nice_data.csv"
# This path works only on my computer
rel_path = "data/nice_data.csv"
# This path works for everybody using the same project directory
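A small additional tip that goes beyond the example above: building relative paths with os.path.join makes them portable across operating systems, since the correct separator is inserted automatically:

import os
rel_path = os.path.join("data", "nice_data.csv")  # 'data/nice_data.csv' on macOS/Linux, 'data\nice_data.csv' on Windows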
csv stands for 'comma separated values' and is among the most standard and useful formats, simply because it can be read by basically every program on every operating system.
Suppose we have the following data frame:
cons_dict2 = {"time": list(range(50)), "consumption" : np.random.randint(10, size=50), "capital" : np.random.random(size=50)}
expl_frame = pd.DataFrame(cons_dict2)
expl_frame = expl_frame[['time', 'capital', 'consumption']]
expl_frame.head(4)
To save this frame to a csv file we simply use the method to_csv:
rel_path = "data/nice_data.csv"
expl_frame.to_csv(rel_path)
To read csv files we use the function pd.read_csv:
data_frame_imported = pd.read_csv(rel_path)
data_frame_imported.head(4)
Note that our code above saves the row indices as a proper column. This is usually not what we want. To suppress this, use:
expl_frame.to_csv(rel_path, index=False)
data_frame = pd.read_csv(rel_path)
data_frame.head(5)
For further information on the arguments of to_csv see the documentation page.
Typical adjustments that need to be made when reading csv data are: which sign is used to separate the columns (specified by the argument sep)? What is the decimal sign (specified by the argument decimal)? Should the first row be used as column names (specified by the argument header)? Etc.
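As an illustration, here is a hypothetical call for a file that uses semicolons as column separators and commas as decimal signs, as is common in German-language csv files (the file name is made up):

data_frame = pd.read_csv("data/german_style_data.csv", sep=";", decimal=",", header=0)  # header=0: use the first row as column names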
While csv is a good file format for many tasks, and the standard format of official statistical offices, it also has some disadvantages: you can only store a single table per file, the compression rate is low, reading and writing speed is not too great, you cannot store accompanying meta data, etc.
A very useful alternative for saving the results of an ABM is the HDF5 file format.
Consider our data frame from above:
expl_frame.head(4)
To begin, we first need to create a store instance:
store = pd.HDFStore('output/h5_results.h5')
We can now add data to the store, for example our example frame, as well as two of the data frames we defined earlier:
store['expl_frame'] = expl_frame
store['old_frame_1'] = dat_frame_3
store['old_frame_2'] = dat_frame_5
Then, we close the store:
store.close()
store.is_open  # returns False once the store has been closed
We can also write the code above more compactly using the following syntax:
with pd.HDFStore('output/h5_results.h5') as store:
    store['expl_frame'] = expl_frame
    store['old_frame_1'] = dat_frame_3
    store['old_frame_2'] = dat_frame_5
This also automatically closes the store.
To read such data in a new Python session, simply do the following:
imported_store = pd.HDFStore('output/h5_results.h5')
old_frame = imported_store['old_frame_1']
imported_store.close()
old_frame.head(2)
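If you do not remember what a store contains, you can list its keys (a small addition to the example above):

with pd.HDFStore('output/h5_results.h5') as imported_store:
    print(imported_store.keys())  # e.g. ['/expl_frame', '/old_frame_1', '/old_frame_2']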
For more information you might check out the documentation page.
Another useful file format is feather, which is very fast to read and write and can also be used to exchange data with R. Consider again our data frame from above:
expl_frame.head(4)
To write it to a feather file simply do:
expl_frame.to_feather('data/expl_frame.feather')
And to read it:
expl_frame_loaded = pd.read_feather('data/expl_frame.feather')
Check whether the two are the same:
expl_frame_loaded.equals(expl_frame)
Basically, all import functions work in a similar way.
Here you can get an overview of all the available functions, which allow you to import basically all relevant data into pandas.
To get a concise intro to pandas that also covers data frame manipulation, analysis and visualization, you might have a look at the official - yet euphemistically named - introduction 10 minutes to Pandas. It also contains a lot of good references for further reading.