Pandas is brilliant.
If you are looking to manipulate data in Python, use it. I am refactoring most of my code with this library. It provides many useful data structures to manipulate data in 2, 3, 4 dimensions and more. For R users, it will feel familiar: it implements the DataFrame structure and functionality such as melting. Another great feature is that Pandas is linked to matplotlib and offers plot, histogram and boxplot tools on many of its data structures. Yet, it takes a little practice to fully understand the possibilities.
The Pandas tutorial and user guide are great, so please have a look at them. Still, some very basic data manipulations are not that intuitive until you know them. So I have summarised here some standard functionality that took me more time than expected, in the hope that it will be handy for beginners. There may be better or faster solutions; experience and feedback will tell.
Creating a data frame and accessing columns, rows, values
First of all, let us create a DataFrame instance, which is nothing more than a matrix with names on the rows and columns. In Pandas terminology, the row labels are called the index. Let us create a DataFrame from a dictionary (there are many other ways of creating data frames; the read_csv function can be handy):
>>> import pandas as pd
>>> df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})
>>> print(df1)
   age  weight
0   20      80
1   40      60
2   50      70
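Since read_csv is mentioned as an alternative way to build a data frame, here is a minimal sketch of it. The CSV content is made up for illustration and is read from an in-memory string (via io.StringIO) so that no file on disk is needed; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example is self-contained.
csv_text = """age,weight
20,80
40,60
50,70
"""

# read_csv accepts a path or a file-like object; StringIO wraps the string.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)          # (3, 2)
print(list(df_csv.columns))  # ['age', 'weight']
```

The first line of the CSV is used as the column names, and the index defaults to 0, 1, 2 as in the dictionary example.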
What about accessing the age column? Selecting it with df1['age'] returns a Series (a one-dimensional labelled array), so you can access an item by its index as you would with a list (e.g., df1['age'][0] returns 20).
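A short sketch of this single-column access, rebuilding the same small data frame so that the snippet runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

# Selecting one column by name returns a Series.
ages = df1["age"]
print(type(ages).__name__)  # Series

# With the default index (0, 1, 2), items are reached by their index label.
print(ages[0])              # 20

# A Series converts to a plain Python list when needed.
print(ages.tolist())        # [20, 40, 50]
```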
However, in general, you may request several columns, for instance df1[['age', 'weight']], in which case the result is a DataFrame and accessing a row requires a dedicated indexer. The original ix indexer was removed in pandas 1.0; use loc (by label) or iloc (by position) instead. Here, we access the first row:
>>> df1.loc[0]
age       20
weight    80
Name: 0, dtype: int64
that would be equivalent to:
>>> df1[['age', 'weight']].loc[0]   # loc replaces ix, which was removed in pandas 1.0
age       20
weight    80
Name: 0, dtype: int64
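To make the multi-column case concrete, here is a small sketch. It assumes a recent pandas, where rows are selected with loc; the column order requested in the list is only an illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

# A list of column names returns a DataFrame, in the requested order.
subset = df1[["weight", "age"]]
print(type(subset).__name__)  # DataFrame
print(list(subset.columns))   # ['weight', 'age']

# Selecting the row labelled 0 returns a Series indexed by column name.
first_row = subset.loc[0]
print(first_row["age"], first_row["weight"])  # 20 80
```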
Keep in mind that you can access a value by using labels (with loc) or integer positions (with iloc). For instance, all these statements are equivalent and return 20:
df1.loc[0, 'age']
df1.iloc[0, 0]
df1['age'][0]
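The difference between names and positions is easier to see when the index is not 0, 1, 2. The sketch below uses made-up row labels ("ann", "bob", "cat") purely for illustration and assumes a recent pandas, where loc is label-based and iloc is position-based.

```python
import pandas as pd

# Hypothetical row labels, chosen so that labels and positions differ.
df = pd.DataFrame(
    {"age": [20, 40, 50], "weight": [80, 60, 70]},
    index=["ann", "bob", "cat"],
)

by_label = df.loc["ann", "age"]   # row label, column name
by_position = df.iloc[0, 0]       # row position, column position
by_column = df["age"]["ann"]      # column first, then row label

print(by_label, by_position, by_column)  # 20 20 20
```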
Appending rows and columns
Appending a column is quite easy: assign a new column as follows:
df1['gender'] = ['m','f','m']
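A few variations on column assignment, as a sketch. The column names over_30, unit and bad are hypothetical and only serve the illustration. The last step shows that a list whose length does not match the number of rows is rejected.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

df1["gender"] = ["m", "f", "m"]   # one value per row
df1["over_30"] = df1["age"] > 30  # column derived from an existing one
df1["unit"] = "kg"                # a scalar is repeated on every row

try:
    df1["bad"] = [1, 2]           # 2 values for 3 rows
except ValueError as err:
    print("rejected:", err)

print(list(df1.columns))
```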
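To add a new row, you can use several data structures. Here, we use a list of dictionaries, where each dictionary corresponds to one new row; you may also use another data frame. DataFrame.append was removed in pandas 2.0, so the rows are wrapped in a DataFrame and concatenated with pd.concat:
df1 = pd.concat([df1, pd.DataFrame([{'age': 25, 'weight': 75, 'gender': 'm'}])], ignore_index=True)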
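Here is a sketch of adding rows from another data frame, assuming a recent pandas where pd.concat is the way to stack frames. The second new row deliberately leaves out gender (made-up data) to show that missing values are filled with NaN.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"age": [20, 40, 50], "weight": [80, 60, 70], "gender": ["m", "f", "m"]}
)

# Each dictionary is one new row; the second one has no 'gender' key.
new_rows = pd.DataFrame(
    [{"age": 25, "weight": 75, "gender": "m"}, {"age": 33, "weight": 68}]
)

# ignore_index=True renumbers the rows 0..4 instead of repeating 0 and 1.
combined = pd.concat([df1, new_rows], ignore_index=True)
print(combined)
```

Without ignore_index=True the new rows would keep their own labels (0 and 1), so the index would contain duplicates.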
Plotting
Of course, as mentioned, there is plotting functionality. Here is a quick, simple but handy example. You can select the x and y variables; otherwise all columns (variables) are plotted together, which may be very useful as well.
df1.plot(x="age", y="weight")
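The sketch below contrasts the two cases and adds a boxplot. It assumes matplotlib is installed and selects the non-interactive "Agg" backend so that it runs without a display; in an interactive session you would drop that line and call plt.show() instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available, draw off-screen
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

# x and y given: a single line, weight against age.
ax_xy = df1.plot(x="age", y="weight")

# No x or y: every numeric column is drawn against the index.
ax_all = df1.plot()

# Other plot types are selected with the kind argument.
ax_box = df1.plot(kind="box")

n_xy = len(ax_xy.get_lines())
n_all = len(ax_all.get_lines())
print(n_xy, n_all)  # 1 2

plt.close("all")  # release the figures
```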