Starting to use the Python Pandas library

Pandas is brilliant.

If you are looking to manipulate data in Python, use it. I am refactoring most of my code with this library. It provides many useful data structures to manipulate data in 2, 3, 4 dimensions and more. For R users, it will feel familiar: it implements the DataFrame structure and functionality such as melting. Another great feature is that Pandas is linked to matplotlib and offers plot, histogram and boxplot tools on many of its data structures. Still, it takes a little practice to fully understand the possibilities.

The Pandas tutorial and user guide are great, so please have a look at them. Some very basic data manipulations are not that intuitive until you know them, so I have summarised here a few standard operations that took me more time than expected, in the hope that they will be handy for beginners. There may be better or faster solutions; experience and feedback will tell.

Creating a data frame and accessing columns, rows, values

First of all, let us create a DataFrame instance, which is nothing more than a matrix with names on the rows and columns. In Pandas terminology, the row labels are called the index. Let us create a DataFrame from a dictionary (there are many ways of creating data frames; the read_csv function can be handy):

>>> import pandas as pd
>>> df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})
>>> print(df1)
   age  weight
0   20      80
1   40      60
2   50      70
 
[3 rows x 2 columns]
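Since read_csv is mentioned above, here is a minimal sketch of that route. The CSV content is made up for illustration and held in memory (via io.StringIO) so that no file is needed; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so the example needs no file on disk.
csv_text = "age,weight\n20,80\n40,60\n50,70\n"

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the CSV is used as the column names, and a default integer index is created for the rows.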

To access the age column:

df1['age']

Note that the result is a Series, so you can access an item by its index as you would with a list (e.g., df1['age'][0] returns 20).
However, in general, you may request several columns:

df1[ ['age', 'weight']]

in which case you get a DataFrame back, and accessing a row goes through an indexer property: loc for label-based access or iloc for position-based access (older tutorials use ix, which was removed in pandas 1.0). Here, we access the first row:

>>> df1.loc[0]
age       20
weight    80
Name: 0, dtype: int64

which is equivalent to:

>>> df1[['age', 'weight']].loc[0]
age       20
weight    80
Name: 0, dtype: int64
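To make the difference between these selections concrete, the following sketch checks what type each one returns. It rebuilds the same small data frame so that it can be run on its own, and it assumes a recent pandas version where loc is the label-based indexer.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

one_column = df1["age"]               # a single label gives a Series
two_columns = df1[["age", "weight"]]  # a list of labels gives a DataFrame
first_row = df1.loc[0]                # one row is also a Series

print(type(one_column).__name__)
print(type(two_columns).__name__)
# The row Series is indexed by the column names of the data frame.
print(list(first_row.index))
```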

Keep in mind that you can access a value by label or by position. With the default integer index used here, and age being the first column, all these statements are equivalent:

df1.loc[0, 'age']
df1.iloc[0, 0]
df1['age'][0]
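The distinction between labels and positions is easier to see when the index is not the default 0, 1, 2. The sketch below uses made-up row labels (alice, bob, carol) purely for illustration, and relies on loc and iloc, the label-based and position-based indexers of recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 40, 50], "weight": [80, 60, 70]},
    index=["alice", "bob", "carol"],  # hypothetical row labels
)

by_label = df.loc["bob", "age"]      # row label, then column label
by_position = df.iloc[1, 0]          # second row, first column
by_column_first = df["age"]["bob"]   # select the column, then the row label

print(by_label, by_position, by_column_first)
```

All three expressions read the same cell; which one to use mostly depends on whether you know the name or the position of what you are looking for.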

Appending rows and columns

To append a column, simply assign a new column as follows:

df1['gender'] = ['m','f','m']
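Column assignment accepts more than a list. As a small sketch (the country and age_in_months columns are invented for the example), a scalar is repeated on every row, an expression on existing columns creates a derived column, and a list of the wrong length is rejected.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

df1["gender"] = ["m", "f", "m"]          # one value per row
df1["country"] = "unknown"               # a scalar is repeated on every row
df1["age_in_months"] = df1["age"] * 12   # derived from an existing column

# A list whose length differs from the number of rows raises ValueError.
try:
    df1["too_short"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(df1)
print(length_error)
```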

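To add a new row, you can use several data structures. Here, we use a list of dictionaries, each dictionary corresponding to one new row, wrapped in a data frame and concatenated with pd.concat (DataFrame.append, found in older tutorials, was removed in pandas 2.0). You may also concatenate another existing data frame:

df1 = pd.concat([df1, pd.DataFrame([{'age':25, 'weight':75, 'gender':'m'}])])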
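One detail worth checking when adding rows is what happens to the index. The sketch below uses pd.concat and compares the result with and without ignore_index=True; the variable names are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"age": [20, 40, 50], "weight": [80, 60, 70], "gender": ["m", "f", "m"]}
)
new_rows = pd.DataFrame([{"age": 25, "weight": 75, "gender": "m"}])

# Each frame keeps its own index, so the label 0 appears twice.
kept_index = pd.concat([df1, new_rows])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([df1, new_rows], ignore_index=True)

print(kept_index.index.tolist())
print(renumbered.index.tolist())
```

With a duplicated label, a lookup such as kept_index.loc[0] returns two rows instead of one, so renumbering is usually the safer choice unless the index carries meaning.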

Plotting

Of course, as mentioned, there are plotting functionalities. Here is a quick and simple, but handy, example. You can select the x and y variables; otherwise all columns (variables) are plotted together against the index, which may be very useful as well.

df1.plot(x="age", y="weight")
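For completeness, here is a self-contained sketch of the same plot that also runs in a script without a display. It assumes matplotlib is installed; the Agg backend and the in-memory PNG buffer are choices made for this example only.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

df1 = pd.DataFrame({"age": [20, 40, 50], "weight": [80, 60, 70]})

# plot returns a matplotlib Axes object that can be customised further.
ax = df1.plot(x="age", y="weight")
ax.set_ylabel("weight")

# Render the figure to PNG in memory; pass a file name to savefig to write a file.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

In an interactive session you would normally skip the backend line and simply call matplotlib.pyplot.show() after the plot.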