Thomas Cokelaer's blog | Notes on Data Analysis, Computer Science, Python, Biology, …

pandas.read_csv: how to skip empty lines

Posted on May 16, 2014 by Thomas Cokelaer

Let us suppose that we start with a CSV file that has empty rows:

A, B, C
 
1, 2, 3

If you read this file with the Pandas library and look at the content of your data frame, you get 2 rows, including the empty one, which has been filled with NaNs:

>>> import pandas as pd
>>> df = pd.read_csv("test.csv", sep=",")
>>> print(df)
    A   B   C
0 NaN NaN NaN
1   1   2   3
 
[2 rows x 3 columns]

At the time of writing, read_csv has no option to ignore such rows, so you need to do it yourself (later pandas releases added a skip_blank_lines parameter for fully blank lines). Fortunately, the dropna method is handy:

df.dropna(how="all", inplace=True)
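As a minimal, self-contained sketch of the same idea, the snippet below reads the CSV content from an in-memory string instead of test.csv. The string csv_text is an assumption made for illustration: its empty row is written as ",," so that it is parsed as a row of NaNs whatever the pandas version.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: the second row has only empty fields.
csv_text = "A,B,C\n,,\n1,2,3\n"

df = pd.read_csv(io.StringIO(csv_text))

# how="all" drops a row only if every value in it is missing.
cleaned = df.dropna(how="all")
print(cleaned)
```

Using how="all" rather than the default how="any" keeps rows that are only partially filled.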

Posted in Python | Tagged pandas | 2 Comments

ipython notebook and matplotlib

Posted on April 25, 2014 by Thomas Cokelaer

To display plots inside an IPython notebook, include the following code at the beginning:

%pylab inline

Images may be tiny. You can change their size by adjusting the matplotlib configuration (rcParams), for instance by doubling the resolution:

matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']
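In recent matplotlib versions, savefig.dpi may hold the string "figure" instead of a number, in which case the multiplication above cannot work. The following is a hedged sketch that falls back to figure.dpi in that case; the variable name current is only illustrative.

```python
import matplotlib

current = matplotlib.rcParams["savefig.dpi"]
if current == "figure":
    # "figure" means: reuse the figure's own resolution when saving.
    current = matplotlib.rcParams["figure.dpi"]

# Double the resolution used for saved (and inline) images.
matplotlib.rcParams["savefig.dpi"] = 2 * current
```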

See the notebook for an example within nbviewer

http://nbviewer.ipython.org/gist/anonymous/3301035

Posted in Python | Tagged ipython, nbviewer | Leave a comment

Python : How to flush output of print

Posted on March 24, 2014 by Thomas Cokelaer

Using the sys module, just type:

import sys
sys.stdout.flush()
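In Python 3.3 and later, print also accepts a flush argument, which avoids the separate call. Here is a small sketch; the helper name report is hypothetical.

```python
import io
import sys


def report(message, stream=None):
    """Print a message and flush the stream immediately (hypothetical helper)."""
    stream = sys.stdout if stream is None else stream
    print(message, file=stream, flush=True)


report("step 1 done")  # written to the terminal right away

buffer = io.StringIO()  # any file-like object works as well
report("step 2 done", stream=buffer)
```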

Posted in Python | Tagged python | Leave a comment

python argparse issues with the help argument (TypeError: %o format: a number is required, not dict)

Posted on March 18, 2014 by Thomas Cokelaer

Even though Python has great documentation, once in a while you get stuck on a single problem for longer than expected.
It happened to me today with the argparse module, which I thought I knew well enough to quickly code a user interface.
Here is an example of what I was trying to do (the real code was more complex, but you will get the idea).

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--elitism', type=int, default=5, dest="elitism",
                    help='should be an integer less than popsize. Ideally, about 10%')
args = parser.parse_args()

I got this kind of error:

TypeError: %o format: a number is required, not dict

The reason is that the help string contains a % sign, which argparse interprets as the start of a format specifier. As simple as that. The error message is misleading and the code was embedded in a more complex application, so this was not easy to track down, which was a bit frustrating.

The solution is to double the percent character:

help="about 10%%"
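As a sketch of the fix in context: argparse applies %-formatting to help strings, which is why specifiers such as %(default)s work and why a literal percent sign must be written %%. The program name "demo" and the exact help wording are assumptions made for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")  # "demo" is a made-up program name
parser.add_argument(
    "--elitism", type=int, default=5, dest="elitism",
    # %% yields a literal percent sign; %(default)s is replaced by the default value.
    help="about 10%% of popsize (default: %(default)s)",
)

# format_help() builds the text that --help would print.
help_text = parser.format_help()
print(help_text)
```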

Posted in Python | Tagged python | 36 Comments

Fourier transformation explained

Posted on February 5, 2014 by Thomas Cokelaer

I love the following image found in this blog.

Posted in Data analysis | Tagged Fourier | 1 Comment

Pandas dataframe: grouping column by name

Posted on January 31, 2014 by Thomas Cokelaer

Let us consider this data set in CSV format:

DIG1, DIG1, DIG1, DIG2
1, 3, 5, 8
2, 4, 6, 9
3, 5, 7,  10

The problem is to read the data and average the columns that have the same name.

You can read the CSV file with the pandas.read_csv function or build the data frame manually as follows:

import pandas as pd
df = pd.DataFrame({"DIG1": [1,2,3], "DIG1.1": [3,4,5], "DIG1.2": [5,6,7], "DIG2":[8,9,10]})

Note that, when reading the file, pandas appends a numeric suffix (.1, .2, and so on) to column names that are identical (here DIG1), so we need to deal with this issue. First, let us transpose the data:

>>> df = df.transpose()
>>> df
        0  1   2
DIG1    1  2   3
DIG1.1  3  4   5
DIG1.2  5  6   7
DIG2    8  9  10

Let us call reset_index so that indices are now in a column

>>> df = df.reset_index()
>>> df
    index  0  1   2
0    DIG1  1  2   3
1  DIG1.1  3  4   5
2  DIG1.2  5  6   7
3    DIG2  8  9  10

We then rename the entries of the index column to get rid of the dots and the suffixes that follow them:

df['index'] = [this.split(".")[0] for this in df['index']]

and finally, we can group by name and transpose back

>>> df.groupby("index").mean().transpose()
index  DIG1  DIG2
0         3     8
1         4     9
2         5    10

There may be a better solution to do that but it works for now.
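One possible shorter variant, given here as a sketch and not as the definitive way, passes a function to groupby so that the suffix is stripped while grouping. The helper name base_name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"DIG1": [1, 2, 3], "DIG1.1": [3, 4, 5],
                   "DIG1.2": [5, 6, 7], "DIG2": [8, 9, 10]})


def base_name(column):
    """Strip the suffix added to duplicated names, e.g. 'DIG1.2' -> 'DIG1'."""
    return column.split(".")[0]


# After transposing, column names become index labels; groupby calls
# base_name on each label and averages the rows sharing the same result.
averaged = df.transpose().groupby(base_name).mean().transpose()
print(averaged)
```

This skips the reset_index step, since a function given to groupby is applied directly to the index labels.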

Posted in Python | Tagged pandas | 5 Comments

Starting to use Python Pandas library

Posted on January 29, 2014 by Thomas Cokelaer

Pandas is brilliant.

If you are looking to manipulate data in Python, use it. I am refactoring most of my code with this library. It provides many useful data structures to manipulate data in 2, 3, 4 dimensions and more. For R users, it will feel familiar: it implements the DataFrame structure and functionalities such as melting. Another great feature is that Pandas is linked to matplotlib and offers plot, histogram and boxplot tools in many of its data structures. Yet, it takes a little bit of practice to fully understand the possibilities.

The Pandas tutorial and user guide are great, so please have a look at them. Some very basic data manipulations are not that intuitive when you don't know them. So, I summarise here some standard functionalities that took me more time than expected, and I hope it will be handy for beginners. There may be better or faster solutions; experience and feedback will tell.

Creating a data frame and accessing columns, rows, values

First of all, let us create a DataFrame instance, which is nothing more than a matrix with names on the rows and columns. In Pandas terminology, the row labels are called the index. Let us create a DataFrame from a dictionary (there are many ways of creating data frames; the read_csv function can also be handy):

>>> import pandas as pd
>>> df1 = pd.DataFrame({"weight": [80,60,70], "age":[20,40,50]})
>>> print(df1)
   age  weight
0   20      80
1   40      60
2   50      70
 
[3 rows x 2 columns]

To access the age column:

df1['age']

Note that the result is a Series, so you can access an item by its index as you would with a list (e.g., df1['age'][0] returns 20).
However, in general, you may request several columns:

df1[ ['age', 'weight']]

in which case accessing the first row requires a special indexer called ix (deprecated in later pandas versions and removed in pandas 1.0 in favour of loc and iloc). Here, we access the first row:

>>> df1.ix[0]
age       20
weight    80
Name: 0, dtype: int64

which is equivalent to:

>>> df1[['age', 'weight']].ix[0]
age       20
weight    80
Name: 0, dtype: int64

Keep in mind that you can access a value by using indices or names. For instance all these statements are equivalent:

df1.ix[0,'age']
df1.ix[0,0]
df1['age'][0]
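Since ix is no longer available in recent pandas releases, here is a sketch of the same accesses written with loc (label based) and iloc (position based). The variable names are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"weight": [80, 60, 70], "age": [20, 40, 50]})

# First row, selected by position.
first_row = df1.iloc[0]

# Three ways to read the same value.
by_label = df1.loc[0, "age"]                               # row label, column name
by_position = df1.iloc[0, df1.columns.get_loc("age")]      # row and column positions
via_column = df1["age"][0]                                 # column first, then row label

print(first_row)
print(by_label, by_position, via_column)
```

Looking up the column position with get_loc avoids hard-coding it, because the column order depends on how the data frame was built.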

append rows and columns

Appending a column is quite easy: simply assign a new column as follows:

df1['gender'] = ['m','f','m']

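To add a new row, you can use several data structures. Here, we use a list of dictionaries, where each dictionary corresponds to one new row. You may also use another data frame (note that DataFrame.append was removed in pandas 2.0, where pandas.concat is the replacement):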

df1 = df1.append([{'age':25,'weight':75, 'gender':'m'}])
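For pandas versions without DataFrame.append, the same step can be sketched with pandas.concat. The name new_rows is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"weight": [80, 60, 70], "age": [20, 40, 50]})
df1["gender"] = ["m", "f", "m"]

# Each dictionary in the list describes one new row.
new_rows = pd.DataFrame([{"age": 25, "weight": 75, "gender": "m"}])

# Columns are aligned by name; ignore_index=True renumbers the rows 0..n-1.
df1 = pd.concat([df1, new_rows], ignore_index=True)
print(df1)
```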

plotting

Of course, as mentioned, there are plotting functionalities. Here is a quick, simple, but handy example. You can select the x and y variables; otherwise, all columns (variables) are plotted together, which may be very useful as well.

df1.plot(x="age", y="weight")
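A self-contained sketch of this call is shown below. It assumes matplotlib is installed and selects the non-interactive Agg backend so that it also runs without a display; the axis label is set explicitly for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import pandas as pd

df1 = pd.DataFrame({"weight": [80, 60, 70], "age": [20, 40, 50]})

# plot returns a matplotlib Axes object that can be customised further.
ax = df1.plot(x="age", y="weight")
ax.set_ylabel("weight")
```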

Posted in Python | Tagged pandas, python | Leave a comment

Reading Nikon Raw image with Gimp plugin UFRaw

Posted on January 28, 2014 by Thomas Cokelaer

I recently started to play with RAW images from a Nikon camera. The first issue was that gimp could open the images, but only in a kind of thumbnail version. After browsing around, it looks like one tool to read RAW images is UFRaw, which provides a gimp plugin. Under Fedora 17, I could not find any package, so installation from source was required. Here are the steps I followed:

First, get the UFRaw source files from

http://ufraw.sourceforge.net/Install.html

Uncompress in a directory. Go to the directory that has been created and install the package manually as follows

mkdir ~/installation
cd ~/installation
mv ~/Downloads/ufraw-0.19.2.tar.gz .
tar xvfz ufraw-0.19.2.tar.gz
cd ufraw-0.19.2
./configure --with-gimp
sudo make install

Note the --with-gimp option. I got a couple of configuration errors that will surely depend on your system/distribution. Here are the issues I got on Fedora 17 and how they were resolved:

First, at the configure stage, the lcms and gtkimageview libraries and devel libraries were missing, which was simply solved using yum:

sudo yum install gtkimageview gtkimageview-devel lcms lcms-devel

Then the --with-gimp option requires the devel package of gimp:

sudo yum install gimp-devel

Finally, once UFRaw is installed, you can start gimp and open a .NEF file normally. The UFRaw window will pop up automatically.

Posted in photos | Tagged gimp | Leave a comment

pandas import fails with ImportError: cannot import name hashtable

Posted on January 13, 2014 by Thomas Cokelaer

Import of pandas raises the following error:

ImportError: cannot import name hashtable

Starting a python session, I typed

from pandas import hashtable

and got this new message:

ImportError: cannot import name hashtable

The issue was that the wrong version of numpy was picked up: I was in a virtual environment that picked up the numpy installed globally instead of the local one installed in the virtual environment.
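A quick way to check which interpreter and which numpy installation are actually in use is sketched below; the paths it prints depend entirely on your setup.

```python
import sys

import numpy

# Inside a virtual environment, both paths are expected to point into it.
print("interpreter:   ", sys.executable)
print("numpy version: ", numpy.__version__)
print("numpy location:", numpy.__file__)
```

If the numpy location is outside the virtual environment, the global installation is being picked up.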

Posted in Python | Tagged pandas | Leave a comment

Line starting '<!DOCTYPE html> ...' is malformed! error

Posted on November 10, 2013 by Thomas Cokelaer

Playing with a Python wrapper of an R package, I suddenly got this error after a trivial change of the code:

Line starting '<!DOCTYPE html> ...' is malformed!

After wondering for a couple of minutes where it could come from, without success, I googled the error and realised that it had nothing to do with Python or my code. Actually, the error vanished after a couple of minutes.

My code accesses the BioConductor website, and it looks like the error was due to the BioConductor build system, which prevented the website from being accessed.

The error message is the top of a redirect page. So if you see this error, just wait a couple of minutes.

Posted in R | Tagged bioconductor, python, R | Leave a comment
