git : How to remove a big file wrongly committed

I added a large file (102 MB) to a git repository, committed it, pushed, and got an error because of GitHub's file size limit:

remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 7d51855d4f834a90c5a5a526e93d2668
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File coverage/sensitivity/simulated.bed is 102.00 MB; this exceeds GitHub's file size limit of 100.00 MB

Here, you can see the path of the offending file (coverage/sensitivity/simulated.bed).

So, the solution is actually quite simple (when you know it): you can use the filter-branch command, which rewrites the commits reachable from HEAD without the file, as follows:

git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
git push
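
To avoid the problem in the first place, one could check for oversized files before committing. Here is a minimal Python sketch; the function name find_large_files is hypothetical, and the limit assumes that the "100.00 MB" in the error message means 100 MiB.

```python
import os

# Assumption: the "100.00 MB" limit in the error message means 100 MiB.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root=".", limit=GITHUB_LIMIT_BYTES):
    """Return (path, size) pairs for files under `root` larger than `limit` bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links (they may be broken)
            size = os.path.getsize(path)
            if size > limit:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files():
        print("%s: %.2f MB" % (path, size / (1024 * 1024)))
```

Run from the top of the working copy, it lists every file that would trigger the GH001 error.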

git and github : skip password typing with https

If you clone a GitHub repository using the https:// method (instead of ssh), you will have to type your username and password all the time.

To avoid typing your password all the time, you can use the credential helpers, available since git 1.7.9.

git config --global credential.helper "cache --timeout=7200"

where

--timeout=7200

means “keep the credentials cached for 2 hours” (the default is 15 minutes).

You can also store the credentials permanently (note that they are then saved unencrypted on disk) using

git config credential.helper store

failed to convert from cram to bam (parse error CIGAR character)

In order to convert a bioinformatic file from CRAM to BAM format, I naively used the samtools command available on a cluster but got this error:

samtools view -T reference.fa -b -o output.bam input.cram
[sam_header_read2] 3366 sequences loaded.
[sam_read1] reference 'VN:1.4' is recognized as '*'.
Parse error at line 1: invalid CIGAR character

After a few attempts to fix the issue, I realised that the messages were prefixed with sam_ labels (sam_header_read2, sam_read1), which suggests that the input was being parsed as SAM. This indicates that the samtools version is a bit old. And indeed it was. I then used version 1.6 of samtools and it worked out of the box.


How to mount and create a partition on a hard drive dock (fedora)

I got a new hard drive (2.7 TiB) but wanted to use it with a docking station. Here are the steps required to use it on my Fedora box.

First, I naively went into the Nautilus File Browser hoping to see the hard drive mounted automatically. Of course it was not there: the hard drive is new and has no partition.

So, first, let us discover and check that the drive can be seen. We can use the fdisk command:

sudo fdisk -l
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: 3822C676-2317-437F-83E0-2358BA655039

You can see in this case that the disk is on device /dev/sdb.
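
As a quick sanity check of the numbers reported by fdisk, the arithmetic can be sketched in Python. The variable names are mine; the values are copied from the output above. It also shows why a drive reported as 2.7 TiB is roughly 3 TB in decimal units.

```python
SECTOR_SIZE = 512       # logical sector size in bytes, from the fdisk output
SECTORS = 5860533168    # number of sectors, from the fdisk output

size_bytes = SECTORS * SECTOR_SIZE
size_tib = size_bytes / 2**40    # binary unit (TiB), as printed by fdisk
size_tb = size_bytes / 10**12    # decimal unit (TB)

print(size_bytes)                  # 3000592982016
print(round(size_tib, 2), "TiB")   # 2.73 TiB
print(round(size_tb, 2), "TB")     # 3.0 TB
```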

I then started the tool gparted. In the top right corner, you can see the /dev/sdb device, which should also indicate the size of your hard drive.

As you can see the partition and file system are unallocated. First, you need to go to the menu

Device/Create Partition Table

to create a partition table on this hard drive.

Then, you can create a new partition by going to

Partition/New

Here, you get a new window.

I allocated the entire space to one partition. In this window, you need to give a name and a label. The name is for you; the label is for the system, so keep the label simple and do not use special characters (unless you know what you are doing).

For the file system, I kept the default. (Note that gpt, reported by fdisk above as the disklabel type, is the type of the partition table, not a file system.) Finally, once you are done, you need to press the apply button. The drive should be ready in a few seconds.

Go back to Nautilus File Browser and here you can see the new hard drive partition (in theory).

Change permission

Finally, you will see that in Nautilus you cannot create any folders or files: you do not have the permissions. To change this, you need to be in the list of sudo users. Then, find the path where your hard disk is mounted and type:

sudo chmod 0777 /run/media/yourdisk_path

AWK: convert into lower or upper cases

In order to convert a bash variable to lower case with awk, just use this command:

a="UPPER CASE"
echo "$a" | awk '{print tolower($0)}'

If you want to convert the content of a file (called data.csv) to lower case:

awk '{print tolower($0)}' data.csv

Of course to convert into upper case, simply use the function toupper() instead of tolower().

Note also that a better tool to avoid issues with special characters might be the tr unix command:

tr '[:upper:]' '[:lower:]' < input

How to sort a dictionary by values in Python

By definition, dictionaries are not sorted (to speed up access). Let us consider the following dictionary, which stores the age of several people as values:

d = {"Pierre": 42, "Anne": 33, "Zoe": 24}

If you want to sort this dictionary by values (i.e., the age), you must use another data structure such as a list, or an ordered dictionary.

Use the sorted function and operator module

import operator
sorted_d = sorted(d.items(), key=operator.itemgetter(1))

sorted_d is a list of tuples sorted by the second element in each tuple. Each tuple contains the key and value for each item found in the dictionary. If you look at the content of this variable, you should see:

[('Zoe', 24), ('Anne', 33), ('Pierre', 42)]

Use the sorted function and lambda function

If you do not want to use the operator module, you can use a lambda function:

sorted_d = sorted(d.items(), key=lambda x: x[1])
# equivalent version (Python 2 only: tuple parameters were removed in Python 3)
# sorted_d = sorted(d.items(), key=lambda (k,v): v)

The computation time is of the same order of magnitude as with the operator module. It would be interesting to test this on large dictionaries.

Use the sorted function and return an ordered dictionary

In the previous methods, the returned objects are lists of tuples. So we do not have a dictionary anymore. You can use an OrderedDict if you prefer:

>>> from collections import OrderedDict
>>> dd = OrderedDict(sorted(d.items(), key=lambda x: x[1]))
>>> print(dd)
OrderedDict([('Zoe', 24), ('Anne', 33), ('Pierre', 42)])

Use sorted function and list comprehension

Another method consists in using a comprehension (strictly speaking, a generator expression) and calling the sorted function on tuples made of (value, key).

sorted_d = sorted((value, key) for (key,value) in d.items())

Here the output is a list of tuples where each tuple contains the value and then the key:

[(24, 'Zoe'), (33, 'Anne'), (42, 'Pierre')]

A note about Python 3.6 native sorting

In a previous version of this post, I wrote that “In Python 3.6, the iteration through a dictionary is sorted.” This is wrong. What I meant is that in Python 3.6 a dictionary keeps the insertion order (an implementation detail of CPython 3.6 that became a language guarantee in Python 3.7).

It means that if you insert your items already sorted, the new Python 3.6 implementation will keep this information. Therefore, there is no need to sort the items anymore. Of course, if you insert the items randomly, you will still need to use one of the methods mentioned above.

For instance, taking care of the age, we now create our dictionary as follows (sorting by ascending age):

d = {"Zoe": 24}
d.update({'Anne': 33})
d.update({'Pierre': 42})

Now you can iterate through the items and they will be in the same order as in the creation of the dictionary. So you can just create a list from your items very easily:

list(d.items())
Out[15]: [('Zoe', 24), ('Anne', 33), ('Pierre', 42)]
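
Building on this, here is a minimal sketch (assuming Python 3.7 or later, where insertion order is guaranteed) that turns the sorted items back into a plain dictionary. The names by_age and oldest_first are only illustrative.

```python
d = {"Pierre": 42, "Anne": 33, "Zoe": 24}

# Insert the items in ascending order of age: the new dict keeps that order.
by_age = dict(sorted(d.items(), key=lambda item: item[1]))

# Same idea in descending order, using the reverse argument of sorted().
oldest_first = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))

print(list(by_age))        # ['Zoe', 'Anne', 'Pierre']
print(list(oldest_first))  # ['Pierre', 'Anne', 'Zoe']
```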

Benchmark

Here is a quick benchmark made using the small dictionary from the above examples. It would be interesting to redo the test with a large dictionary.

What the figure shows is that the native Python dictionary ordering does very well, followed by the combination of the lambda and list comprehension methods. Overall, using any one of these methods would be roughly equivalent, though (a factor of 2 to 3 at most).

The figure was created with the following code.

import operator                                                  
import pylab
from easydev import Timer
 
times1, times2, times3 = [], [], []
pylab.clf()
d = {"Pierre": 42, "Anne": 33, "Zoe": 24}
for j in range(20):
    N = 1000000
    with Timer(times3):
        for i in range(N):
            sorted_d = sorted((value, key) for (key,value) in d.items())
    with Timer(times2):
        for i in range(N):
            sorted_d = sorted(d.items(), key=lambda x: x[1])
    with Timer(times1):
        for i in range(N):
            sorted_d = sorted(d.items(), key=operator.itemgetter(1))
    print(j)
pylab.boxplot([times1, times2, times3])
pylab.xticks([1,2,3], ["operator", "lambda", "list comprehension and lambda"])
pylab.ylabel("Time (seconds) 1 million sorting \n (repeated 20 times)")
pylab.grid()
pylab.title("Performance sorted dictionary by values")
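
To redo the test with a larger dictionary, here is a hedged sketch based on the standard timeit module instead of easydev and pylab. The helper names make_dict and benchmark, the dictionary size and the number of repeats are arbitrary choices of mine, and no timings are claimed here.

```python
import operator
import random
import timeit


def make_dict(size, seed=0):
    """Build a dictionary of `size` made-up names with random ages."""
    rng = random.Random(seed)
    return {"person_%d" % i: rng.randint(0, 100) for i in range(size)}


def benchmark(size=10000, repeats=5, number=10):
    """Return the best time (in seconds) of each method for `number` sorts."""
    d = make_dict(size)
    candidates = {
        "operator": lambda: sorted(d.items(), key=operator.itemgetter(1)),
        "lambda": lambda: sorted(d.items(), key=lambda x: x[1]),
        "(value, key) tuples": lambda: sorted((v, k) for k, v in d.items()),
    }
    return {
        name: min(timeit.repeat(func, repeat=repeats, number=number))
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print("%-20s %.4f s" % (name, seconds))
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes running on the machine.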

Python: how to copy a list

To explain how to create a copy of a list, let us first create a list. We will use a simple list of 4 items:

list1 = [1, 2, "a", "b"]

Why do we want to create a copy anyway? Well, because in Python, this assignment creates a reference (not a new independent variable):

list2 = list1

To convince yourself, change the first item of list2 and then check the content of list1: you should see that both names show the same modified list.
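
This check can be sketched as follows. The name list3 is only there to contrast the reference with a real copy made by slicing.

```python
list1 = [1, 2, "a", "b"]
list2 = list1        # a second name for the same list object
list2[0] = 100

print(list1)            # [100, 2, 'a', 'b']
print(list2 is list1)   # True: both names refer to one object

list3 = list1[:]     # a real (shallow) copy
list3[0] = -1
print(list1[0])         # 100: list1 is not affected
print(list3 is list1)   # False
```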

So, to actually copy a list, you have several possibilities. From the simplest to the most complex:

  • you can slice the list.
    list2 = list1[:]
  • you can use the list() built in function
    list2 = list(list1)
  • you can use the copy() function from the copy module. This is slower than the previous methods though.
    import copy
    list2 = copy.copy(list1)
  • finally, if items of the list are objects themselves, you should use a deep copy (see example below):
    import copy
    list2 = copy.deepcopy(list1)

    To convince yourself of the value of the latter method, consider this list:

    import copy
    list1 = [1, 2, [3, 4]]
    list2 = copy.copy(list1)
    list2[2][1] = 40

    You should see that by changing list2, you also changed list1. If this is not the intended behaviour, you should consider using deepcopy.
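
Here is a minimal sketch comparing the two kinds of copy on the same nested list (the names shallow and deep are mine):

```python
import copy

list1 = [1, 2, [3, 4]]
shallow = copy.copy(list1)    # new outer list, but the inner list is shared
deep = copy.deepcopy(list1)   # the inner list is copied as well

shallow[2][1] = 40   # modifies the inner list shared with list1
deep[2][0] = 30      # modifies only the inner list owned by deep

print(list1)     # [1, 2, [3, 40]]
print(shallow)   # [1, 2, [3, 40]]
print(deep)      # [1, 2, [30, 4]]
```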


    Python: ternary operator

    In C (and many other languages), there is a ternary conditional operator that provides a compact if-else construct. For instance, in C, a traditional if-else construct looks like:

    if (a > b) {
        result = x;
    } else {
        result = y;
    }

    and the equivalent ternary operator looks like:

    result = a>b ? x : y;

    As in the if-else code, only one expression x or y is evaluated.

    In Python, from version 2.5, you would write:

    result = x if a > b else y

    More formally the ternary operator is written as:

    x if condition else y

    So condition is evaluated first, then either x or y is evaluated and returned based on the boolean value of condition.

    You can use the ternary operator within a list comprehension. For example:

    [1 if item > 0 else -1 for item in [0,1,-5,2]]
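
To illustrate that only one of the two expressions is evaluated, here is a small sketch in which x and y are hypothetical functions that record their calls. The comparison with 0 in the list comprehension is an assumption about the intended condition.

```python
calls = []  # records which branch was actually evaluated


def x():
    calls.append("x")
    return "x value"


def y():
    calls.append("y")
    return "y value"


a, b = 2, 1
result = x() if a > b else y()
print(result)   # x value
print(calls)    # ['x']: y() was never called

# Ternary operator inside a list comprehension (assumed condition: item > 0).
signs = [1 if item > 0 else -1 for item in [0, 1, -5, 2]]
print(signs)    # [-1, 1, -1, 1]
```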

    Difference between __repr__ and __str__ in Python

    When implementing a class in Python, you usually implement the __repr__ and __str__ methods.

    1. __str__ should return a readable message
    2. __repr__ should return a message that is unambiguous (e.g. name of an identifier, class name, etc).

    You can see __str__ as a method for users and __repr__ as a method for developers.

    Here is an implementation example for a class that simply stores an attribute (data).

    class Length():
        def __init__(self, data):
            self.data = data

    __str__ is called when a user calls the print() function while __repr__ is called when a user just types the name of the instance:

    >>> l = Length([1,2,3])
    >>> print(l)    # should call __str__ if it exists
    <__main__.Length object at 0x7faf240acc18>
    >>> l
    <__main__.Length object at 0x7faf240acc18>

    By default, when no __str__ or __repr__ methods are defined, __repr__ returns the name of the class (Length) together with the memory address of the instance, and __str__ calls __repr__.

    Now, let us define the __repr__ method ourselves to be more explicit:

    class Length():
        def __init__(self, data):
            self.data = data
        def __repr__(self):
            return "Length(%s) " % (len(self.data))

    We could use it as follows:

    >>> l = Length([1,2,3])
    >>> print(l)     # no __str__ defined, so falls back to __repr__
    Length(3)
    >>> l            # calls __repr__
    Length(3)

    When using the print() function in Python, __str__ is called if it is defined; otherwise, __repr__ is called. Let us now define both methods:

    class Length():
        def __init__(self, data):
            self.data = data
        def __repr__(self):
            return "Length(%s, %s) " % (len(self.data), id(self))
        def __str__(self):
            return "Length(%s) " % (len(self.data))

    So now __repr__ and __str__ have different behaviours:

    >>> l = Length([1,2,3])
    >>> print(l)     # calls __str__
    Length(3)
    >>> l            # calls __repr__
    Length(3, 140175447410224)
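
Outside the interactive prompt, the same distinction can be reached through the built-in str() and repr() functions and through string formatting. Here is a minimal sketch reusing the class above (without the trailing space in the returned strings):

```python
class Length:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # Unambiguous: includes the identity of the instance.
        return "Length(%s, %s)" % (len(self.data), id(self))

    def __str__(self):
        # Readable: only the length.
        return "Length(%s)" % len(self.data)


l = Length([1, 2, 3])
print(str(l))             # uses __str__, like print(l)
print(repr(l))            # uses __repr__, like typing l at the prompt
print("%s and %r" % (l, l))   # %s uses __str__, %r uses __repr__
print([l])                # containers display their items with __repr__
```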

    python: how to merge two dictionaries

    Let us suppose two dictionaries storing ages of different individuals:

    list1 = {"Pierre": 28, "Jeanne": 27}
    list2 = {"Marc": 32, "Helene": 34}

    If you do not mind losing the original contents of list1, you can update it with list2 as follows:

    list1.update(list2)

    Now list1 variable contains:

    {"Pierre": 28, "Jeanne": 27, "Marc": 32, "Helene": 34}

    while list2 is unchanged.

    Usually, this is not what you want though. Instead, you would prefer to create a third variable keeping list1 and list2 unchanged.

    In Python 3.5 or greater, you can use the following syntax:

    full_list = {**list1, **list2}

    In Python 2, or in Python 3.4 and below, you need to copy one of the variables and update it:

    full_list = list1.copy()  # this keeps list1 unchanged
    full_list.update(list2)   # inplace update of the variable full_list

    The second method is more generic and more backward compatible (useful if you plan to provide your code to Python 2 users). Indeed, it would work for both Python 2 and 3. However, it would be slower for users of Python 3.5 and above.
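
One point worth checking with either method is what happens when a key is present in both dictionaries. In the sketch below, "Pierre" is duplicated on purpose (my own example data) and the value from the second dictionary wins.

```python
list1 = {"Pierre": 28, "Jeanne": 27}
list2 = {"Pierre": 29, "Marc": 32}   # "Pierre" also appears in list1

# Python 3.5+: later unpacked dictionaries override earlier ones.
merged = {**list1, **list2}
print(merged)   # {'Pierre': 29, 'Jeanne': 27, 'Marc': 32}

# Backward-compatible version: same result, list1 stays unchanged.
full_list = list1.copy()
full_list.update(list2)
print(full_list == merged)   # True
print(list1)                 # {'Pierre': 28, 'Jeanne': 27}
```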
