Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Science with Python

SciPy - for Scientific Computing in Python

Data Science tools in Python

  • SciPy ecosystem of open-source software for mathematics, science, and engineering.

  • BioPython tools for biological computation.

  • NumPy to handle N-dimensional arrays.

  • Pandas Python Data Analysis Library. (Data Frames)

  • Matplotlib a 2D plotting library.

  • Seaborn data visualization library based on matplotlib.

  • Bokeh interactive visualization library.

  • SciKit-Learn Machine Learning in Python. (name from: SciPy Toolkit)

  • TensorFlow Machine learning framework. (primarily for deep-learning) (developed by Google engineer)

  • Keras Machine learning framework (primarily for deep-learning, neural-network). (On top of Tensorflow and now part of it.)

  • PyTorch Machine learnng framework. (primarily for deep-learning) (from Facebook)

  • Orange Machine learning and data visualization tool. Written partially in Python.

  • Airflow Workflow management platform (from AirBnB)

  • Luigi Data pipelines (from Spotify)

  • Showing speed improvement using a GPU with CUDA and Python with numpy on Nvidia Quadro 2000D

  • Octave (Open Source Matlab replacement - not related to Python)

  • There are a lot more libraries for each one of these tasks. Shop around!

Data Analysis resources

For Econometrics

For Intro Stats

Datasets

Python and Biology

Biopython

Biopython background

Bio python sequences

from Bio.Seq import Seq

# Nucleotide Sequences
my_dna = Seq("AGTACACTGGTAGGCCTTACAG_T")
print(my_dna)                       # AGTACACTGGTAGGCCTTACAG_T
print(my_dna.complement())          # TCATGTGACCATCCGGAATGTC_A
print(my_dna.reverse_complement())  # A_CTGTAAGGCCTACCAGTGTACT
print(my_dna.transcribe())          # AGUACACUGGUAGGCCUUACAG_U

my_rna = Seq("GAC_U")
print(my_rna)                       # GAC_U
print(my_rna.reverse_complement())  # A_GUC
print(my_rna.reverse_complement())  # A_GUC
print(my_rna.transcribe())          # GAC_U

from Bio.Seq import Seq

what_is_this = Seq("AGTC_U")
what_is_this.complement()  # ValueError: Mixed RNA/DNA found

Download and Read FASTA, GenBank files

For example the data about Orchids in two formats:

Download those files and use them:

from Bio import SeqIO
import requests

def get_file(url, filename):
    res = requests.get(url)
    if res.status_code != 200:
        raise Exception("Could not get file")

    with open(filename, 'w') as fh:
        fh.write(res.text)


def process_file(filename, file_type):
    for seq_record in SeqIO.parse(filename, file_type):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))


fasta_url = 'https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta'
filename = "ls_orchid.fasta"
file_type = "fasta"
get_file(fasta_url, filename)
process_file(filename, file_type)


genbank_url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk"
filename = "ls_orchid.gbk"
file_type = "genbank"
get_file(genbank_url, filename)
process_file(filename, file_type)

Search NCBI and Download FASTA and GenBank files

Use the NCBI (National Center for Biotechnology Information) database to search manually for nucleotide or tons of other types of data. Then one can download the files manually from the web site.

Search nucleotids

You can also search the same database programmatically.

from Bio import Entrez
Entrez.email = "gabor@szabgab.com"

term = "Cypripedioideae[Orgn] AND matK[Gene]"

handle = Entrez.esearch(db="nucleotide", term=term, idtype="acc", retmax=30)
record = Entrez.read(handle)
print(record["Count"])       # 538
print(record["IdList"])      # ['MK792700.1', 'MK792699.1', 'MK792698.1', ..., 'MK792681.1']
print(len(record["IdList"])) # 30
handle.close()


# term = "Orchid"
# 530077
# ['NZ_SELD00000000.2', 'NZ_SELD02000072.1',

Download nucleotids

from Bio import Entrez, SeqIO

Entrez.email = "gabor@szabgab.com"

#doc_id = 'MK792700.1'
doc_id = "EU490707"

# rettype="fasta"
handle = Entrez.efetch(db="nucleotide", id=doc_id, rettype="gb", retmode="text")
data = handle.read()
handle.close()
#print(data)

filename = "temp.data"
with open(filename, 'w') as fh:
    fh.write(data)

file_type = "genbank"
for seq_record in SeqIO.parse(filename, file_type):
    print(seq_record.id)
    print(repr(seq_record.seq))  # A short part of the sequence
    print()
    print(seq_record.seq)   # The full sequence
    print()
    print(len(seq_record.seq))
    print()
    print(seq_record.name)
    print()
    print(seq_record.annotations)
    #print()
    #print(dir(seq_record))

Exercise: Nucleotid

  • Search for your favorite nucleotid
  • Print out the number of results
  • Download the 3 different sequences from the list (using the id) in GeneBank format and save them in files using the id as the name of the file and .gb as the extension
  • Write a separate script that reads and displays the sequences.

Exercise: Search tool for NCBI

Write a program that can search NCBI and download and list sequences.

  • ncbi.py --term "SEARCH TERM" --download N this will search the NCBI databasefor nucleotides using the the search term and download N results. It will also save the search term, the current date, the total number of items found, and the list of Ids downloaded in a JSON file along with the annotation from each ID".
  • repeated use of the above command will save the new searches in the same JSON file.
  • ncbi.py --history list all the previously executed search term along with the date it was used, the total number of items found and number of items downloaded. (This information is taken from the JSON file.)
  • ncbi.py --list list some information about all the downloaded sequences. (e.g. the organism and taxonomy) along with the ID of each file.
  • Optional: ncbi.py --aa ID print the distribution of Amino Acids in the sequences in the given ID"

Biology background

Chemistry

Chemistry links

import sdf
import pybel

Python-based Simulations of Chemistry Framework

Bond length

Covalent radius

Python energy landscape explorer

Other chemistry links

numpy

What is NumPy

  • array

  • shape

  • dtype

  • numpy

  • High-level mathematical functions to operate on large, multi-dimensional arrays and matrices. ndarray

Install Numpy

pip install numpy

Numpy - vector

  • array
  • ndim
  • dtype
  • shape
import numpy as np

a = np.array([3, 4, 7])
print(a)         # [3 4 7]
print(a * 3)     # [ 9 12 21]
print(a + 4)     # [ 7  8 11]
print(a.dtype)   # int64
print(a.ndim)    # 1
print(a.shape)   # (3,)

b = np.array([2, 3.14, -1])
print(b.dtype)   # float64
print(b.shape)   # (3,)

c = np.array(['one', 'two', 'three'])
print(c.dtype)   # <U5    (Unicode less than 5 characters)

NumPy 2D arrays

import numpy as np

a = np.array([
    [ 1,  2,  3,  4,  5],
    [ 2,  3,  4,  5,  6]
])

print(a)
# [[1 2 3 4 5]
#  [2 3 4 5 6]]

print(a.shape) # (2, 5)
print(a.ndim)  # 2


print(a * 3)
# [[ 3  6  9 12 15]
#  [ 6  9 12 15 18]]

print(a + 7)
# [[ 8  9 10 11 12]
#  [ 9 10 11 12 13]]

Numpy - set type

  • dtype
  • int8
import numpy as np

a = np.array([3, 4, 7], dtype='int8')
print(a)         # [3 4 7]
print(a * 3)     # [ 9 12 21]
print(a + 4)     # [ 7  8 11]
print(a.dtype)   # int8

NumPy arrays: ones and zeros

  • ones
  • zeros
import numpy as np

c = np.ones(4, dtype='int32')
print(c)          # [1 1 1 1]
print(c.dtype)    # int32
print(c.shape)    # (4,)
print()


d = np.zeros(3, dtype='float32')
print(d)          # [ 0.  0.  0.]
print(d.dtype)    # float32
print(d.shape)    # (3,)
print()


a = np.ones([2, 3])
print(a)
# [[1., 1., 1.],
# [1., 1., 1.]]
print(a.dtype)    # float64
print(a.shape)    # (2, 3)

Numpy: eye

  • eye
import numpy as np

a = np.eye(4)
print(a)
print()

b = np.eye(3, 5)
print(b)
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]

NumPy array random

  • random
  • default_rng
import numpy as np

a = np.random.random((2, 5))  # in the range [0.0, 1.0)
print(a)
print()

rng = np.random.default_rng()
b = rng.random(size=(3, 4))
print(b)
[[0.32151126 0.07688622 0.95666894 0.42396291 0.93592235]
 [0.71406863 0.95152079 0.20199695 0.72628099 0.33545885]]

[[0.46643834 0.71350899 0.40279583 0.85148985]
 [0.19367868 0.53288449 0.97181597 0.86311691]
 [0.70687485 0.78534671 0.16654183 0.9371896 ]]

NumPy Random integers

  • random
  • default_rng
  • randint
  • integer
import numpy as np

a = np.random.randint(10, size=(3, 4))
print(a)

rng = np.random.default_rng()
b = rng.integers(42, size=(3, 4))
print(b)
[[1 2 2 6]
 [2 2 9 8]
 [8 8 9 5]]
[[13 31  7 11]
 [22  2  6 18]
 [24 10 12  0]]

NumPy array type change by division (int to float)

import numpy as np

a = np.array([3, 4, 7])
print(a.dtype) # int64
print(a.shape)  # (3,)

x = (a / 2)
print(x)        # [ 1.5  2.   3.5]
print(x.dtype)  # float64
print(x.shape)  # (3,)

Numpy: Array methods: transpose

  • transpose
import numpy

a = numpy.array([
    [ 1,  2,  3,  4,  5],
    [ 2,  3,  4,  5,  6]
])

b = a.transpose()

print(b)
# [[1 2]
#  [2 3]
#  [3 4]
#  [4 5]
#  [5 6]]

print(a)
# [[1 2 3 4 5]
#  [2 3 4 5 6]]

Numpy: reference, not copy

import numpy

a = numpy.array([
    [ 1,  2,  3,  4,  5],
    [ 2,  3,  4,  5,  6]
])

b = a.transpose()
a[0][0] = 42

print(b)
# [[42 2]
#  [2 3]
#  [3 4]
#  [4 5]
#  [5 6]]

print(a)
# [[42 2 3 4 5]
#  [2 3 4 5 6]]

Numpy: copy array

  • copy
import numpy

a = numpy.array([
    [ 1,  2,  3,  4,  5],
    [ 2,  3,  4,  5,  6]
])

b = a.copy().transpose()
a[0][0] = 42

print(b)
# [[1 2]
#  [2 3]
#  [3 4]
#  [4 5]
#  [5 6]]

print(a)
# [[42 2 3 4 5]
#  [2 3 4 5 6]]

Numpy: Elementwise Operations on Arrays

import numpy as np

a = np.array([
    [ 1,  2,  3,  4,  5],
    [ 2,  3,  4,  5,  6]
])
b = np.array([
    [ 7,  3,  8,  9,  4],
    [ 1,  3,  6,  1,  2]
])

print(a+b)
# [[ 8  5 11 13  9]
#  [ 3  6 10  6  8]]

print(a*b)
# [[ 7  6 24 36 20]
#  [ 2  9 24  5 12]]

Numpy: multiply, matmul, dot for vectors

  • multiply

  • matmul

  • dot

  • multiply Multiply arguments element-wise.

  • matmul Matrix product of two arrays.

  • dot Dot product of two arrays.

import numpy as np

a = np.array([3, 4, 7])
b = np.array([6, 5, 2])
print(a) # [3 4 7]
print(b) # [6 5 2]

c = np.multiply(a, b)
print(c) # [18 20 14]

d = np.dot(a, b)
print(d)   # 52

m = np.matmul(a, b)
print(m)   # 52

Numpy: multiply, matmul, dot for vector and matrix

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 4])
print(a)
print(b)
print()

print(a*b)
print(b*a)
print()

print(np.multiply(a, b))

print()
print( np.dot(a, b) )
print( np.matmul(a, b) )


[[1 2 3]
 [4 5 6]]
[1 2 4]

[[ 1  4 12]
 [ 4 10 24]]
[[ 1  4 12]
 [ 4 10 24]]

[[ 1  4 12]
 [ 4 10 24]]

[17 38]
[17 38]

Numpy: multiply, matmul, dot for matrices

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 3, 4], [7, 8, 0]])
print(a)
print(b)
print()

print(a*b)
print(b*a)
print()

print(np.multiply(a, b))

print()
print( np.dot(a, b.transpose()) )
print( np.matmul(a, b.transpose()) )

print()
print( np.dot(a.transpose(), b) )
print( np.matmul(a.transpose(), b) )

[[1 2 3]
 [4 5 6]]
[[1 3 4]
 [7 8 0]]

[[ 1  6 12]
 [28 40  0]]
[[ 1  6 12]
 [28 40  0]]

[[ 1  6 12]
 [28 40  0]]

[[19 23]
 [43 68]]
[[19 23]
 [43 68]]

[[29 35  4]
 [37 46  8]
 [45 57 12]]
[[29 35  4]
 [37 46  8]
 [45 57 12]]

Numpy: casting - converting from strings to integer

  • astype
import numpy as np

a = np.array([
    [ "12",  "23",  "3",  "4"],
    [ "2",  "3",  "4",  "5"]
])

print(a)
#[['12' '23' '3' '4']
# ['2' '3' '4' '5']]

try:
    b = a + 1
except Exception as e:
    print(e)
# TypeError: ufunc 'add' did not contain a loop with
#    signature matching types dtype('<U3') dtype('<U3') dtype('<U3')


c = a.astype(np.int) + 1
print(c)
# [[13 24  4  5]
# [ 3  4  5  6]]

Numpy: indexing 1d array

import numpy as np

a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a)       # [ 1 1 2 3 5 8 13 21 34]

print(a[4])    # 5
print(a[2:5])  # [2 3 5]

Numpy: slice is a reference

The slice in numpy does not copy the data structure

import numpy as np

a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a)       # [ 1 1 2 3 5 8 13 21 34]

b = a[2:5]
print(b)  # [2 3 5]

a[2] = 20
print(a)       # [ 1 1 20 3 5 8 13 21 34]
print(b)       # [20 3 5]

Numpy: slice - copy

import numpy as np

a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a)       # [ 1 1 2 3 5 8 13 21 34]

b = a[2:5].copy()
print(b)  # [2 3 5]

a[2] = 20
print(a)       # [ 1 1 20 3 5 8 13 21 34]
print(b)       # [2 3 5]

Numpy: abs value on a Numpy array

  • abs
  • absolute
import numpy as np

a = np.array([[-1, 2, -3], [-4, 5, -7]])
print(a)
print(a.dtype)
print()

abs_a = np.absolute(a)
print(abs_a)
print(abs_a.dtype)
print()
print(a)
[[-1  2 -3]
 [-4  5 -7]]
int64

[[1 2 3]
 [4 5 7]]
int64

[[-1  2 -3]
 [-4  5 -7]]

Numpy: Logical not on a Numpy array

  • logical_not
  • not
  • True
  • False
  • bool
import numpy as np

a = np.array([True, True, False])
print(a.dtype)
print(a)
print()

not_a = np.logical_not(a)
print(not_a.dtype)
print(not_a)
print()

b = np.array([True, True, False, 0, 42])
print(b.dtype)
print(b)
print()

not_b = np.logical_not(b)
print(not_b.dtype)
print(not_b)
print()
bool
[ True  True False]

bool
[False False  True]

int64
[ 1  1  0  0 42]

bool
[False False  True  True False]

Numpy: Vectorize a function

  • vectorize
import numpy as np

def fibo(n):
    if n == 1 or n == 2:
        return 1
    a, b = 1, 1
    for _ in range(n-2):
        a, b = b, a + b
    return b

print(fibo(12))
print(fibo.__class__.__name__)

vfibo = np.vectorize(fibo)
a = np.array([
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    ])
print(vfibo.__class__.__name__)

print(a)
print(a.dtype)
print()

b = vfibo(a)
print(b)
print(b.dtype)
144
function
vectorize
[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]]
int64

[[  1   1   2   3   5   8]
 [ 13  21  34  55  89 144]]
int64

Numpy: Vectorize len

  • vectorize
  • len
import numpy as np

animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)

vlen = np.vectorize(len)
print(vlen(animals))
['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[ 3  8  5  5 14]

Numpy: Vectorize lambda

  • vectorize
  • lambda
import numpy as np

animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)

longer_than_5 = np.vectorize(lambda x: len(x) > 5)
long_animals_bool = longer_than_5(animals)
print(long_animals_bool)
['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[False  True False False  True]

Numpy: Filtering array (selecting some of the values from an array)

  • vectorize
import numpy as np

animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)

longer_than_5 = np.vectorize(lambda x: len(x) > 5)
long_animals_bool = longer_than_5(animals)
print(long_animals_bool)

long_animals = animals[long_animals_bool]
print(long_animals)

['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[False  True False False  True]
['Elephant' 'Praying Mantis']

Numpy: Filter matrix values

import numpy as np
import re

scores = np.array([
    [23, 37, 18, 97, 13, 40],
    [10, 15, 20, 30, 39, 50],
    [99, 20, 83, 42, 19, 31],
    [19, 11, 55, 78, 39, 27]
])
print(scores)
print()

high_scores_boolean = (scores > 20)
print(high_scores_boolean)
print()

high_scores = scores[high_scores_boolean]
print(high_scores)
[[23 37 18 97 13 40]
 [10 15 20 30 39 50]
 [99 20 83 42 19 31]
 [19 11 55 78 39 27]]

[[ True  True False  True False  True]
 [False False False  True  True  True]
 [ True False  True  True False  True]
 [False False  True  True  True  True]]

[23 37 97 40 30 39 50 99 83 42 31 55 78 39 27]

Numpy: Filter matrix rows (selecting some rows)

import numpy as np

names = np.array(['Mary', 'Bar', 'Joe', 'Jane'])
print(names)
print()

def has_ar(text):
    return "ar" in text
    # if "ar" in text:
        # return True
    # else:
        # return False

names_with_ar_selector = np.vectorize(has_ar)
names_with_ar_bool = names_with_ar_selector(names)
print(names_with_ar_bool)
print()

scores = np.array([
    [23, 37, 18, 97, 13, 40],
    [10, 15, 20, 30, 39, 50],
    [99, 20, 83, 42, 19, 31],
    [19, 11, 55, 78, 39, 27]
])

print(scores[names_with_ar_bool])
['Mary' 'Bar' 'Joe' 'Jane']

[ True  True False False]

[[23 37 18 97 13 40]
 [10 15 20 30 39 50]]

[[23 37 18 97 13 40]
 [10 15 20 30 39 50]]

Numpy: Some statistics (sum, mean, std, var)

import numpy as np

scores = np.array([23, 37, 18, 97, 13, 40])
print(scores.sum())          # 228
print(len(scores))           # 6
print(scores.mean())         # 38.0

print(scores.std())          # 28.0950766743 standard deviation
print(scores.var())          # 789.333333333 variance
print(np.median(scores))     # 30.0
print(scores.max())          # 97
print(scores.min())          # 13

print(scores.cumsum())       # [ 23  60  78 175 188 228]

Numpy: Serialization (saving an array to a file)

import numpy as np

scores = np.array([
    [23, 37, 18, 97, 13, 40],
    [10, 15, 20, 30, 39, 50],
    [99, 20, 83, 42, 19, 31],
    [19, 11, 55, 78, 39, 27]
])
filename = 'scores.npy'
np.save(filename, scores)

s = np.load(filename)
print(s)

Numpy: Load from Matlab file to a Numpy array

import scipy.io

file_path = 'data.mat'
mat = scipy.io.loadmat(file_path)
data = mat['data']
print(type(data))
print(data)

  • numpy.ndarray

Numpy: Save a Numpy array as a Matlab file

import scipy.io
import numpy as np

data = np.random.random((2, 5))
print(data)

file_path = 'data.mat'
scipy.io.savemat(file_path, {'data': data})

Numpy: Horizontal stack vectors (hstack)

  • hstack
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
print(a)
print(b)
print(c)
print()

d = np.hstack([a, b])
print(d)
print()

e = np.hstack([d, c])
print(e)
[1 2 3]
[4 5 6]
[7 8 9]

[1 2 3 4 5 6]

[1 2 3 4 5 6 7 8 9]

Numpy: Append or vertically stack vectors and matrices (vstack)

  • vstack
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
print(a)
print(b)
print(c)
print()

m = np.vstack([a, b])
print(m)
print()

d3 = np.vstack([m, c])
print(d3)
[1 2 3]
[4 5 6]
[7 8 9]

[[1 2 3]
 [4 5 6]]

[[1 2 3]
 [4 5 6]
 [7 8 9]]

Numpy uint8

  • uint8
import numpy as np

a = np.array([127], 'uint8')
print(a.dtype)   # uint8
print(a)         # [127]

a[0] += 1        # [128]
print(a)

a[0] -= 1        # [127]
print(a)

a[0] = 255
print(a)         # [255]

a[0] += 1
print(a)         # [0]

Numpy int8

  • int8
import numpy as np

a = np.array([127], 'int8')
print(a.dtype)   # int8
print(a)         # [127]

a[0] += 1        # [-128]
print(a)

a[0] -= 1        # [127]
print(a)

a[0] = 255
print(a)         # [-1]

a[0] += 1
print(a)         # [0]

More Numpy

import numpy as np

matrix = np.array([
    [2, 3, 17],
    [27, 1, 10],
])

print(matrix.shape)
print(matrix.sum(axis = 1))
print(matrix.sum(axis = 0))

Pandas

Pandas

Datasets

  • Planets
  • StackOverflow survey

Planets data

{% embed include file="src/examples/pandas/planets.csv)

StackOverflow Survey data

  • Download one of the datasets from the Stack Overflow survey
  • unzip the file. Feel free to remove the __MACOSX/ directory if it is still there.

Planets - Read CSV into Dataframes

  • read_csv
import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)
print(df)

print()
print(type(df))
print(df.__class__.__name__)
   Planet name  Distance (AU)        Mass
0      Mercury           0.40    0.055000
1        Venus           0.70    0.815000
2        Earth           1.00    1.000000
3         Mars           1.50    0.107000
4        Ceres           2.77    0.000150
5      Jupiter           5.20  318.000000
6       Saturn           9.50   95.000000
7       Uranus          19.60   14.000000
8      Neptune          30.00   17.000000
9        Pluto          39.00    0.002180
10      Charon          39.00    0.000254

<class 'pandas.core.frame.DataFrame'>
DataFrame

Planets - DataFrame Statistics

  • columns
  • dtypes
  • index
  • values
  • describe
import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

print(df.columns)
print()

print(df.dtypes)
print()

print(df.index)
print()

print(df.values)
print()

print(df.describe())

Index(['Planet name', 'Distance (AU)', 'Mass'], dtype='object')

Planet name       object
Distance (AU)    float64
Mass             float64
dtype: object

RangeIndex(start=0, stop=11, step=1)

[['Mercury' 0.4 0.055]
 ['Venus' 0.7 0.815]
 ['Earth' 1.0 1.0]
 ['Mars' 1.5 0.107]
 ['Ceres' 2.77 0.00015]
 ['Jupiter' 5.2 318.0]
 ['Saturn' 9.5 95.0]
 ['Uranus' 19.6 14.0]
 ['Neptune' 30.0 17.0]
 ['Pluto' 39.0 0.00218]
 ['Charon' 39.0 0.000254]]

       Distance (AU)        Mass
count      11.000000   11.000000
mean       13.515455   40.543599
std        15.638671   96.205021
min         0.400000    0.000150
25%         1.250000    0.028590
50%         5.200000    0.815000
75%        24.800000   15.500000
max        39.000000  318.000000

Pandas Read CSV various datatypes

Example with a few more column-types (e.g. the column with the title MyBool contains True and False values and it is recognized as bool)

{% embed include file="src/examples/pandas/mixed.csv)

import pandas as pd

df = pd.read_csv('mixed.csv')
print( df.dtypes )
print( df.describe() )

MyText        object
MyInteger      int64
MyFloat      float64
MyBool          bool
MyExit         int64
dtype: object
       MyInteger   MyFloat    MyExit
count   3.000000  3.000000  3.000000
mean    7.333333  3.233333  1.333333
std     4.509250  0.862168  1.527525
min     3.000000  2.300000  0.000000
25%     5.000000  2.850000  0.500000
50%     7.000000  3.400000  1.000000
75%     9.500000  3.700000  2.000000
max    12.000000  4.000000  3.000000

Pandas Read CSV set dtype

When recognizing integers read_csv will default to int64, but if we would like to save memory we can set the dtype while we read the file.

import pandas as pd
import numpy as np

df = pd.read_csv('mixed.csv', dtype = { 'MyInteger' : np.int8 })
print( df.dtypes )

MyText        object
MyInteger       int8
MyFloat      float64
MyBool          bool
MyExit         int64
dtype: object

Pandas Read CSV convert values

Sometime the data in the CSV file represents something else or we might want to change the meaning of the data.

For example in some cases 0 represents False and 1 represents True. If the CSV file contains 0 and 1 values in a column Pandas will automatically represent them as integers. We can convert them to False and True values respectively.

In another case we might have exit-codes in a column where 0 means success and any other number means failure. We might want to simplify that column and represent success by True and failure by False. (Yes, we loose the details of the failure, but maybe we are not interested in the details.)

This latter is what we can see in our example.

import pandas as pd
import numpy as np

df = pd.read_csv('mixed.csv', converters = { 'MyExit' : lambda x : x == '0' })
print( df.dtypes )
print( df )
MyText        object
MyInteger      int64
MyFloat      float64
MyBool          bool
MyExit          bool
dtype: object
  MyText  MyInteger  MyFloat  MyBool  MyExit
0    Joe         12      3.4    True    True
1   Jane          3      4.0   False   False
2   Mary          7      2.3   False   False

Pandas split multivalue column into separate columns

{% embed include file="src/examples/pandas/multivalue.csv)

import pandas as pd

df = pd.read_csv('multivalue.csv')
print(df)
print()

values = ["Apple", "Banana", "Peach", "Melon"]
for value in values:
    df[value] = df.apply(lambda row: pd.notnull(row['Fruits']) and value in row['Fruits'].split(','), axis=1)

print( df )
  MyText              Fruits
0    Joe               Apple
1   Jane        Apple,Banana
2   Mary  Banana,Peach,Melon
3    Bob                 NaN
4   Zane               Melon

  MyText              Fruits  Apple  Banana  Peach  Melon
0    Joe               Apple   True   False  False  False
1   Jane        Apple,Banana   True    True  False  False
2   Mary  Banana,Peach,Melon  False    True   True   True
3    Bob                 NaN  False   False  False  False
4   Zane               Melon  False   False  False   True

Pandas split multivalue column into separate columns - get_dummies

  • get_dummies
import pandas as pd

dfin = pd.read_csv('multivalue.csv')
print(dfin)
print()
fruits_dummies = dfin['Fruits'].str.get_dummies(',').astype(bool)
df = pd.concat([dfin, fruits_dummies], axis=1)
print(df)

  MyText              Fruits
0    Joe               Apple
1   Jane        Apple,Banana
2   Mary  Banana,Peach,Melon
3    Bob                 NaN
4   Zane               Melon

  MyText              Fruits  Apple  Banana  Melon  Peach
0    Joe               Apple   True   False  False  False
1   Jane        Apple,Banana   True    True  False  False
2   Mary  Banana,Peach,Melon  False    True   True   True
3    Bob                 NaN  False   False  False  False
4   Zane               Melon  False   False   True  False

Planets - Show first and last rows

  • head

  • tail

  • head will show the first few rows (defaults to 5)

  • tail will show the last few rows (defaults to 5)

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)
print(df.head())

print()
print(df.tail(2))
  Planet name  Distance (AU)     Mass
0     Mercury           0.40  0.05500
1       Venus           0.70  0.81500
2       Earth           1.00  1.00000
3        Mars           1.50  0.10700
4       Ceres           2.77  0.00015

   Planet name  Distance (AU)      Mass
9        Pluto           39.0  0.002180
10      Charon           39.0  0.000254

Planets DataFrame select columns

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

mass = df['Mass'] # Get Series
print(type(mass))
print(mass)
print()

one_col = df[['Mass']] # Get DataFrame
print(type(one_col))
print(one_col)
print()

two_cols = df[['Planet name', 'Mass']] # Select multiple columns, get DataFrame
print(type(two_cols))
print(two_cols)
print()

<class 'pandas.core.series.Series'>
0       0.055000
1       0.815000
2       1.000000
3       0.107000
4       0.000150
5     318.000000
6      95.000000
7      14.000000
8      17.000000
9       0.002180
10      0.000254
Name: Mass, dtype: float64

<class 'pandas.core.frame.DataFrame'>
          Mass
0     0.055000
1     0.815000
2     1.000000
3     0.107000
4     0.000150
5   318.000000
6    95.000000
7    14.000000
8    17.000000
9     0.002180
10    0.000254

<class 'pandas.core.frame.DataFrame'>
   Planet name        Mass
0      Mercury    0.055000
1        Venus    0.815000
2        Earth    1.000000
3         Mars    0.107000
4        Ceres    0.000150
5      Jupiter  318.000000
6       Saturn   95.000000
7       Uranus   14.000000
8      Neptune   17.000000
9        Pluto    0.002180
10      Charon    0.000254

Planets DataFrame select rows

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

rows = df[2:5] # using row numbers
print(type(rows))
print(rows)
print()

iloc_rows = df.iloc[2:5] # just like plain []
print(type(iloc_rows))
print(iloc_rows)
print()

picked_rows = df.iloc[[2,5,3]] # using specific row numbers
print(type(picked_rows))
print(picked_rows)
print()
# df[[2,5,3]] would not work so we need iloc

<class 'pandas.core.frame.DataFrame'>
  Planet name  Distance (AU)     Mass
2       Earth           1.00  1.00000
3        Mars           1.50  0.10700
4       Ceres           2.77  0.00015

<class 'pandas.core.frame.DataFrame'>
  Planet name  Distance (AU)     Mass
2       Earth           1.00  1.00000
3        Mars           1.50  0.10700
4       Ceres           2.77  0.00015

<class 'pandas.core.frame.DataFrame'>
  Planet name  Distance (AU)     Mass
2       Earth            1.0    1.000
5     Jupiter            5.2  318.000
3        Mars            1.5    0.107

Planets DataFrame select rows and columns

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

small = df[2:5][['Distance (AU)', 'Planet name']]
print(type(small))
print(small)
print()

other = df.iloc[3:6][['Distance (AU)', 'Planet name']]
print(type(other))
print(other)
print()

<class 'pandas.core.frame.DataFrame'>
   Distance (AU) Planet name
2           1.00       Earth
3           1.50        Mars
4           2.77       Ceres

<class 'pandas.core.frame.DataFrame'>
   Distance (AU) Planet name
3           1.50        Mars
4           2.77       Ceres
5           5.20     Jupiter

DataFrame filter rows by size

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)


big_ones_selector = df['Mass'] > 1
print(big_ones_selector)
print()

big_ones = df[big_ones_selector]
print(big_ones)
0     False
1     False
2     False
3     False
4     False
5      True
6      True
7      True
8      True
9     False
10    False
Name: Mass, dtype: bool

  Planet name  Distance (AU)   Mass
5     Jupiter            5.2  318.0
6      Saturn            9.5   95.0
7      Uranus           19.6   14.0
8     Neptune           30.0   17.0

DataFrame filter rows by value

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)


planet_selector = (df['Planet name'] == 'Earth')
print(planet_selector)
print()

planets = df[ planet_selector ]
print(planets)

Output:

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: Planet name, dtype: bool

  Planet name  Distance (AU)  Mass
2       Earth            1.0   1.0

DataFrame filter rows by value in list

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)


planet_selector = df['Planet name'].isin(['Earth', 'Mars'])
print(planet_selector)
print()

planets = df[ planet_selector ]
print(planets)

Output:

0     False
1     False
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: Planet name, dtype: bool

  Planet name  Distance (AU)   Mass
2       Earth            1.0  1.000
3        Mars            1.5  0.107

DataFrame filter elementwise boolean and

import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

selector = (df['Mass'] > 1) & (df['Mass'] < 100)
print(selector)
print()

planets = df[ selector ]
print(planets)

Output:

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
Name: Mass, dtype: bool

  Planet name  Distance (AU)  Mass
6      Saturn            9.5  95.0
7      Uranus           19.6  14.0
8     Neptune           30.0  17.0

DataFrame sort (sort_values)

  • sort_values
import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

sorted_df = df.sort_values('Planet name', ascending=True)
print(sorted_df)
# df remains unchanged

Output:

   Planet name  Distance (AU)        Mass
4        Ceres           2.77    0.000150
10      Charon          39.00    0.000254
2        Earth           1.00    1.000000
5      Jupiter           5.20  318.000000
3         Mars           1.50    0.107000
0      Mercury           0.40    0.055000
8      Neptune          30.00   17.000000
9        Pluto          39.00    0.002180
6       Saturn           9.50   95.000000
7       Uranus          19.60   14.000000
1        Venus           0.70    0.815000

DataFrame loc vs. iloc

  • loc by values (here we start from the row where the index column == 3
  • iloc by index (here we start from the 3rd row)
import sys
import pandas as pd

filename = "planets.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)
print(df)
print()

print(df.loc[3:6])   # by values (here we start from the row where the index column == 3
print()

print(df.iloc[3:6])  # by index (here we start from the 3rd row)
print()

sorted_df = df.sort_values('Planet name', ascending=True)
print(sorted_df)
print()

print(sorted_df.loc[3:6])
print()

print(sorted_df.iloc[3:6])
print()

print('-------')

print(sorted_df.loc[2:4])
print()

print(sorted_df.iloc[2:4])
print()


Output:

   Planet name  Distance (AU)        Mass
0      Mercury           0.40    0.055000
1        Venus           0.70    0.815000
2        Earth           1.00    1.000000
3         Mars           1.50    0.107000
4        Ceres           2.77    0.000150
5      Jupiter           5.20  318.000000
6       Saturn           9.50   95.000000
7       Uranus          19.60   14.000000
8      Neptune          30.00   17.000000
9        Pluto          39.00    0.002180
10      Charon          39.00    0.000254

  Planet name  Distance (AU)       Mass
3        Mars           1.50    0.10700
4       Ceres           2.77    0.00015
5     Jupiter           5.20  318.00000
6      Saturn           9.50   95.00000

  Planet name  Distance (AU)       Mass
3        Mars           1.50    0.10700
4       Ceres           2.77    0.00015
5     Jupiter           5.20  318.00000

   Planet name  Distance (AU)        Mass
4        Ceres           2.77    0.000150
10      Charon          39.00    0.000254
2        Earth           1.00    1.000000
5      Jupiter           5.20  318.000000
3         Mars           1.50    0.107000
0      Mercury           0.40    0.055000
8      Neptune          30.00   17.000000
9        Pluto          39.00    0.002180
6       Saturn           9.50   95.000000
7       Uranus          19.60   14.000000
1        Venus           0.70    0.815000

  Planet name  Distance (AU)      Mass
3        Mars            1.5   0.10700
0     Mercury            0.4   0.05500
8     Neptune           30.0  17.00000
9       Pluto           39.0   0.00218
6      Saturn            9.5  95.00000

  Planet name  Distance (AU)     Mass
5     Jupiter            5.2  318.000
3        Mars            1.5    0.107
0     Mercury            0.4    0.055

-------
Empty DataFrame
Columns: [Planet name, Distance (AU), Mass]
Index: []

  Planet name  Distance (AU)   Mass
2       Earth            1.0    1.0
5     Jupiter            5.2  318.0

Pandas Planets - Add calculated column, remove / delete column (drop)

import pandas as pd

df = pd.read_csv('planets.csv')

print(df.head())
print()

df['dm'] = df['Distance (AU)'] * df['Mass']
print(df.head())
print()

df.drop(columns = 'Mass', inplace=True)
print(df.head())

Output:

  Planet name  Distance (AU)     Mass
0     Mercury           0.40  0.05500
1       Venus           0.70  0.81500
2       Earth           1.00  1.00000
3        Mars           1.50  0.10700
4       Ceres           2.77  0.00015

  Planet name  Distance (AU)     Mass        dm
0     Mercury           0.40  0.05500  0.022000
1       Venus           0.70  0.81500  0.570500
2       Earth           1.00  1.00000  1.000000
3        Mars           1.50  0.10700  0.160500
4       Ceres           2.77  0.00015  0.000415

  Planet name  Distance (AU)        dm
0     Mercury           0.40  0.022000
1       Venus           0.70  0.570500
2       Earth           1.00  1.000000
3        Mars           1.50  0.160500
4       Ceres           2.77  0.000415

Pandas Planets - calculate

import sys
import pandas as pd

filename = "planets.csv"
center = 'Earth'
if len(sys.argv) > 1:
    center   = sys.argv[1]
if len(sys.argv) > 2:
    filename = sys.argv[2]

df = pd.read_csv(filename)

this = df[ df['Planet name'] == center ]
print(this)
print('----')

mass = this.iloc[0]['Mass']
print(mass)
print('----')

dist = this.iloc[0]['Distance (AU)']
print(dist)
print('----')


# gravitational force is F = G * (mass1*mass2) / distance**2
G = 6
distance = abs(dist - df['Distance (AU)'])
print(distance)
print('----')

forces = df.copy()
print(forces)
print('----')


F = G * (df['Mass'] * mass) / distance**2
print(F)
print('----')

forces['F'] = F
forces.drop(columns = 'Mass', inplace=True)
forces.drop(columns = 'Distance (AU)', inplace=True)
print(forces)
print('----')

Output:

  Planet name  Distance (AU)  Mass
2       Earth            1.0   1.0
----
1.0
----
1.0
----
0      0.60
1      0.30
2      0.00
3      0.50
4      1.77
5      4.20
6      8.50
7     18.60
8     29.00
9     38.00
10    38.00
Name: Distance (AU), dtype: float64
----
   Planet name  Distance (AU)        Mass
0      Mercury           0.40    0.055000
1        Venus           0.70    0.815000
2        Earth           1.00    1.000000
3         Mars           1.50    0.107000
4        Ceres           2.77    0.000150
5      Jupiter           5.20  318.000000
6       Saturn           9.50   95.000000
7       Uranus          19.60   14.000000
8      Neptune          30.00   17.000000
9        Pluto          39.00    0.002180
10      Charon          39.00    0.000254
----
0       0.916667
1      54.333333
2            inf
3       2.568000
4       0.000287
5     108.163265
6       7.889273
7       0.242803
8       0.121284
9       0.000009
10      0.000001
dtype: float64
----
   Planet name           F
0      Mercury    0.916667
1        Venus   54.333333
2        Earth         inf
3         Mars    2.568000
4        Ceres    0.000287
5      Jupiter  108.163265
6       Saturn    7.889273
7       Uranus    0.242803
8      Neptune    0.121284
9        Pluto    0.000009
10      Charon    0.000001
----

Pandas read CSV set index column

import pandas as pd

df = pd.read_csv('planets.csv', index_col='Planet name')
print(df.head(2))

Output:

             Distance (AU)   Mass
Planet name                      
Mercury                0.4  0.055
Venus                  0.7  0.815

Count values

import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

country_count = df['Country'].value_counts()
print(country_count)
print(type(country_count)) # pandas.core.series.Series

# We can use it either as a dictionary or as a list
print(country_count['United States']) # 12469
print(country_count[0]) # 12469

Select top items

  • StackOverflow - biggest countries (in terms of number of responses)
import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

country_count = df['Country'].value_counts()

N = 20
# Take the top N countries
first = country_count.head(N)
print(first)
print(type(first)) # Series

# first = country_count.iloc[0:N] # part of the Series
# print(first)
# type(first) # Series

# first = country_count[0:N]
# print(first)
# type(first) # Series

# Select rows of the "biggest" countries
print(first.keys())

Pandas Show histogram

import sys
import pandas as pd
import matplotlib.pyplot as plt

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

df = pd.read_csv(filename)

country_count = df['Country'].value_counts()

N = 20
# Take the top N countries
first = country_count.head(N)

# Historgram of the top 20 countries
hist = first.hist(bins = N)

plt.show(block=hist)
#plt.savefig('so_histogram.png')

{% embed include file="src/examples/pandas/so_histogram.png)

Pandas read selected columns

import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]


df = pd.read_csv(filename, usecols=['Country', 'OpenSourcer', 'CompTotal'])
print(df.head())

Pandas read file in chunks

import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

size = 10000

for df_chunk in pd.read_csv(filename, chunksize=size):
    print(df_chunk.head())

Pandas read selected rows in chunks

import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

# Load only data from a specific country.

country_name = 'Israel'
chunks = []

for chunk in pd.read_csv(filename, chunksize=10000):
    print(chunk.size)
    part = chunk[ chunk['Country'] == country_name ]
    print(part.size)
    print('--')
    chunks.append(part)

df = pd.concat(chunks)

print(df.count())
print(df.size)
import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]

# Load only data from a specific country.

country_name = 'Israel'
df = None
for chunk in pd.read_csv(filename, chunksize=10000):
    part = chunk[ chunk['Country'] == country_name ]
    if df is None:
        df = part.copy(deep = True)
    else:
        df = df.append(part.copy(deep = True), ignore_index = True)


print(df.count())
print(df.size)

Combine columns to create a new column

  • apply

{% embed include file="src/examples/pandas/data.csv)

import pandas as pd

filename = 'data.csv'
df = pd.read_csv(filename)

print(df)
print()


def combine_two_columns(row):
    return row['lname'] + '_' + row['fname']

df['combined'] = df.apply(combine_two_columns, axis=1)
print(df)
print()


def combine_more_columns(row):
    columns = ['lname', 'age', 'fname']
    return '_'.join(map(lambda name: str(row[name]), columns))

df['combined'] = df.apply(combine_more_columns, axis=1)
print(df)


Output:

  fname      lname  age
0   Foo        Bar  100
1  Alma     Matter   78
2  Buzz  Lightyear   23

  fname      lname  age        combined
0   Foo        Bar  100         Bar_Foo
1  Alma     Matter   78     Matter_Alma
2  Buzz  Lightyear   23  Lightyear_Buzz

  fname      lname  age           combined
0   Foo        Bar  100        Bar_100_Foo
1  Alma     Matter   78     Matter_78_Alma
2  Buzz  Lightyear   23  Lightyear_23_Buzz

Add a new column (conditional)

  • apply

{% embed include file="src/examples/pandas/temprature.csv)

import sys
import pandas as pd

filename = 'temprature.csv'
df = pd.read_csv(filename)
print(df)
print()

def covid_instructions(row):
    return "stay home" if row['temp'] >= 38 else "go to work"

df['covid_instructions'] = df.apply(covid_instructions, axis=1)

print(df)

Output:

   name  temp
0  Ella  36.6
1  Maor  40.0
2  Oren  38.2
3   Gal  37.0

   name  temp covid_instructions
0  Ella  36.6         go to work
1  Maor  40.0          stay home
2  Oren  38.2          stay home
3   Gal  37.0         go to work

Pandas read Excel

  • read_excel
import sys
import pandas as pd

if len(sys.argv) < 2:
    exit("Need filename")
filename = sys.argv[1]

df = pd.read_excel(filename)
print(df)

# to_excel()

Create Excel file for experiment with random data

Input is an Excel file with the following columns:

genome name, c1, c2, c3, c4, c5, c6
  • c1-c3 are numbers of cond1
  • c4-c6 are numbers of cond2

We would like to filter to the lines that fulfill the following equations:

log2(avg(1-3) / avg(4-6)) > limit
other_limit > p.value( )
import numpy as np
import pandas as pd
import datetime
import sys

if len(sys.argv) < 2:
    exit("Need number of rows")

rows_num = int(sys.argv[1])
cols_num = 6

start_time = datetime.datetime.now()
matrix = np.random.rand(rows_num, cols_num)
#print(matrix)

genome_names = list(map(lambda i: f'g{i}', range(rows_num)))
column_names = list(map(lambda i: f'm{i}', range(cols_num)))

df = pd.DataFrame(matrix, index=genome_names, columns=column_names)
df.index.name = 'genome name'

print(df.head())


end_generate_time = datetime.datetime.now()
print(end_generate_time - start_time)

df.to_excel('raw_data.xlsx')

end_save_time = datetime.datetime.now()
print(end_save_time - end_generate_time)

Output:

                   m0        m1        m2        m3        m4        m5
genome name                                                            
g0           0.775167  0.120102  0.813921  0.284670  0.074309  0.978062
g1           0.449572  0.556647  0.851609  0.711773  0.052198  0.543029
g2           0.592324  0.350038  0.273521  0.248995  0.773113  0.998779
0:00:00.007476
0:00:00.140074

Calculate Genome metrics

import pandas as pd
import numpy as np
import datetime
import sys

filename = 'raw_data.xlsx'
if len(sys.argv) == 2:
    filename = sys.argv[1]

def calculate_averages(row):
    v1 = row.iloc[0:3].mean()
    v2 = row.iloc[3:6].mean()
    return np.log2(v1/v2)

start_time = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
load_time = datetime.datetime.now()
print(load_time - start_time)

print(df.head())

calculated_value = df.apply(calculate_averages, axis=1)

threshold = 0.2
filtered_df = df[calculated_value > threshold]

print(filtered_df.head())

calculate_time = datetime.datetime.now()
print(calculate_time - load_time)


Exercise: Pandas

  • Take the Stack Overflow survey and the report created by SO.

  • Distribution of responses among countries.

  • Relation of Open Source contribution to experience.

  • Open Source contribution by country.

  • Look at the pdf file and create similar reports for a specific country

  • Pick a dataset from Kaggle and try to analyze that.

TODO: Stack Overflow example

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

file = './developer_survey_2020/survey_results_public.csv'

df = pd.read_csv(file, sep = ',')

totals = []
countries = []
for c in np.unique(list(df['Country'])):

    totals.append(len(df[df['Country'] == c]))
    countries.append(c)

plt.figure()
plt.bar(np.arange(0, len(countries)), totals)
plt.xticks(np.arange(0, len(countries)), countries)



#%%

age_totals = []
ages = []
for c in np.unique(list(df['Age1stCode'])):

    age_totals.append(len(df[df['Age1stCode'] == c]))
    ages.append(c)

plt.figure()
plt.scatter(ages, age_totals)
plt.xticks(fontsize = 5)




import pandas

df=pandas.read_csv("survey_results_public.csv")

countrey_dist=df['Country'].value_counts()

open_sourcers_dist=df['OpenSourcer'].value_counts()

print("top 10 response countries:\n",countrey_dist.head(10))

open_sourcers_dist_top=df['OpenSourcer'][df['Country']=='United States'].value_counts()
experience_dist_top=df['YearsCode'][df['Country']=='United States'].value_counts()

print('distribution of open sourcer is the top country USA is :\n' ,open_sourcers_dist_top)
print('distribution of  experience in the top country USA is:\n' ,experience_dist_top)


df=df[['OpenSourcer','YearsCode']][df['Country']=='United States']
df_agg=df.groupby('OpenSourcer').agg('YearsCode')

print("relationship betwen OpenSourcer to coding exprience in US is :\n",df_agg)

import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]
country_name = 'Israel'
chunks = []
dev_chunks=[]
for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
    part = chunk[chunk['Country'] == country_name]


    print(chunk.size)
    print(part.size)
    print('--')
    chunks.append(part)


df = pd.concat(chunks)
print(df.dtypes)
for value in ['Academic researcher','Data or business analyst', 'Data scientist or machine learning specialist','Database administrator','Designer', 'Developer, back-end',
              'Developer, desktop or enterprise applications','Developer, embedded applications or devices','Developer, front-end','Developer, full-stack','Developer, game or graphics', 'Developer, mobile','Developer, QA or test',
              'DevOps specialist','Educator','Engineer, data', 'Engineer, site reliability','Engineering manager', 'Marketing or sales professional', 'Product manager', 'Scientist',
              'Senior Executive (C-Suite, VP, etc.)', 'System administrator']:
    print(value)
    df[value]= df.apply(lambda row: value in row['DevType'], axis=1)

print(df.count())
print(df.size)


import sys
import pandas as pd

filename = "survey_results_public.csv"
if len(sys.argv) == 2:
    filename = sys.argv[1]
country_name = 'Israel'
chunks = []
dev_chunks=[]
for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
    part = chunk[chunk['Country'] == country_name]

#df = pd.read_csv(filename, usecols=['Country','DevType'])
#,chunksize=10000):
#for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
#    part = chunk[chunk['Country'] == country_name]
#
#
#    print(chunk.size)
#    print(part.size)
#    print('--')
    chunks.append(part)
#
#
df = pd.concat(chunks)
print(df.dtypes)
for value in ['Academic researcher','Data or business analyst', 'Data scientist or machine learning specialist','Database administrator','Designer', 'Developer, back-end',
              'Developer, desktop or enterprise applications','Developer, embedded applications or devices','Developer, front-end','Developer, full-stack','Developer, game or graphics', 'Developer, mobile','Developer, QA or test',
              'DevOps specialist','Educator','Engineer, data', 'Engineer, site reliability','Engineering manager', 'Marketing or sales professional', 'Product manager', 'Scientist',
              'Senior Executive (C-Suite, VP, etc.)', 'System administrator']:
#for value in ['Academic researcher','Data or business analyst', 'Designer']:
    print(value)
    #df[value]= df.apply(lambda row: 1, axis=1)
    #df[value]= df.apply(lambda row: value in str(row['DevType']), axis=1)
    df[value]= df.apply(lambda row: pd.notnull(row['DevType']) and value in row['DevType'], axis=1)

print(df.count())
print(df.size)
print(df)


import pandas as pd
import matplotlib.pyplot as plt

filepath=r'survey_results_public.csv'

df=pd.read_csv(filepath)
print("The dataframe columns are:\n",list(df.columns))
print('-'*30)
#Let's check what kind of dtypes is in each column,
#if stats can be extracted - print it
for i in range(len(df.dtypes.index)):
    print(df.dtypes.index[i] , 'is of type ', df.dtypes[i])
    if df.dtypes[i]=='float64':
        print('*'*10,"\nAnd it's statistics:")
        print(df[df.dtypes.index[i]].describe())

#who is the most responsive country?
most_responsive_country=df['Country'].value_counts().index[0]
#now let's check what is the average working time per week for the most responsive country
most_responsive_country_df=df[df['Country']==most_responsive_country]
average_working_time_weekly=most_responsive_country_df['WorkWeekHrs'].mean()

#back to the original df, see what is the study fields distribution
#and then plotting it as a bar chart
study_fields_normalized=df['UndergradMajor'].value_counts(normalize=True)
fig,ax=plt.subplots()
ax.barh(list(study_fields_normalized.index),list(study_fields_normalized*100))
ax.set_xlabel("Relative Distribution")
fig.show()

Other slides

Pandas Stocks

import pandas
import pandas_datareader.data as web
all_data = { ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

print(all_data.keys())          # dict_keys(['MSFT', 'IBM', 'AAPL', 'GOOG'])
print(all_data['MSFT'].keys())  # Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

price = pandas.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})

print(price.head())

volume = pandas.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})

print(volume.tail())

returns = price.pct_change()  # change in percentage
print(returns.head())

# correlation
print(returns.MSFT.corr(returns.IBM))   # 0.49532932971
print(returns.MSFT.corr(returns.AAPL))  # 0.389551383559

# covariance
print(returns.MSFT.cov(returns.IBM))    # 8.50115754064e-05
print(returns.MSFT.cov(returns.AAPL))   # 9.15254855961e-05

Pandas Stocks

import pandas
prices = pandas.read_csv('stock_prices.csv')
print(prices)

Merge Dataframes

import pandas as pd
import numpy as np

# s = pd.Series([1,3,5,np.nan,6,8])
# dates = pd.date_range('20130101', periods=6)
# x = pd.date_range('20130101', periods=6, freq='3D')
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABC'))
# df2 = pd.DataFrame({ 'A' : 1.,
#                      'B' : pd.Timestamp('20130102'),
#                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
#                      'D' : np.array([3] * 4,dtype='int32'),
#                      'E' : pd.Categorical(["test","train","test","train"]),
#                      'F' : 'foo' })
a = pd.DataFrame({ 'A' : ['Joe', 'Jane', 'Foo', 'Bar'], 'B' : [1, 23, 12, 5] })
b = pd.DataFrame({ 'A' : ['Joe', 'Jane', 'Foo', 'Bar'], 'B' : [7, 10, 27, 1 ] })
#c = pd.DataFrame({ 'A' : ['Jane', 'Joe', 'Foo', 'Bar'], 'B' : [10, 7, 27, 1 ] })
c = b.sort_values(by = 'A')
print(a)
print(b)
print(c)
print('---')
#print(a+b)
x = pd.merge(a, b, on='A')
z = pd.DataFrame({ 'A' : x.A, 'B' : x.B_x + x.B_y })
print(z)



#sa = a.sort_values(by = 'A')
#sc = c.sort_values(by = 'A')
print('-----')
#print(sa)
#print(sc)
y = pd.merge(a, c, on='A')
#print(x)
q = pd.DataFrame({ 'A' : y.A, 'B' : y.B_x + y.B_y })
print(z)

Analyze Alerts

import pandas
alerts = pandas.read_csv('../../data/alerts.csv')
print(alerts.head())
#print(alerts.count())

Analyze IFMetrics

import pandas
data = pandas.read_csv('../../data/ifmetrics.csv', na_values=['(null)'])
data.fillna(0, inplace=True)
# , parse_dates=True )
# print(type(data))  # pandas.core.frame.DataFrame
print(data.columns)  # Index([ ... ],  dtype='object', length=135)

#print(data['Utilization In - Threshold Exception Rate'].head(3))

for col in ['Utilization In - Threshold Exception Rate', 'Overall Exception Rate']:
    dt = data[col]
    print(dt[dt != 0])


#print(data.head(1))
#print(data.get_values())

Calculate Genome metrics - add columns

import pandas as pd
import numpy as np
import datetime
import sys

if len(sys.argv) < 2:
    exit("Need filename")
filename = sys.argv[1]


def calculate_averages(row):
    v1 = row.iloc[0:3].mean()
    v2 = row.iloc[3:6].mean()
    return np.log2(v1/v2)

start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)

# create a new column of the calculated value
df['calculated_value'] = df.apply(calculate_averages, axis=1)
print(datetime.datetime.now() - start)

threshold = 0.2
filtered_df = df[df['calculated_value'] > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)

Calculate Genome metrics - vectorized

import pandas as pd
import numpy as np
import datetime
import sys

if len(sys.argv) < 2:
    exit("Need filename")
filename = sys.argv[1]

def calculate_averages(df):
    v1 = df.iloc[:, 0:3].mean(axis=1)  # axis=1 -> calculate the mean row-wise
    v2 = df.iloc[:, 3:6].mean(axis=1)
    return np.log2(v1/v2)

start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)

calculated_value = calculate_averages(df)
print(datetime.datetime.now() - start)

threshold = 0.2
filtered_df = df[calculated_value > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)

Calculate Genome metrics - vectorized numpy

import pandas as pd
import numpy as np
import datetime
import sys

if len(sys.argv) < 2:
    exit("Need filename")
filename = sys.argv[1]

def calculate_averages(df_numpy):
    v1 = df_numpy[:, 0:3].mean(axis=1)
    v2 = df_numpy[:, 3:6].mean(axis=1)
    return np.log2(v1/v2)

start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)

# the .values attribute changes from Pandas to numpy array
# (no more iloc, no headers, no index)
calculated_value = calculate_averages(df.values)
print(datetime.datetime.now() - start)

threshold = 0.2
filtered_df = df[calculated_value > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)

Pandas more

df.iloc[:, 4:10].sum(axis=1)

# rearrange order of columns
cols = list(df.columns)
df = df[ cols[0:4], cols[-1], cols[4:20] ]

to_csv('file.csv', index=False)

read_csv(filename, delimiter='\t')
to_csv(filename, sep='\t')


# after filtering out some rows:
df = df.reset_index()
df.reset_index(drop=True, inplace=True)


filter with
df.loc[ ~df['Name'].str.contains('substring') ]

can also have regex=True parameter

# replace values
df[ df['Name'] == 'old', 'Name' ] = 'new'

Pandas Series

  • Series
  • values
  • index
  • RangeIndex
import pandas

s = pandas.Series([1, 1, 2, 3, 5, 8])
print(s)

# 0    1
# 1    1
# 2    2
# 3    3
# 4    5
# 5    8
# dtype: int64

print(s.values)  # [1 1 2 3 5 8]
print(s.index)   # RangeIndex(start=0, stop=6, step=1)

print('----')
print(s.sum())      # 20
print(s.count())    # 6
print(s.mean())     # 3.33333333333
print(s.median())   # 2.5
print(s.std())      # 2.73252020426
print(s.cumsum())

# 0     1
# 1     2
# 2     4
# 3     7
# 4    12
# 5    20
# dtype: int64

Pandas Series with names

import pandas

planets       = ['Mercury', 'Venus', 'Earth', 'Mars']
distances_raw = [    0.4  ,   0.7  ,       1,  1.5  ]
masses_raw    = [    0.055,   0.815,       1,  0.107]

distance = pandas.Series(distances_raw, index = planets)
mass     = pandas.Series(masses_raw,    index = planets)

print(distance)

# Mercury     0.40
# Venus       0.70
# Earth       1.00
# Mars        1.50
# dtype: float64


print(distance.index)
# Index(['Mercury', 'Venus', 'Earth', 'Mars'], dtype='object')

print(distance[distance < 0.8])
# Mercury    0.4
# Venus      0.7
# dtype: float64


print('------')
print(distance/mass)
# Mercury         7.272727
# Venus           0.858896
# Earth           1.000000
# Mars           14.018692
# dtype: float64