Science with Python
SciPy - for Scientific Computing in Python
Data Science tools in Python
-
SciPy ecosystem of open-source software for mathematics, science, and engineering.
-
BioPython tools for biological computation.
-
NumPy to handle N-dimensional arrays.
-
Pandas Python Data Analysis Library. (Data Frames)
-
Matplotlib a 2D plotting library.
-
Seaborn data visualization library based on matplotlib.
-
Bokeh interactive visualization library.
-
SciKit-Learn Machine Learning in Python. (name from: SciPy Toolkit)
-
TensorFlow Machine learning framework. (primarily for deep-learning) (developed by Google engineer)
-
Keras Machine learning framework (primarily for deep-learning, neural-network). (On top of Tensorflow and now part of it.)
-
PyTorch Machine learnng framework. (primarily for deep-learning) (from Facebook)
-
Orange Machine learning and data visualization tool. Written partially in Python.
-
Airflow Workflow management platform (from AirBnB)
-
Luigi Data pipelines (from Spotify)
-
Showing speed improvement using a GPU with CUDA and Python with numpy on Nvidia Quadro 2000D
-
Octave (Open Source Matlab replacement - not related to Python)
-
There are a lot more libraries for each one of these tasks. Shop around!
Data Analysis resources
- Exploratory data analysis by John Tukey
- Think Bayes - Bayesian Statistics Made Simple
- Statistical Signal Extraction and Filtering: Structual Time Series Models
- Panel Data
For Econometrics
- Econometric Analysis
- Microeconometric Modeling and Discrete Choice Analysis with Cross Section and Panel Data
For Intro Stats
Datasets
Python and Biology
Biopython
Biopython background
- Sequence formats:
- FASTA
- FASTQ
- EMBL European Molecular Biology Laboratory
- Gene names symbols
Bio python sequences
from Bio.Seq import Seq
# Nucleotide Sequences
my_dna = Seq("AGTACACTGGTAGGCCTTACAG_T")
print(my_dna) # AGTACACTGGTAGGCCTTACAG_T
print(my_dna.complement()) # TCATGTGACCATCCGGAATGTC_A
print(my_dna.reverse_complement()) # A_CTGTAAGGCCTACCAGTGTACT
print(my_dna.transcribe()) # AGUACACUGGUAGGCCUUACAG_U
my_rna = Seq("GAC_U")
print(my_rna) # GAC_U
print(my_rna.reverse_complement()) # A_GUC
print(my_rna.reverse_complement()) # A_GUC
print(my_rna.transcribe()) # GAC_U
from Bio.Seq import Seq
what_is_this = Seq("AGTC_U")
what_is_this.complement() # ValueError: Mixed RNA/DNA found
Download and Read FASTA, GenBank files
For example the data about Orchids in two formats:
- ls_orchid.fasta in FASTA format
- ls_orchid.gbk in GenBank format
Download those files and use them:
from Bio import SeqIO
import requests
def get_file(url, filename):
res = requests.get(url)
if res.status_code != 200:
raise Exception("Could not get file")
with open(filename, 'w') as fh:
fh.write(res.text)
def process_file(filename, file_type):
for seq_record in SeqIO.parse(filename, file_type):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))
fasta_url = 'https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta'
filename = "ls_orchid.fasta"
file_type = "fasta"
get_file(fasta_url, filename)
process_file(filename, file_type)
genbank_url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk"
filename = "ls_orchid.gbk"
file_type = "genbank"
get_file(genbank_url, filename)
process_file(filename, file_type)
Search NCBI and Download FASTA and GenBank files
Use the NCBI (National Center for Biotechnology Information) database to search manually for nucleotide or tons of other types of data. Then one can download the files manually from the web site.
Search nucleotids
You can also search the same database programmatically.
from Bio import Entrez
Entrez.email = "gabor@szabgab.com"
term = "Cypripedioideae[Orgn] AND matK[Gene]"
handle = Entrez.esearch(db="nucleotide", term=term, idtype="acc", retmax=30)
record = Entrez.read(handle)
print(record["Count"]) # 538
print(record["IdList"]) # ['MK792700.1', 'MK792699.1', 'MK792698.1', ..., 'MK792681.1']
print(len(record["IdList"])) # 30
handle.close()
# term = "Orchid"
# 530077
# ['NZ_SELD00000000.2', 'NZ_SELD02000072.1',
Download nucleotids
from Bio import Entrez, SeqIO
Entrez.email = "gabor@szabgab.com"
#doc_id = 'MK792700.1'
doc_id = "EU490707"
# rettype="fasta"
handle = Entrez.efetch(db="nucleotide", id=doc_id, rettype="gb", retmode="text")
data = handle.read()
handle.close()
#print(data)
filename = "temp.data"
with open(filename, 'w') as fh:
fh.write(data)
file_type = "genbank"
for seq_record in SeqIO.parse(filename, file_type):
print(seq_record.id)
print(repr(seq_record.seq)) # A short part of the sequence
print()
print(seq_record.seq) # The full sequence
print()
print(len(seq_record.seq))
print()
print(seq_record.name)
print()
print(seq_record.annotations)
#print()
#print(dir(seq_record))
Exercise: Nucleotid
- Search for your favorite nucleotid
- Print out the number of results
- Download the 3 different sequences from the list (using the id) in GeneBank format and save them in files using the id as the name of the file and .gb as the extension
- Write a separate script that reads and displays the sequences.
Exercise: Search tool for NCBI
Write a program that can search NCBI and download and list sequences.
ncbi.py --term "SEARCH TERM" --download N
this will search the NCBI databasefor nucleotides using the the search term and download N results. It will also save the search term, the current date, the total number of items found, and the list of Ids downloaded in a JSON file along with the annotation from each ID".- repeated use of the above command will save the new searches in the same JSON file.
ncbi.py --history
list all the previously executed search term along with the date it was used, the total number of items found and number of items downloaded. (This information is taken from the JSON file.)ncbi.py --list
list some information about all the downloaded sequences. (e.g. the organism and taxonomy) along with the ID of each file.- Optional:
ncbi.py --aa ID
print the distribution of Amino Acids in the sequences in the given ID"
Biology background
- Genetics - inheritance
- Genetic inheritance
- What's a genome Chp2 1
- What's a genome Chp4 1
- alleles, genotype, phenotype
Chemistry
Chemistry links
-
Open Babel The Open Source Chemistry Toolbox
-
Chemical table file to describe molecules and chemical reactions.
-
Pytim Interfacial Analysis of Molecular Simulations
-
Awesome Python Chemistry (article)
-
Awesome Python Chemistry (list on GitHub)
import sdf
import pybel
Python-based Simulations of Chemistry Framework
Bond length
- Bond length
- Distance between two points Pythagorean theorem
- Video
- XYZ fileformat to specify the molecule geometry.
Covalent radius
Python energy landscape explorer
Other chemistry links
- Periodic table
- Diatomic molecule
- VMD - Visual Molecular Dynamics and application to visualize molecules.
numpy
What is NumPy
-
array
-
shape
-
dtype
-
High-level mathematical functions to operate on large, multi-dimensional arrays and matrices. ndarray
Install Numpy
pip install numpy
Numpy - vector
- array
- ndim
- dtype
- shape
import numpy as np
a = np.array([3, 4, 7])
print(a) # [3 4 7]
print(a * 3) # [ 9 12 21]
print(a + 4) # [ 7 8 11]
print(a.dtype) # int64
print(a.ndim) # 1
print(a.shape) # (3,)
b = np.array([2, 3.14, -1])
print(b.dtype) # float64
print(b.shape) # (3,)
c = np.array(['one', 'two', 'three'])
print(c.dtype) # <U5 (Unicode less than 5 characters)
NumPy 2D arrays
import numpy as np
a = np.array([
[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6]
])
print(a)
# [[1 2 3 4 5]
# [2 3 4 5 6]]
print(a.shape) # (2, 5)
print(a.ndim) # 2
print(a * 3)
# [[ 3 6 9 12 15]
# [ 6 9 12 15 18]]
print(a + 7)
# [[ 8 9 10 11 12]
# [ 9 10 11 12 13]]
Numpy - set type
- dtype
- int8
import numpy as np
a = np.array([3, 4, 7], dtype='int8')
print(a) # [3 4 7]
print(a * 3) # [ 9 12 21]
print(a + 4) # [ 7 8 11]
print(a.dtype) # int8
NumPy arrays: ones and zeros
- ones
- zeros
import numpy as np
c = np.ones(4, dtype='int32')
print(c) # [1 1 1 1]
print(c.dtype) # int32
print(c.shape) # (4,)
print()
d = np.zeros(3, dtype='float32')
print(d) # [ 0. 0. 0.]
print(d.dtype) # float32
print(d.shape) # (3,)
print()
a = np.ones([2, 3])
print(a)
# [[1., 1., 1.],
# [1., 1., 1.]]
print(a.dtype) # float64
print(a.shape) # (2, 3)
Numpy: eye
- eye
import numpy as np
a = np.eye(4)
print(a)
print()
b = np.eye(3, 5)
print(b)
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]]
NumPy array random
- random
- default_rng
import numpy as np
a = np.random.random((2, 5)) # in the range [0.0, 1.0)
print(a)
print()
rng = np.random.default_rng()
b = rng.random(size=(3, 4))
print(b)
[[0.32151126 0.07688622 0.95666894 0.42396291 0.93592235]
[0.71406863 0.95152079 0.20199695 0.72628099 0.33545885]]
[[0.46643834 0.71350899 0.40279583 0.85148985]
[0.19367868 0.53288449 0.97181597 0.86311691]
[0.70687485 0.78534671 0.16654183 0.9371896 ]]
NumPy Random integers
- random
- default_rng
- randint
- integer
import numpy as np
a = np.random.randint(10, size=(3, 4))
print(a)
rng = np.random.default_rng()
b = rng.integers(42, size=(3, 4))
print(b)
[[1 2 2 6]
[2 2 9 8]
[8 8 9 5]]
[[13 31 7 11]
[22 2 6 18]
[24 10 12 0]]
NumPy array type change by division (int to float)
import numpy as np
a = np.array([3, 4, 7])
print(a.dtype) # int64
print(a.shape) # (3,)
x = (a / 2)
print(x) # [ 1.5 2. 3.5]
print(x.dtype) # float64
print(x.shape) # (3,)
Numpy: Array methods: transpose
- transpose
import numpy
a = numpy.array([
[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6]
])
b = a.transpose()
print(b)
# [[1 2]
# [2 3]
# [3 4]
# [4 5]
# [5 6]]
print(a)
# [[1 2 3 4 5]
# [2 3 4 5 6]]
Numpy: reference, not copy
import numpy
a = numpy.array([
[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6]
])
b = a.transpose()
a[0][0] = 42
print(b)
# [[42 2]
# [2 3]
# [3 4]
# [4 5]
# [5 6]]
print(a)
# [[42 2 3 4 5]
# [2 3 4 5 6]]
Numpy: copy array
- copy
import numpy
a = numpy.array([
[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6]
])
b = a.copy().transpose()
a[0][0] = 42
print(b)
# [[1 2]
# [2 3]
# [3 4]
# [4 5]
# [5 6]]
print(a)
# [[42 2 3 4 5]
# [2 3 4 5 6]]
Numpy: Elementwise Operations on Arrays
import numpy as np
a = np.array([
[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6]
])
b = np.array([
[ 7, 3, 8, 9, 4],
[ 1, 3, 6, 1, 2]
])
print(a+b)
# [[ 8 5 11 13 9]
# [ 3 6 10 6 8]]
print(a*b)
# [[ 7 6 24 36 20]
# [ 2 9 24 5 12]]
Numpy: multiply, matmul, dot for vectors
-
multiply
-
matmul
-
dot
-
multiply Multiply arguments element-wise.
-
matmul Matrix product of two arrays.
-
dot Dot product of two arrays.
import numpy as np
a = np.array([3, 4, 7])
b = np.array([6, 5, 2])
print(a) # [3 4 7]
print(b) # [6 5 2]
c = np.multiply(a, b)
print(c) # [18 20 14]
d = np.dot(a, b)
print(d) # 52
m = np.matmul(a, b)
print(m) # 52
Numpy: multiply, matmul, dot for vector and matrix
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 4])
print(a)
print(b)
print()
print(a*b)
print(b*a)
print()
print(np.multiply(a, b))
print()
print( np.dot(a, b) )
print( np.matmul(a, b) )
[[1 2 3]
[4 5 6]]
[1 2 4]
[[ 1 4 12]
[ 4 10 24]]
[[ 1 4 12]
[ 4 10 24]]
[[ 1 4 12]
[ 4 10 24]]
[17 38]
[17 38]
Numpy: multiply, matmul, dot for matrices
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 3, 4], [7, 8, 0]])
print(a)
print(b)
print()
print(a*b)
print(b*a)
print()
print(np.multiply(a, b))
print()
print( np.dot(a, b.transpose()) )
print( np.matmul(a, b.transpose()) )
print()
print( np.dot(a.transpose(), b) )
print( np.matmul(a.transpose(), b) )
[[1 2 3]
[4 5 6]]
[[1 3 4]
[7 8 0]]
[[ 1 6 12]
[28 40 0]]
[[ 1 6 12]
[28 40 0]]
[[ 1 6 12]
[28 40 0]]
[[19 23]
[43 68]]
[[19 23]
[43 68]]
[[29 35 4]
[37 46 8]
[45 57 12]]
[[29 35 4]
[37 46 8]
[45 57 12]]
Numpy: casting - converting from strings to integer
- astype
import numpy as np
a = np.array([
[ "12", "23", "3", "4"],
[ "2", "3", "4", "5"]
])
print(a)
#[['12' '23' '3' '4']
# ['2' '3' '4' '5']]
try:
b = a + 1
except Exception as e:
print(e)
# TypeError: ufunc 'add' did not contain a loop with
# signature matching types dtype('<U3') dtype('<U3') dtype('<U3')
c = a.astype(np.int) + 1
print(c)
# [[13 24 4 5]
# [ 3 4 5 6]]
Numpy: indexing 1d array
import numpy as np
a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a) # [ 1 1 2 3 5 8 13 21 34]
print(a[4]) # 5
print(a[2:5]) # [2 3 5]
Numpy: slice is a reference
The slice in numpy does not copy the data structure
import numpy as np
a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a) # [ 1 1 2 3 5 8 13 21 34]
b = a[2:5]
print(b) # [2 3 5]
a[2] = 20
print(a) # [ 1 1 20 3 5 8 13 21 34]
print(b) # [20 3 5]
Numpy: slice - copy
import numpy as np
a = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34])
print(a) # [ 1 1 2 3 5 8 13 21 34]
b = a[2:5].copy()
print(b) # [2 3 5]
a[2] = 20
print(a) # [ 1 1 20 3 5 8 13 21 34]
print(b) # [2 3 5]
Numpy: abs value on a Numpy array
- abs
- absolute
import numpy as np
a = np.array([[-1, 2, -3], [-4, 5, -7]])
print(a)
print(a.dtype)
print()
abs_a = np.absolute(a)
print(abs_a)
print(abs_a.dtype)
print()
print(a)
[[-1 2 -3]
[-4 5 -7]]
int64
[[1 2 3]
[4 5 7]]
int64
[[-1 2 -3]
[-4 5 -7]]
Numpy: Logical not on a Numpy array
- logical_not
- not
- True
- False
- bool
import numpy as np
a = np.array([True, True, False])
print(a.dtype)
print(a)
print()
not_a = np.logical_not(a)
print(not_a.dtype)
print(not_a)
print()
b = np.array([True, True, False, 0, 42])
print(b.dtype)
print(b)
print()
not_b = np.logical_not(b)
print(not_b.dtype)
print(not_b)
print()
bool
[ True True False]
bool
[False False True]
int64
[ 1 1 0 0 42]
bool
[False False True True False]
Numpy: Vectorize a function
- vectorize
import numpy as np
def fibo(n):
if n == 1 or n == 2:
return 1
a, b = 1, 1
for _ in range(n-2):
a, b = b, a + b
return b
print(fibo(12))
print(fibo.__class__.__name__)
vfibo = np.vectorize(fibo)
a = np.array([
[1, 2, 3, 4, 5, 6],
[7, 8, 9, 10, 11, 12],
])
print(vfibo.__class__.__name__)
print(a)
print(a.dtype)
print()
b = vfibo(a)
print(b)
print(b.dtype)
144
function
vectorize
[[ 1 2 3 4 5 6]
[ 7 8 9 10 11 12]]
int64
[[ 1 1 2 3 5 8]
[ 13 21 34 55 89 144]]
int64
Numpy: Vectorize len
- vectorize
- len
import numpy as np
animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)
vlen = np.vectorize(len)
print(vlen(animals))
['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[ 3 8 5 5 14]
Numpy: Vectorize lambda
- vectorize
- lambda
import numpy as np
animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)
longer_than_5 = np.vectorize(lambda x: len(x) > 5)
long_animals_bool = longer_than_5(animals)
print(long_animals_bool)
['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[False True False False True]
Numpy: Filtering array (selecting some of the values from an array)
- vectorize
import numpy as np
animals = np.array(['Cow', 'Elephant', 'Snake', 'Camel', 'Praying Mantis'])
print(animals)
longer_than_5 = np.vectorize(lambda x: len(x) > 5)
long_animals_bool = longer_than_5(animals)
print(long_animals_bool)
long_animals = animals[long_animals_bool]
print(long_animals)
['Cow' 'Elephant' 'Snake' 'Camel' 'Praying Mantis']
[False True False False True]
['Elephant' 'Praying Mantis']
Numpy: Filter matrix values
import numpy as np
import re
scores = np.array([
[23, 37, 18, 97, 13, 40],
[10, 15, 20, 30, 39, 50],
[99, 20, 83, 42, 19, 31],
[19, 11, 55, 78, 39, 27]
])
print(scores)
print()
high_scores_boolean = (scores > 20)
print(high_scores_boolean)
print()
high_scores = scores[high_scores_boolean]
print(high_scores)
[[23 37 18 97 13 40]
[10 15 20 30 39 50]
[99 20 83 42 19 31]
[19 11 55 78 39 27]]
[[ True True False True False True]
[False False False True True True]
[ True False True True False True]
[False False True True True True]]
[23 37 97 40 30 39 50 99 83 42 31 55 78 39 27]
Numpy: Filter matrix rows (selecting some rows)
import numpy as np
names = np.array(['Mary', 'Bar', 'Joe', 'Jane'])
print(names)
print()
def has_ar(text):
return "ar" in text
# if "ar" in text:
# return True
# else:
# return False
names_with_ar_selector = np.vectorize(has_ar)
names_with_ar_bool = names_with_ar_selector(names)
print(names_with_ar_bool)
print()
scores = np.array([
[23, 37, 18, 97, 13, 40],
[10, 15, 20, 30, 39, 50],
[99, 20, 83, 42, 19, 31],
[19, 11, 55, 78, 39, 27]
])
print(scores[names_with_ar_bool])
['Mary' 'Bar' 'Joe' 'Jane']
[ True True False False]
[[23 37 18 97 13 40]
[10 15 20 30 39 50]]
[[23 37 18 97 13 40]
[10 15 20 30 39 50]]
Numpy: Some statistics (sum, mean, std, var)
import numpy as np
scores = np.array([23, 37, 18, 97, 13, 40])
print(scores.sum()) # 228
print(len(scores)) # 6
print(scores.mean()) # 38.0
print(scores.std()) # 28.0950766743 standard deviation
print(scores.var()) # 789.333333333 variance
print(np.median(scores)) # 30.0
print(scores.max()) # 97
print(scores.min()) # 13
print(scores.cumsum()) # [ 23 60 78 175 188 228]
Numpy: Serialization (saving an array to a file)
import numpy as np
scores = np.array([
[23, 37, 18, 97, 13, 40],
[10, 15, 20, 30, 39, 50],
[99, 20, 83, 42, 19, 31],
[19, 11, 55, 78, 39, 27]
])
filename = 'scores.npy'
np.save(filename, scores)
s = np.load(filename)
print(s)
Numpy: Load from Matlab file to a Numpy array
import scipy.io
file_path = 'data.mat'
mat = scipy.io.loadmat(file_path)
data = mat['data']
print(type(data))
print(data)
numpy.ndarray
Numpy: Save a Numpy array as a Matlab file
import scipy.io
import numpy as np
data = np.random.random((2, 5))
print(data)
file_path = 'data.mat'
scipy.io.savemat(file_path, {'data': data})
Numpy: Horizontal stack vectors (hstack)
- hstack
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
print(a)
print(b)
print(c)
print()
d = np.hstack([a, b])
print(d)
print()
e = np.hstack([d, c])
print(e)
[1 2 3]
[4 5 6]
[7 8 9]
[1 2 3 4 5 6]
[1 2 3 4 5 6 7 8 9]
Numpy: Append or vertically stack vectors and matrices (vstack)
- vstack
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
print(a)
print(b)
print(c)
print()
m = np.vstack([a, b])
print(m)
print()
d3 = np.vstack([m, c])
print(d3)
[1 2 3]
[4 5 6]
[7 8 9]
[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]
[7 8 9]]
Numpy uint8
- uint8
import numpy as np
a = np.array([127], 'uint8')
print(a.dtype) # uint8
print(a) # [127]
a[0] += 1 # [128]
print(a)
a[0] -= 1 # [127]
print(a)
a[0] = 255
print(a) # [255]
a[0] += 1
print(a) # [0]
Numpy int8
- int8
import numpy as np
a = np.array([127], 'int8')
print(a.dtype) # int8
print(a) # [127]
a[0] += 1 # [-128]
print(a)
a[0] -= 1 # [127]
print(a)
a[0] = 255
print(a) # [-1]
a[0] += 1
print(a) # [0]
More Numpy
import numpy as np
matrix = np.array([
[2, 3, 17],
[27, 1, 10],
])
print(matrix.shape)
print(matrix.sum(axis = 1))
print(matrix.sum(axis = 0))
Pandas
Pandas
- Pandas Python Data Analysis Library
- Handle data sequences
- A Beginner's Guide to Optimizing Pandas Code for Speed
Datasets
- Planets
- StackOverflow survey
Planets data
{% embed include file="src/examples/pandas/planets.csv)
StackOverflow Survey data
- Download one of the datasets from the Stack Overflow survey
- unzip the file. Feel free to remove the
__MACOSX/
directory if it is still there.
Planets - Read CSV into Dataframes
- read_csv
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
print(df)
print()
print(type(df))
print(df.__class__.__name__)
Planet name Distance (AU) Mass
0 Mercury 0.40 0.055000
1 Venus 0.70 0.815000
2 Earth 1.00 1.000000
3 Mars 1.50 0.107000
4 Ceres 2.77 0.000150
5 Jupiter 5.20 318.000000
6 Saturn 9.50 95.000000
7 Uranus 19.60 14.000000
8 Neptune 30.00 17.000000
9 Pluto 39.00 0.002180
10 Charon 39.00 0.000254
<class 'pandas.core.frame.DataFrame'>
DataFrame
Planets - DataFrame Statistics
- columns
- dtypes
- index
- values
- describe
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
print(df.columns)
print()
print(df.dtypes)
print()
print(df.index)
print()
print(df.values)
print()
print(df.describe())
Index(['Planet name', 'Distance (AU)', 'Mass'], dtype='object')
Planet name object
Distance (AU) float64
Mass float64
dtype: object
RangeIndex(start=0, stop=11, step=1)
[['Mercury' 0.4 0.055]
['Venus' 0.7 0.815]
['Earth' 1.0 1.0]
['Mars' 1.5 0.107]
['Ceres' 2.77 0.00015]
['Jupiter' 5.2 318.0]
['Saturn' 9.5 95.0]
['Uranus' 19.6 14.0]
['Neptune' 30.0 17.0]
['Pluto' 39.0 0.00218]
['Charon' 39.0 0.000254]]
Distance (AU) Mass
count 11.000000 11.000000
mean 13.515455 40.543599
std 15.638671 96.205021
min 0.400000 0.000150
25% 1.250000 0.028590
50% 5.200000 0.815000
75% 24.800000 15.500000
max 39.000000 318.000000
Pandas Read CSV various datatypes
Example with a few more column-types (e.g. the column with the title MyBool contains True and False values and it is recognized as bool)
{% embed include file="src/examples/pandas/mixed.csv)
import pandas as pd
df = pd.read_csv('mixed.csv')
print( df.dtypes )
print( df.describe() )
MyText object
MyInteger int64
MyFloat float64
MyBool bool
MyExit int64
dtype: object
MyInteger MyFloat MyExit
count 3.000000 3.000000 3.000000
mean 7.333333 3.233333 1.333333
std 4.509250 0.862168 1.527525
min 3.000000 2.300000 0.000000
25% 5.000000 2.850000 0.500000
50% 7.000000 3.400000 1.000000
75% 9.500000 3.700000 2.000000
max 12.000000 4.000000 3.000000
Pandas Read CSV set dtype
When recognizing integers read_csv will default to int64, but if we would like to save memory we can set the dtype while we read the file.
import pandas as pd
import numpy as np
df = pd.read_csv('mixed.csv', dtype = { 'MyInteger' : np.int8 })
print( df.dtypes )
MyText object
MyInteger int8
MyFloat float64
MyBool bool
MyExit int64
dtype: object
Pandas Read CSV convert values
Sometime the data in the CSV file represents something else or we might want to change the meaning of the data.
For example in some cases 0 represents False and 1 represents True. If the CSV file contains 0 and 1 values in a column Pandas will automatically represent them as integers. We can convert them to False and True values respectively.
In another case we might have exit-codes in a column where 0 means success and any other number means failure. We might want to simplify that column and represent success by True and failure by False. (Yes, we loose the details of the failure, but maybe we are not interested in the details.)
This latter is what we can see in our example.
import pandas as pd
import numpy as np
df = pd.read_csv('mixed.csv', converters = { 'MyExit' : lambda x : x == '0' })
print( df.dtypes )
print( df )
MyText object
MyInteger int64
MyFloat float64
MyBool bool
MyExit bool
dtype: object
MyText MyInteger MyFloat MyBool MyExit
0 Joe 12 3.4 True True
1 Jane 3 4.0 False False
2 Mary 7 2.3 False False
Pandas split multivalue column into separate columns
{% embed include file="src/examples/pandas/multivalue.csv)
import pandas as pd
df = pd.read_csv('multivalue.csv')
print(df)
print()
values = ["Apple", "Banana", "Peach", "Melon"]
for value in values:
df[value] = df.apply(lambda row: pd.notnull(row['Fruits']) and value in row['Fruits'].split(','), axis=1)
print( df )
MyText Fruits
0 Joe Apple
1 Jane Apple,Banana
2 Mary Banana,Peach,Melon
3 Bob NaN
4 Zane Melon
MyText Fruits Apple Banana Peach Melon
0 Joe Apple True False False False
1 Jane Apple,Banana True True False False
2 Mary Banana,Peach,Melon False True True True
3 Bob NaN False False False False
4 Zane Melon False False False True
Pandas split multivalue column into separate columns - get_dummies
- get_dummies
import pandas as pd
dfin = pd.read_csv('multivalue.csv')
print(dfin)
print()
fruits_dummies = dfin['Fruits'].str.get_dummies(',').astype(bool)
df = pd.concat([dfin, fruits_dummies], axis=1)
print(df)
MyText Fruits
0 Joe Apple
1 Jane Apple,Banana
2 Mary Banana,Peach,Melon
3 Bob NaN
4 Zane Melon
MyText Fruits Apple Banana Melon Peach
0 Joe Apple True False False False
1 Jane Apple,Banana True True False False
2 Mary Banana,Peach,Melon False True True True
3 Bob NaN False False False False
4 Zane Melon False False True False
Planets - Show first and last rows
-
head
-
tail
-
head
will show the first few rows (defaults to 5) -
tail
will show the last few rows (defaults to 5)
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
print(df.head())
print()
print(df.tail(2))
Planet name Distance (AU) Mass
0 Mercury 0.40 0.05500
1 Venus 0.70 0.81500
2 Earth 1.00 1.00000
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
Planet name Distance (AU) Mass
9 Pluto 39.0 0.002180
10 Charon 39.0 0.000254
Planets DataFrame select columns
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
mass = df['Mass'] # Get Series
print(type(mass))
print(mass)
print()
one_col = df[['Mass']] # Get DataFrame
print(type(one_col))
print(one_col)
print()
two_cols = df[['Planet name', 'Mass']] # Select multiple columns, get DataFrame
print(type(two_cols))
print(two_cols)
print()
<class 'pandas.core.series.Series'>
0 0.055000
1 0.815000
2 1.000000
3 0.107000
4 0.000150
5 318.000000
6 95.000000
7 14.000000
8 17.000000
9 0.002180
10 0.000254
Name: Mass, dtype: float64
<class 'pandas.core.frame.DataFrame'>
Mass
0 0.055000
1 0.815000
2 1.000000
3 0.107000
4 0.000150
5 318.000000
6 95.000000
7 14.000000
8 17.000000
9 0.002180
10 0.000254
<class 'pandas.core.frame.DataFrame'>
Planet name Mass
0 Mercury 0.055000
1 Venus 0.815000
2 Earth 1.000000
3 Mars 0.107000
4 Ceres 0.000150
5 Jupiter 318.000000
6 Saturn 95.000000
7 Uranus 14.000000
8 Neptune 17.000000
9 Pluto 0.002180
10 Charon 0.000254
Planets DataFrame select rows
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
rows = df[2:5] # using row numbers
print(type(rows))
print(rows)
print()
iloc_rows = df.iloc[2:5] # just like plain []
print(type(iloc_rows))
print(iloc_rows)
print()
picked_rows = df.iloc[[2,5,3]] # using specific row numbers
print(type(picked_rows))
print(picked_rows)
print()
# df[[2,5,3]] would not work so we need iloc
<class 'pandas.core.frame.DataFrame'>
Planet name Distance (AU) Mass
2 Earth 1.00 1.00000
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
<class 'pandas.core.frame.DataFrame'>
Planet name Distance (AU) Mass
2 Earth 1.00 1.00000
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
<class 'pandas.core.frame.DataFrame'>
Planet name Distance (AU) Mass
2 Earth 1.0 1.000
5 Jupiter 5.2 318.000
3 Mars 1.5 0.107
Planets DataFrame select rows and columns
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
small = df[2:5][['Distance (AU)', 'Planet name']]
print(type(small))
print(small)
print()
other = df.iloc[3:6][['Distance (AU)', 'Planet name']]
print(type(other))
print(other)
print()
<class 'pandas.core.frame.DataFrame'>
Distance (AU) Planet name
2 1.00 Earth
3 1.50 Mars
4 2.77 Ceres
<class 'pandas.core.frame.DataFrame'>
Distance (AU) Planet name
3 1.50 Mars
4 2.77 Ceres
5 5.20 Jupiter
DataFrame filter rows by size
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
big_ones_selector = df['Mass'] > 1
print(big_ones_selector)
print()
big_ones = df[big_ones_selector]
print(big_ones)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 False
10 False
Name: Mass, dtype: bool
Planet name Distance (AU) Mass
5 Jupiter 5.2 318.0
6 Saturn 9.5 95.0
7 Uranus 19.6 14.0
8 Neptune 30.0 17.0
DataFrame filter rows by value
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
planet_selector = (df['Planet name'] == 'Earth')
print(planet_selector)
print()
planets = df[ planet_selector ]
print(planets)
Output:
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
Name: Planet name, dtype: bool
Planet name Distance (AU) Mass
2 Earth 1.0 1.0
DataFrame filter rows by value in list
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
planet_selector = df['Planet name'].isin(['Earth', 'Mars'])
print(planet_selector)
print()
planets = df[ planet_selector ]
print(planets)
Output:
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
Name: Planet name, dtype: bool
Planet name Distance (AU) Mass
2 Earth 1.0 1.000
3 Mars 1.5 0.107
DataFrame filter elementwise boolean and
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
selector = (df['Mass'] > 1) & (df['Mass'] < 100)
print(selector)
print()
planets = df[ selector ]
print(planets)
Output:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
Name: Mass, dtype: bool
Planet name Distance (AU) Mass
6 Saturn 9.5 95.0
7 Uranus 19.6 14.0
8 Neptune 30.0 17.0
DataFrame sort (sort_values)
- sort_values
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
sorted_df = df.sort_values('Planet name', ascending=True)
print(sorted_df)
# df remains unchanged
Output:
Planet name Distance (AU) Mass
4 Ceres 2.77 0.000150
10 Charon 39.00 0.000254
2 Earth 1.00 1.000000
5 Jupiter 5.20 318.000000
3 Mars 1.50 0.107000
0 Mercury 0.40 0.055000
8 Neptune 30.00 17.000000
9 Pluto 39.00 0.002180
6 Saturn 9.50 95.000000
7 Uranus 19.60 14.000000
1 Venus 0.70 0.815000
DataFrame loc vs. iloc
loc
by values (here we start from the row where the index column == 3iloc
by index (here we start from the 3rd row)
import sys
import pandas as pd
filename = "planets.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
print(df)
print()
print(df.loc[3:6]) # by values (here we start from the row where the index column == 3
print()
print(df.iloc[3:6]) # by index (here we start from the 3rd row)
print()
sorted_df = df.sort_values('Planet name', ascending=True)
print(sorted_df)
print()
print(sorted_df.loc[3:6])
print()
print(sorted_df.iloc[3:6])
print()
print('-------')
print(sorted_df.loc[2:4])
print()
print(sorted_df.iloc[2:4])
print()
Output:
Planet name Distance (AU) Mass
0 Mercury 0.40 0.055000
1 Venus 0.70 0.815000
2 Earth 1.00 1.000000
3 Mars 1.50 0.107000
4 Ceres 2.77 0.000150
5 Jupiter 5.20 318.000000
6 Saturn 9.50 95.000000
7 Uranus 19.60 14.000000
8 Neptune 30.00 17.000000
9 Pluto 39.00 0.002180
10 Charon 39.00 0.000254
Planet name Distance (AU) Mass
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
5 Jupiter 5.20 318.00000
6 Saturn 9.50 95.00000
Planet name Distance (AU) Mass
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
5 Jupiter 5.20 318.00000
Planet name Distance (AU) Mass
4 Ceres 2.77 0.000150
10 Charon 39.00 0.000254
2 Earth 1.00 1.000000
5 Jupiter 5.20 318.000000
3 Mars 1.50 0.107000
0 Mercury 0.40 0.055000
8 Neptune 30.00 17.000000
9 Pluto 39.00 0.002180
6 Saturn 9.50 95.000000
7 Uranus 19.60 14.000000
1 Venus 0.70 0.815000
Planet name Distance (AU) Mass
3 Mars 1.5 0.10700
0 Mercury 0.4 0.05500
8 Neptune 30.0 17.00000
9 Pluto 39.0 0.00218
6 Saturn 9.5 95.00000
Planet name Distance (AU) Mass
5 Jupiter 5.2 318.000
3 Mars 1.5 0.107
0 Mercury 0.4 0.055
-------
Empty DataFrame
Columns: [Planet name, Distance (AU), Mass]
Index: []
Planet name Distance (AU) Mass
2 Earth 1.0 1.0
5 Jupiter 5.2 318.0
Pandas Planets - Add calculated column, remove / delete column (drop)
import pandas as pd
df = pd.read_csv('planets.csv')
print(df.head())
print()
df['dm'] = df['Distance (AU)'] * df['Mass']
print(df.head())
print()
df.drop(columns = 'Mass', inplace=True)
print(df.head())
Output:
Planet name Distance (AU) Mass
0 Mercury 0.40 0.05500
1 Venus 0.70 0.81500
2 Earth 1.00 1.00000
3 Mars 1.50 0.10700
4 Ceres 2.77 0.00015
Planet name Distance (AU) Mass dm
0 Mercury 0.40 0.05500 0.022000
1 Venus 0.70 0.81500 0.570500
2 Earth 1.00 1.00000 1.000000
3 Mars 1.50 0.10700 0.160500
4 Ceres 2.77 0.00015 0.000415
Planet name Distance (AU) dm
0 Mercury 0.40 0.022000
1 Venus 0.70 0.570500
2 Earth 1.00 1.000000
3 Mars 1.50 0.160500
4 Ceres 2.77 0.000415
Pandas Planets - calculate
import sys
import pandas as pd
filename = "planets.csv"
center = 'Earth'
if len(sys.argv) > 1:
center = sys.argv[1]
if len(sys.argv) > 2:
filename = sys.argv[2]
df = pd.read_csv(filename)
this = df[ df['Planet name'] == center ]
print(this)
print('----')
mass = this.iloc[0]['Mass']
print(mass)
print('----')
dist = this.iloc[0]['Distance (AU)']
print(dist)
print('----')
# gravitational force is F = G * (mass1*mass2) / distance**2
G = 6
distance = abs(dist - df['Distance (AU)'])
print(distance)
print('----')
forces = df.copy()
print(forces)
print('----')
F = G * (df['Mass'] * mass) / distance**2
print(F)
print('----')
forces['F'] = F
forces.drop(columns = 'Mass', inplace=True)
forces.drop(columns = 'Distance (AU)', inplace=True)
print(forces)
print('----')
Output:
Planet name Distance (AU) Mass
2 Earth 1.0 1.0
----
1.0
----
1.0
----
0 0.60
1 0.30
2 0.00
3 0.50
4 1.77
5 4.20
6 8.50
7 18.60
8 29.00
9 38.00
10 38.00
Name: Distance (AU), dtype: float64
----
Planet name Distance (AU) Mass
0 Mercury 0.40 0.055000
1 Venus 0.70 0.815000
2 Earth 1.00 1.000000
3 Mars 1.50 0.107000
4 Ceres 2.77 0.000150
5 Jupiter 5.20 318.000000
6 Saturn 9.50 95.000000
7 Uranus 19.60 14.000000
8 Neptune 30.00 17.000000
9 Pluto 39.00 0.002180
10 Charon 39.00 0.000254
----
0 0.916667
1 54.333333
2 inf
3 2.568000
4 0.000287
5 108.163265
6 7.889273
7 0.242803
8 0.121284
9 0.000009
10 0.000001
dtype: float64
----
Planet name F
0 Mercury 0.916667
1 Venus 54.333333
2 Earth inf
3 Mars 2.568000
4 Ceres 0.000287
5 Jupiter 108.163265
6 Saturn 7.889273
7 Uranus 0.242803
8 Neptune 0.121284
9 Pluto 0.000009
10 Charon 0.000001
----
Pandas read CSV set index column
import pandas as pd
df = pd.read_csv('planets.csv', index_col='Planet name')
print(df.head(2))
Output:
Distance (AU) Mass
Planet name
Mercury 0.4 0.055
Venus 0.7 0.815
Count values
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
country_count = df['Country'].value_counts()
print(country_count)
print(type(country_count)) # pandas.core.series.Series
# We can use it either as a dictionary or as a list
print(country_count['United States']) # 12469
print(country_count[0]) # 12469
Select top items
- StackOverflow - biggest countries (in terms of number of responses)
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
country_count = df['Country'].value_counts()
N = 20
# Take the top N countries
first = country_count.head(N)
print(first)
print(type(first)) # Series
# first = country_count.iloc[0:N] # part of the Series
# print(first)
# type(first) # Series
# first = country_count[0:N]
# print(first)
# type(first) # Series
# Select rows of the "biggest" countries
print(first.keys())
Pandas Show histogram
import sys
import pandas as pd
import matplotlib.pyplot as plt
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename)
country_count = df['Country'].value_counts()
N = 20
# Take the top N countries
first = country_count.head(N)
# Historgram of the top 20 countries
hist = first.hist(bins = N)
plt.show(block=hist)
#plt.savefig('so_histogram.png')
{% embed include file="src/examples/pandas/so_histogram.png)
Pandas read selected columns
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
df = pd.read_csv(filename, usecols=['Country', 'OpenSourcer', 'CompTotal'])
print(df.head())
Pandas read file in chunks
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
size = 10000
for df_chunk in pd.read_csv(filename, chunksize=size):
print(df_chunk.head())
Pandas read selected rows in chunks
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
# Load only data from a specific country.
country_name = 'Israel'
chunks = []
for chunk in pd.read_csv(filename, chunksize=10000):
print(chunk.size)
part = chunk[ chunk['Country'] == country_name ]
print(part.size)
print('--')
chunks.append(part)
df = pd.concat(chunks)
print(df.count())
print(df.size)
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
# Load only data from a specific country.
country_name = 'Israel'
df = None
for chunk in pd.read_csv(filename, chunksize=10000):
part = chunk[ chunk['Country'] == country_name ]
if df is None:
df = part.copy(deep = True)
else:
df = df.append(part.copy(deep = True), ignore_index = True)
print(df.count())
print(df.size)
Combine columns to create a new column
- apply
{% embed include file="src/examples/pandas/data.csv)
import pandas as pd
filename = 'data.csv'
df = pd.read_csv(filename)
print(df)
print()
def combine_two_columns(row):
return row['lname'] + '_' + row['fname']
df['combined'] = df.apply(combine_two_columns, axis=1)
print(df)
print()
def combine_more_columns(row):
columns = ['lname', 'age', 'fname']
return '_'.join(map(lambda name: str(row[name]), columns))
df['combined'] = df.apply(combine_more_columns, axis=1)
print(df)
Output:
fname lname age
0 Foo Bar 100
1 Alma Matter 78
2 Buzz Lightyear 23
fname lname age combined
0 Foo Bar 100 Bar_Foo
1 Alma Matter 78 Matter_Alma
2 Buzz Lightyear 23 Lightyear_Buzz
fname lname age combined
0 Foo Bar 100 Bar_100_Foo
1 Alma Matter 78 Matter_78_Alma
2 Buzz Lightyear 23 Lightyear_23_Buzz
Add a new column (conditional)
- apply
{% embed include file="src/examples/pandas/temprature.csv)
import sys
import pandas as pd
filename = 'temprature.csv'
df = pd.read_csv(filename)
print(df)
print()
def covid_instructions(row):
return "stay home" if row['temp'] >= 38 else "go to work"
df['covid_instructions'] = df.apply(covid_instructions, axis=1)
print(df)
Output:
name temp
0 Ella 36.6
1 Maor 40.0
2 Oren 38.2
3 Gal 37.0
name temp covid_instructions
0 Ella 36.6 go to work
1 Maor 40.0 stay home
2 Oren 38.2 stay home
3 Gal 37.0 go to work
Pandas read Excel
- read_excel
import sys
import pandas as pd
if len(sys.argv) < 2:
exit("Need filename")
filename = sys.argv[1]
df = pd.read_excel(filename)
print(df)
# to_excel()
Create Excel file for experiment with random data
Input is an Excel file with the following columns:
genome name, c1, c2, c3, c4, c5, c6
- c1-c3 are numbers of cond1
- c4-c6 are numbers of cond2
We would like to filter to the lines that fulfill the following equations:
log2(avg(1-3) / avg(4-6)) > limit
other_limit > p.value( )
import numpy as np
import pandas as pd
import datetime
import sys
if len(sys.argv) < 2:
exit("Need number of rows")
rows_num = int(sys.argv[1])
cols_num = 6
start_time = datetime.datetime.now()
matrix = np.random.rand(rows_num, cols_num)
#print(matrix)
genome_names = list(map(lambda i: f'g{i}', range(rows_num)))
column_names = list(map(lambda i: f'm{i}', range(cols_num)))
df = pd.DataFrame(matrix, index=genome_names, columns=column_names)
df.index.name = 'genome name'
print(df.head())
end_generate_time = datetime.datetime.now()
print(end_generate_time - start_time)
df.to_excel('raw_data.xlsx')
end_save_time = datetime.datetime.now()
print(end_save_time - end_generate_time)
Output:
m0 m1 m2 m3 m4 m5
genome name
g0 0.775167 0.120102 0.813921 0.284670 0.074309 0.978062
g1 0.449572 0.556647 0.851609 0.711773 0.052198 0.543029
g2 0.592324 0.350038 0.273521 0.248995 0.773113 0.998779
0:00:00.007476
0:00:00.140074
Calculate Genome metrics
import pandas as pd
import numpy as np
import datetime
import sys
filename = 'raw_data.xlsx'
if len(sys.argv) == 2:
filename = sys.argv[1]
def calculate_averages(row):
v1 = row.iloc[0:3].mean()
v2 = row.iloc[3:6].mean()
return np.log2(v1/v2)
start_time = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
load_time = datetime.datetime.now()
print(load_time - start_time)
print(df.head())
calculated_value = df.apply(calculate_averages, axis=1)
threshold = 0.2
filtered_df = df[calculated_value > threshold]
print(filtered_df.head())
calculate_time = datetime.datetime.now()
print(calculate_time - load_time)
Exercise: Pandas
-
Take the Stack Overflow survey and the report created by SO.
-
Distribution of responses among countries.
-
Relation of Open Source contribution to experience.
-
Open Source contribution by country.
-
Look at the pdf file and create similar reports for a specific country
-
Pick a dataset from Kaggle and try to analyze that.
TODO: Stack Overflow example
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
file = './developer_survey_2020/survey_results_public.csv'
df = pd.read_csv(file, sep = ',')
totals = []
countries = []
for c in np.unique(list(df['Country'])):
totals.append(len(df[df['Country'] == c]))
countries.append(c)
plt.figure()
plt.bar(np.arange(0, len(countries)), totals)
plt.xticks(np.arange(0, len(countries)), countries)
#%%
age_totals = []
ages = []
for c in np.unique(list(df['Age1stCode'])):
age_totals.append(len(df[df['Age1stCode'] == c]))
ages.append(c)
plt.figure()
plt.scatter(ages, age_totals)
plt.xticks(fontsize = 5)
import pandas
df=pandas.read_csv("survey_results_public.csv")
countrey_dist=df['Country'].value_counts()
open_sourcers_dist=df['OpenSourcer'].value_counts()
print("top 10 response countries:\n",countrey_dist.head(10))
open_sourcers_dist_top=df['OpenSourcer'][df['Country']=='United States'].value_counts()
experience_dist_top=df['YearsCode'][df['Country']=='United States'].value_counts()
print('distribution of open sourcer is the top country USA is :\n' ,open_sourcers_dist_top)
print('distribution of experience in the top country USA is:\n' ,experience_dist_top)
df=df[['OpenSourcer','YearsCode']][df['Country']=='United States']
df_agg=df.groupby('OpenSourcer').agg('YearsCode')
print("relationship betwen OpenSourcer to coding exprience in US is :\n",df_agg)
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
country_name = 'Israel'
chunks = []
dev_chunks=[]
for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
part = chunk[chunk['Country'] == country_name]
print(chunk.size)
print(part.size)
print('--')
chunks.append(part)
df = pd.concat(chunks)
print(df.dtypes)
for value in ['Academic researcher','Data or business analyst', 'Data scientist or machine learning specialist','Database administrator','Designer', 'Developer, back-end',
'Developer, desktop or enterprise applications','Developer, embedded applications or devices','Developer, front-end','Developer, full-stack','Developer, game or graphics', 'Developer, mobile','Developer, QA or test',
'DevOps specialist','Educator','Engineer, data', 'Engineer, site reliability','Engineering manager', 'Marketing or sales professional', 'Product manager', 'Scientist',
'Senior Executive (C-Suite, VP, etc.)', 'System administrator']:
print(value)
df[value]= df.apply(lambda row: value in row['DevType'], axis=1)
print(df.count())
print(df.size)
import sys
import pandas as pd
filename = "survey_results_public.csv"
if len(sys.argv) == 2:
filename = sys.argv[1]
country_name = 'Israel'
chunks = []
dev_chunks=[]
for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
part = chunk[chunk['Country'] == country_name]
#df = pd.read_csv(filename, usecols=['Country','DevType'])
#,chunksize=10000):
#for chunk in pd.read_csv(filename, usecols=['Country','DevType'],chunksize=10000):
# part = chunk[chunk['Country'] == country_name]
#
#
# print(chunk.size)
# print(part.size)
# print('--')
chunks.append(part)
#
#
df = pd.concat(chunks)
print(df.dtypes)
for value in ['Academic researcher','Data or business analyst', 'Data scientist or machine learning specialist','Database administrator','Designer', 'Developer, back-end',
'Developer, desktop or enterprise applications','Developer, embedded applications or devices','Developer, front-end','Developer, full-stack','Developer, game or graphics', 'Developer, mobile','Developer, QA or test',
'DevOps specialist','Educator','Engineer, data', 'Engineer, site reliability','Engineering manager', 'Marketing or sales professional', 'Product manager', 'Scientist',
'Senior Executive (C-Suite, VP, etc.)', 'System administrator']:
#for value in ['Academic researcher','Data or business analyst', 'Designer']:
print(value)
#df[value]= df.apply(lambda row: 1, axis=1)
#df[value]= df.apply(lambda row: value in str(row['DevType']), axis=1)
df[value]= df.apply(lambda row: pd.notnull(row['DevType']) and value in row['DevType'], axis=1)
print(df.count())
print(df.size)
print(df)
import pandas as pd
import matplotlib.pyplot as plt
filepath=r'survey_results_public.csv'
df=pd.read_csv(filepath)
print("The dataframe columns are:\n",list(df.columns))
print('-'*30)
#Let's check what kind of dtypes is in each column,
#if stats can be extracted - print it
for i in range(len(df.dtypes.index)):
print(df.dtypes.index[i] , 'is of type ', df.dtypes[i])
if df.dtypes[i]=='float64':
print('*'*10,"\nAnd it's statistics:")
print(df[df.dtypes.index[i]].describe())
#who is the most responsive country?
most_responsive_country=df['Country'].value_counts().index[0]
#now let's check what is the average working time per week for the most responsive country
most_responsive_country_df=df[df['Country']==most_responsive_country]
average_working_time_weekly=most_responsive_country_df['WorkWeekHrs'].mean()
#back to the original df, see what is the study fields distribution
#and then plotting it as a bar chart
study_fields_normalized=df['UndergradMajor'].value_counts(normalize=True)
fig,ax=plt.subplots()
ax.barh(list(study_fields_normalized.index),list(study_fields_normalized*100))
ax.set_xlabel("Relative Distribution")
fig.show()
Other slides
Pandas Stocks
import pandas
import pandas_datareader.data as web
all_data = { ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
print(all_data.keys()) # dict_keys(['MSFT', 'IBM', 'AAPL', 'GOOG'])
print(all_data['MSFT'].keys()) # Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')
price = pandas.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
print(price.head())
volume = pandas.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
print(volume.tail())
returns = price.pct_change() # change in percentage
print(returns.head())
# correlation
print(returns.MSFT.corr(returns.IBM)) # 0.49532932971
print(returns.MSFT.corr(returns.AAPL)) # 0.389551383559
# covariance
print(returns.MSFT.cov(returns.IBM)) # 8.50115754064e-05
print(returns.MSFT.cov(returns.AAPL)) # 9.15254855961e-05
Pandas Stocks
import pandas
prices = pandas.read_csv('stock_prices.csv')
print(prices)
Merge Dataframes
import pandas as pd
import numpy as np
# s = pd.Series([1,3,5,np.nan,6,8])
# dates = pd.date_range('20130101', periods=6)
# x = pd.date_range('20130101', periods=6, freq='3D')
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
# df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABC'))
# df2 = pd.DataFrame({ 'A' : 1.,
# 'B' : pd.Timestamp('20130102'),
# 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
# 'D' : np.array([3] * 4,dtype='int32'),
# 'E' : pd.Categorical(["test","train","test","train"]),
# 'F' : 'foo' })
a = pd.DataFrame({ 'A' : ['Joe', 'Jane', 'Foo', 'Bar'], 'B' : [1, 23, 12, 5] })
b = pd.DataFrame({ 'A' : ['Joe', 'Jane', 'Foo', 'Bar'], 'B' : [7, 10, 27, 1 ] })
#c = pd.DataFrame({ 'A' : ['Jane', 'Joe', 'Foo', 'Bar'], 'B' : [10, 7, 27, 1 ] })
c = b.sort_values(by = 'A')
print(a)
print(b)
print(c)
print('---')
#print(a+b)
x = pd.merge(a, b, on='A')
z = pd.DataFrame({ 'A' : x.A, 'B' : x.B_x + x.B_y })
print(z)
#sa = a.sort_values(by = 'A')
#sc = c.sort_values(by = 'A')
print('-----')
#print(sa)
#print(sc)
y = pd.merge(a, c, on='A')
#print(x)
q = pd.DataFrame({ 'A' : y.A, 'B' : y.B_x + y.B_y })
print(z)
Analyze Alerts
import pandas
alerts = pandas.read_csv('../../data/alerts.csv')
print(alerts.head())
#print(alerts.count())
Analyze IFMetrics
import pandas
data = pandas.read_csv('../../data/ifmetrics.csv', na_values=['(null)'])
data.fillna(0, inplace=True)
# , parse_dates=True )
# print(type(data)) # pandas.core.frame.DataFrame
print(data.columns) # Index([ ... ], dtype='object', length=135)
#print(data['Utilization In - Threshold Exception Rate'].head(3))
for col in ['Utilization In - Threshold Exception Rate', 'Overall Exception Rate']:
dt = data[col]
print(dt[dt != 0])
#print(data.head(1))
#print(data.get_values())
Calculate Genome metrics - add columns
import pandas as pd
import numpy as np
import datetime
import sys
if len(sys.argv) < 2:
exit("Need filename")
filename = sys.argv[1]
def calculate_averages(row):
v1 = row.iloc[0:3].mean()
v2 = row.iloc[3:6].mean()
return np.log2(v1/v2)
start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)
# create a new column of the calculated value
df['calculated_value'] = df.apply(calculate_averages, axis=1)
print(datetime.datetime.now() - start)
threshold = 0.2
filtered_df = df[df['calculated_value'] > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)
Calculate Genome metrics - vectorized
import pandas as pd
import numpy as np
import datetime
import sys
if len(sys.argv) < 2:
exit("Need filename")
filename = sys.argv[1]
def calculate_averages(df):
v1 = df.iloc[:, 0:3].mean(axis=1) # axis=1 -> calculate the mean row-wise
v2 = df.iloc[:, 3:6].mean(axis=1)
return np.log2(v1/v2)
start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)
calculated_value = calculate_averages(df)
print(datetime.datetime.now() - start)
threshold = 0.2
filtered_df = df[calculated_value > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)
Calculate Genome metrics - vectorized numpy
import pandas as pd
import numpy as np
import datetime
import sys
if len(sys.argv) < 2:
exit("Need filename")
filename = sys.argv[1]
def calculate_averages(df_numpy):
v1 = df_numpy[:, 0:3].mean(axis=1)
v2 = df_numpy[:, 3:6].mean(axis=1)
return np.log2(v1/v2)
start = datetime.datetime.now()
df = pd.read_excel(filename, index_col='genome name')
print(df.head())
print(datetime.datetime.now() - start)
# the .values attribute changes from Pandas to numpy array
# (no more iloc, no headers, no index)
calculated_value = calculate_averages(df.values)
print(datetime.datetime.now() - start)
threshold = 0.2
filtered_df = df[calculated_value > threshold]
print(filtered_df.head())
print(datetime.datetime.now() - start)
Pandas more
df.iloc[:, 4:10].sum(axis=1)
# rearrange order of columns
cols = list(df.columns)
df = df[ cols[0:4], cols[-1], cols[4:20] ]
to_csv('file.csv', index=False)
read_csv(filename, delimiter='\t')
to_csv(filename, sep='\t')
# after filtering out some rows:
df = df.reset_index()
df.reset_index(drop=True, inplace=True)
filter with
df.loc[ ~df['Name'].str.contains('substring') ]
can also have regex=True parameter
# replace values
df[ df['Name'] == 'old', 'Name' ] = 'new'
Pandas Series
- Series
- values
- index
- RangeIndex
import pandas
s = pandas.Series([1, 1, 2, 3, 5, 8])
print(s)
# 0 1
# 1 1
# 2 2
# 3 3
# 4 5
# 5 8
# dtype: int64
print(s.values) # [1 1 2 3 5 8]
print(s.index) # RangeIndex(start=0, stop=6, step=1)
print('----')
print(s.sum()) # 20
print(s.count()) # 6
print(s.mean()) # 3.33333333333
print(s.median()) # 2.5
print(s.std()) # 2.73252020426
print(s.cumsum())
# 0 1
# 1 2
# 2 4
# 3 7
# 4 12
# 5 20
# dtype: int64
Pandas Series with names
import pandas
planets = ['Mercury', 'Venus', 'Earth', 'Mars']
distances_raw = [ 0.4 , 0.7 , 1, 1.5 ]
masses_raw = [ 0.055, 0.815, 1, 0.107]
distance = pandas.Series(distances_raw, index = planets)
mass = pandas.Series(masses_raw, index = planets)
print(distance)
# Mercury 0.40
# Venus 0.70
# Earth 1.00
# Mars 1.50
# dtype: float64
print(distance.index)
# Index(['Mercury', 'Venus', 'Earth', 'Mars'], dtype='object')
print(distance[distance < 0.8])
# Mercury 0.4
# Venus 0.7
# dtype: float64
print('------')
print(distance/mass)
# Mercury 7.272727
# Venus 0.858896
# Earth 1.000000
# Mars 14.018692
# dtype: float64