In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from pydataset import data

The Pandas DataFrame Object

The pandas library is used to deal with structured data stored in tables. You might acquire the structured data from CSV files, TSV files, SQL database tables, or spreadsheets. You can also create pandas DataFrames from scratch. A DataFrame can be thought of as a dictionary-like container for Series. Below, I will go over ways to both create DataFrame objects and read in data to a pandas DataFrame.


What Is a Pandas DataFrame?

We saw that the pandas Series object is a one-dimensional, labeled array; the pandas DataFrame object is a two-dimensional, labeled data structure with columns of the same or different data types. A DataFrame is like a sequence of Series objects aligned by the same index. There are three main components that make up a pandas DataFrame: the index, the columns, and the data.

dataframe diagram


The Components of a Pandas DataFrame - Index, Columns, Data

  • I can access the components of a DataFrame just as easily as those of a Series.

DataFrame Components
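Here's a minimal sketch (mine, with a throwaway DataFrame, not data from this notebook) showing the three components:

tiny_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

tiny_df.index    # the row labels: RangeIndex(start=0, stop=2, step=1)
tiny_df.columns  # the column labels: Index(['a', 'b'], dtype='object')
tiny_df.values   # the data as a 2-D NumPy array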


Create a Pandas DataFrame:

  • There are multiple ways to create pandas DataFrame objects, and I will demonstrate some of these below.
  • If you want more, see the official doc on the pandas DataFrame here.

The pandas DataFrame constructor function defaults.

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
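For example, here's a quick sketch (not from the original data) building a DataFrame from a 2-D NumPy array with my own row and column labels, relying on the pd and np imports at the top of the notebook:

# Supply my own index and column labels for a 3x2 array.
arr = np.arange(6).reshape(3, 2)
pd.DataFrame(arr, index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])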

From a Dictionary

Does this look familiar?

In [3]:
shopping_cart = {
    "tax": .08,
    "items": [
        {
            "title": "orange juice",
            "price": 3.99,
            "quantity": 1
        },
        {
            "title": "rice",
            "price": 1.99,
            "quantity": 3
        },
        {
            "title": "beans",
            "price": 0.99,
            "quantity": 3
        },
        {
            "title": "chili sauce",
            "price": 2.99,
            "quantity": 1
        },
        {
            "title": "chocolate",
            "price": 0.75,
            "quantity": 9
        }
    ]
}
  • shopping_cart is a dictionary with two keys, tax and items, but the value for items happens to be a list of dictionaries, so depending on what I want from my data, I may need to unpack the items column. I can do that!
In [4]:
df = pd.DataFrame(shopping_cart)
df
Out[4]:
tax items
0 0.08 {'title': 'orange juice', 'price': 3.99, 'quantity': 1}
1 0.08 {'title': 'rice', 'price': 1.99, 'quantity': 3}
2 0.08 {'title': 'beans', 'price': 0.99, 'quantity': 3}
3 0.08 {'title': 'chili sauce', 'price': 2.99, 'quantity': 1}
4 0.08 {'title': 'chocolate', 'price': 0.75, 'quantity': 9}
  • I can unpack the values of the items column by applying the pd.Series constructor to the items column with .apply(). This returns a new DataFrame.
In [5]:
df['items'].apply(pd.Series)
Out[5]:
title price quantity
0 orange juice 3.99 1
1 rice 1.99 3
2 beans 0.99 3
3 chili sauce 2.99 1
4 chocolate 0.75 9
  • I can unpack the items column and combine the original and new DataFrames using the pd.concat() function if I want the tax column and new unpacked columns in the same DataFrame. I don't need the original items column anymore, so I can drop that column at the same time using the .drop() method.
Remember that I want to drop a column, so I need to set my axis=1, and I want to concatenate my DataFrames column-wise, so again, axis=1. I remember that axis=1 refers to columns because the '1' looks like a column!
# Default settings for the pandas concat function.

pd.concat([df1, df2], axis=0, join='outer')
In [6]:
# I can transform my two DataFrames and then concatenate or do it all at once like below.

big_df = pd.concat([df.drop(['items'], axis=1), df['items'].apply(pd.Series)], axis=1)
big_df
Out[6]:
tax title price quantity
0 0.08 orange juice 3.99 1
1 0.08 rice 1.99 3
2 0.08 beans 0.99 3
3 0.08 chili sauce 2.99 1
4 0.08 chocolate 0.75 9

From a List of Dictionaries

  • items IS already a list of dictionaries, so if I don't need tax, I can do this.
In [7]:
items = shopping_cart['items']
items
Out[7]:
[{'title': 'orange juice', 'price': 3.99, 'quantity': 1},
 {'title': 'rice', 'price': 1.99, 'quantity': 3},
 {'title': 'beans', 'price': 0.99, 'quantity': 3},
 {'title': 'chili sauce', 'price': 2.99, 'quantity': 1},
 {'title': 'chocolate', 'price': 0.75, 'quantity': 9}]
  • Now, I can pass my items variable as the data argument to pd.DataFrame.
In [8]:
cart_items = pd.DataFrame(items)
cart_items
Out[8]:
title price quantity
0 orange juice 3.99 1
1 rice 1.99 3
2 beans 0.99 3
3 chili sauce 2.99 1
4 chocolate 0.75 9
In [9]:
type(cart_items)
Out[9]:
pandas.core.frame.DataFrame

Labeled Index

In [10]:
fam = {'name':['Milla', 'Steve', 'Faith', 'Declan'], 
       'signs':['Virgo', 'Gemini', 'Aquarius', 'Aries'],
       'age': [14, 44, 34, 7]} 
fam
Out[10]:
{'name': ['Milla', 'Steve', 'Faith', 'Declan'],
 'signs': ['Virgo', 'Gemini', 'Aquarius', 'Aries'],
 'age': [14, 44, 34, 7]}
In [11]:
type(fam)
Out[11]:
dict
In [12]:
fam_df = pd.DataFrame(fam)
fam_df
Out[12]:
name signs age
0 Milla Virgo 14
1 Steve Gemini 44
2 Faith Aquarius 34
3 Declan Aries 7
  • I have the option to pass index labels when creating a DataFrame, too.
In [13]:
fam_df = pd.DataFrame(fam, index =['kane_1', 'kane_2', 'kane_3', 'kane_4'])
fam_df
Out[13]:
name signs age
kane_1 Milla Virgo 14
kane_2 Steve Gemini 44
kane_3 Faith Aquarius 34
kane_4 Declan Aries 7
In [14]:
fam_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, kane_1 to kane_4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   signs   4 non-null      object
 2   age     4 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes

Read Data Into a Pandas DataFrame

Create your DataFrame using a SQL query to access a database.

  • I can use the pandas pd.read_sql() function to read the results of a SQL query into a pandas DataFrame.
df = pd.read_sql(sql_query, connection_url)

To do this, I will need to import host, password, and user from my env file.

In [15]:
from env import host, password, user
  • I can create a handy function to build the connection_url argument for pd.read_sql(). It takes in a string for the name of the db, and the user, host, and password parameters have default arguments set to match the imports from my env file.
In [16]:
def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
  • I need to write a valid SQL query.
In [17]:
sql_query = 'SELECT * FROM employees'
  • Putting it all together, I will end up with something like this:
In [18]:
def get_employees_data(db):
    return pd.read_sql(sql_query, get_connection(db))
  • Call my function and assign my data to the variable employees.
employees = get_employees_data('employees')

Write your DataFrame to a CSV file

When I import a large dataset using a SQL query, it only takes about three or four rounds of restarting my kernel and waiting minutes for the query to run before I decide I need to stop and write my new DataFrame to a CSV file that I can access instantly.

I only need to run this code once to create a new CSV file in my current directory, and then I can read in my data as shown in the next section. Once I have my CSV file, I can comment out the code that queried the database and wrote the file.

employees.to_csv('employees_df.csv')
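Putting the pattern together, here's a hedged sketch of a get-or-cache helper; it assumes the sql_query and get_connection pieces defined above and a hypothetical file name:

import os

def get_employees_data_cached(db, csv_name='employees_df.csv'):
    # Read from the local CSV cache when it exists;
    # otherwise run the query once and write the cache.
    if os.path.exists(csv_name):
        return pd.read_csv(csv_name, index_col=0)
    df = pd.read_sql(sql_query, get_connection(db))
    df.to_csv(csv_name)
    return df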
In [19]:
# Create my DataFrame reading from my own CSV file; way faster now.

employees = pd.read_csv('employees_df.csv', index_col=0)
employees.head()
Out[19]:
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
1 10002 1964-06-02 Bezalel Simmel F 1985-11-21
2 10003 1959-12-03 Parto Bamford M 1986-08-28
3 10004 1954-05-01 Chirstian Koblick M 1986-12-01
4 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12

Create your DataFrame using data from a CSV file.

  • I can use the pandas pd.read_csv() function to read the data from a CSV file into a pandas DataFrame.
Tip: If you read in your data from a CSV file and find that you have an extra column called 'Unnamed: 0', adding index_col=0 as an argument will get rid of it.
# If my csv file is in the same directory as my notebook, I can do this.
df = pd.read_csv('file_name.csv', index_col=0)
# If my csv file is not in the same directory as my notebook, I have to include the file path.
df = pd.read_csv('file_path/file_name.csv', index_col=0)
In [20]:
student_df = pd.read_csv('data/interesting_data.csv')
student_df.head()
Out[20]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment
1 USA TN 2018 12 Male 18.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 USA CO 2018 12 Female 17.0 Ambidextrous 164cm 22cm 160cm ... 6.0 2.0 2.0 Some Graduate degree Rap/Hip hop Invisibility Happy Friend International aid
3 USA ID 2018 12 Male 23.0 Right-Handed 171.5 21.6 162.5 ... 2.0 0.0 8.0 A lot Graduate degree Gospel Freeze time Happy Relative Religious
4 USA NC 2018 12 Female 17.0 Right-Handed 136 26 54 ... 0.0 2.0 2.0 A lot Undergraduate degree Rap/Hip hop Freeze time Happy Relative International aid

5 rows × 60 columns


Create your DataFrame using PyDataset.

  • I have to import the following:
from pydataset import data
  • This code snippet will show me the information doc on the dataset.
data(df_string_name, show_doc=True)
  • This code snippet will load the dataset for use as a pandas DataFrame.
df = data(df_string_name)
  • There are 757 available datasets using pydataset. Running the following code snippet in a cell will return a DataFrame with all of your options.
data()
Warning: When you import data from pydataset, do not assign anything else in your notebook to the variable 'data'. If you name some dictionary or list in your notebook 'data' and then try to load a pydataset further down, you will go crazy trying to figure out why you can't access the datasets from pydataset anymore! Just don't do it! And when you DO do it, because someday you will, remember this warning, find the problematic variable, rename it, and run your imports again.
In [21]:
# This is a pydataset called HairEyeColor.

colors = data('HairEyeColor')
colors.head()
Out[21]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11

Create your DataFrame using copy-pasted tabular data from your clipboard!

  • This is a super cool pandas function that allows you to create a DataFrame from data that you have just copied to your clipboard. Depending on the format of the data you copy, you may need to tweak some parameters to get your DataFrame exactly as you want. Instead of going into a bunch of detail here, check out this simple Medium article on how to use this method.
  • Already have a header? This works.
    pd.read_clipboard()
    
  • No header on the data you copied? No problem, you can create your own with this.
pd.read_clipboard(header=None, names=desired_column_names)

So What's So Great About a Pandas DataFrame?

A DataFrame can be composed of columns of different data types; it allows for indexing and subsetting of data and comes with many useful attributes and methods, adding an enormous amount of functionality to this pandas object. One of the most impressive features of the pandas library is that it offers vectorized functions optimized to operate on Series and DataFrame objects. I explored what this means for the pandas Series object in the Series Review Notebook; this notebook will dig into what this means for a DataFrame object.

I'll explore all of these awesome features in depth throughout the rest of this notebook.


Now What?

Most of the methods and attributes covered in my pandas Series review notebook also belong to pandas DataFrame objects. They may function slightly differently, have additional parameters and/or functionality, or return different objects, but you will find most of them useful whether working with Series or DataFrames. In this notebook, I will expand upon some of the methods and attributes previously covered, but for the most part my goal is to add to my existing toolbox.

In [22]:
# There are 433 methods and attributes of a pandas Series.

series_attribute_methods = set(dir(pd.Series))
len(series_attribute_methods)
Out[22]:
433
In [23]:
# There are 430 methods and attributes of a pandas DataFrame.

df_attribute_methods = set(dir(pd.DataFrame))
len(df_attribute_methods)
Out[23]:
430
In [24]:
# There are 377 methods and attributes that belong to BOTH pandas Series AND DataFrames.

len(series_attribute_methods & df_attribute_methods)
Out[24]:
377
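A set difference shows which members belong to only one of the two objects; a small sketch using the sets created above:

# Peek at a few members unique to DataFrames (like pivot_table)
# and a few unique to Series (like the .str accessor).
sorted(df_attribute_methods - series_attribute_methods)[:5]
sorted(series_attribute_methods - df_attribute_methods)[:5]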

Indexing

  • Like the pandas Series object, the pandas DataFrame object supports both position- and label-based indexing using the indexing operator [].
  • I will demonstrate concrete examples of indexing using the indexing operator [] alone and with the .loc and .iloc attributes below.

[]

  • I can use a list of columns from a DataFrame with the indexing operator (aka bracket notation) to return a subset of my original DataFrame.
In [25]:
# Peek at columns in student_df

student_df.head(1)
Out[25]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment

1 rows × 60 columns

In [26]:
student_df.shape
Out[26]:
(100, 60)
In [27]:
# Create a df that is a subset of the original student_df.

student_subset = student_df[['Country', 'Region']]
student_subset.head()
Out[27]:
Country Region
0 USA TX
1 USA TN
2 USA CO
3 USA ID
4 USA NC
In [28]:
# I can choose columns and rows like this.

student_df[['Country', 'Region']][10:20]
Out[28]:
Country Region
10 USA IA
11 USA NJ
12 USA TX
13 USA FL
14 USA OK
15 USA ID
16 USA TX
17 USA IL
18 USA CA
19 USA ID
In [29]:
# I can create a subset using a boolean Series; I will use this to subset my original df.

female_bool_series = student_df.Gender == 'Female'
female_bool_series.head()
Out[29]:
0     True
1    False
2     True
3    False
4     True
Name: Gender, dtype: bool
In [30]:
# I use my boolean Series to select only observations where `Gender == Female`.

female_subset = student_df[female_bool_series]
female_subset.head()
Out[30]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment
2 USA CO 2018 12 Female 17.0 Ambidextrous 164cm 22cm 160cm ... 6.0 2.0 2.0 Some Graduate degree Rap/Hip hop Invisibility Happy Friend International aid
4 USA NC 2018 12 Female 17.0 Right-Handed 136 26 54 ... 0.0 2.0 2.0 A lot Undergraduate degree Rap/Hip hop Freeze time Happy Relative International aid
5 USA SC 2018 12 Female 17.0 Right-Handed 5.1 6 10 ... 0.0 40.0 1.0 A lot Some college Country Telepathy Happy Religious figure Religious
7 USA CA 2018 12 Female 17.0 Right-Handed 152 22 142 ... 12.0 1.0 3.0 Some Graduate degree Techno/Electronic Freeze time Happy Coach or club leader Education/Youth development

5 rows × 60 columns


.loc

  • I can use the .loc attribute to select specific rows AND columns by index label. My index label can be a number, but it can also be a string label. This indexer offers a lot of flexibility! .loc indexing is inclusive of both slice endpoints and uses index labels, not positions.
df.loc[row_indexer, column_indexer]
In [31]:
# I want the rows from start through 5 (inclusive) and columns from Gender through Armspan_cm (inclusive).

loc_subset = student_df.loc[:5, 'Gender': 'Armspan_cm']
loc_subset
Out[31]:
Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm
0 Female 17.0 Right-Handed 152 22.5 24
1 Male 18.0 NaN NaN NaN NaN
2 Female 17.0 Ambidextrous 164cm 22cm 160cm
3 Male 23.0 Right-Handed 171.5 21.6 162.5
4 Female 17.0 Right-Handed 136 26 54
5 Female 17.0 Right-Handed 5.1 6 10
In [32]:
loc_subset.shape
Out[32]:
(6, 6)

Example of Boolean Indexing Using .loc

  • Here I am passing a boolean Series as a selector to the .loc attribute called on my original DataFrame from above. As you can see below, where the boolean Series has a True value, the corresponding row of the original DataFrame is returned.
In [33]:
# Create a bool Series to select a subset of ambidextrous students using `.loc`

ambi_bool = student_df.Handed == 'Ambidextrous'
ambi_bool.head()
Out[33]:
0    False
1    False
2     True
3    False
4    False
Name: Handed, dtype: bool
In [34]:
# Create subset of ambidextrous students using my bool Series

student_df.loc[ambi_bool]
Out[34]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
2 USA CO 2018 12 Female 17.0 Ambidextrous 164cm 22cm 160cm ... 6.0 2.0 2.0 Some Graduate degree Rap/Hip hop Invisibility Happy Friend International aid
29 USA FL 2018 12 Male 17.0 Ambidextrous 192.5 32.5 202.5 ... 4.0 0.0 0.0 Some Graduate degree Rap/Hip hop Fly Happy Relative Arts, culture, sports
30 USA PA 2018 12 Female 17.0 Ambidextrous 173 24 173 ... 1.0 12.0 7.0 A lot Undergraduate degree Classical Telepathy Happy Friend International aid
57 USA IL 2018 12 Female 17.0 Ambidextrous 168 26 157 ... 7.0 2.0 7.0 A lot Graduate degree Country Freeze time Happy Relative Education/Youth development
62 USA IN 2018 12 Female 17.0 Ambidextrous 145 21.5 145 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 60 columns

In [35]:
# I can select only students who are ambidextrous using my bool Series AND select specific columns.

student_df.loc[ambi_bool, ['Gender', 'ClassGrade' ]]
Out[35]:
Gender ClassGrade
2 Female 12
29 Male 12
30 Female 12
57 Female 12
62 Female 12

  • I can use a lambda function with the .loc attribute to subset my data.
In [36]:
student_df.loc[lambda df: df['Region'] == 'TX']
Out[36]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment
12 USA TX 2018 12 Female 17.0 Right-Handed 165 26 173 ... 3.0 0.0 1.0 A lot Graduate degree Rap/Hip hop Telepathy Healthy Musician or singer Wildlife, animals
16 USA TX 2018 12 Female 17.0 Right-Handed NaN 25 174 ... 4.0 0.0 0.0 Some Other Other Telepathy Famous Actor Wildlife, animals
64 USA TX 2018 12 Male 17.0 Right-Handed 170 28.6 10000 ... 0.2 2.0 1.0 NaN Graduate degree Other Invisibility Happy Friend Health
71 USA TX 2018 12 Female 17.0 Right-Handed 160 24.6 160 ... 0.0 35.0 7.0 A lot Graduate degree Rhythm and blues (R&B) Telepathy Happy Musician or singer International aid
84 USA TX 2018 12 Male 18.0 Right-Handed 184 28 76 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85 USA TX 2018 12 Male 17.0 Right-Handed 175 25 170 ... 0.0 40.0 0.0 Very little Graduate degree Rap/Hip hop Super strength Happy Other Health

7 rows × 60 columns


.iloc

  • I can use the .iloc attribute to select specific rows and columns by index position. .iloc does not accept a boolean Series as a selector like .loc does. It takes in integers representing index position and is NOT inclusive of the stop position.
df.iloc[row_indexer, column_indexer]
In [37]:
student_df.iloc[90:, :3 ]
Out[37]:
Country Region DataYear
90 USA MA 2018
91 USA MA 2018
92 USA CA 2018
93 USA PA 2018
94 USA NJ 2018
95 USA ID 2018
96 USA ID 2018
97 USA PA 2018
98 USA PA 2018
99 USA AL 2018

Manipulating the Index

.set_index()

  • This method allows me to set my DataFrame index using an existing column. I can also pass a Series or Index object the same length as my DataFrame as an argument to create a new index. This will not change my original DataFrame because the default is inplace=False.
df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
In [38]:
# My original fam_df.

fam_df.head(1)
Out[38]:
name signs age
kane_1 Milla Virgo 14
In [39]:
# I can set the `name` column to be my new index.

fam_df.set_index('name')
Out[39]:
signs age
name
Milla Virgo 14
Steve Gemini 44
Faith Aquarius 34
Declan Aries 7

In [40]:
# Create a new index using a pandas Index object of the same length.

fam_df.set_index(pd.Index([1, 2, 3, 4]))
Out[40]:
name signs age
1 Milla Virgo 14
2 Steve Gemini 44
3 Faith Aquarius 34
4 Declan Aries 7

.reset_index()

  • This method will come in handy a lot as we move into methodologies, but for now I'll at least introduce it. It does not change your original DataFrame unless you pass inplace=True; otherwise, assign the returned copy to a new variable or reassign it to your original.
df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
In [41]:
# This resets the fam_df index to default and adds my original index as a column.

fam_df.reset_index()
Out[41]:
index name signs age
0 kane_1 Milla Virgo 14
1 kane_2 Steve Gemini 44
2 kane_3 Faith Aquarius 34
3 kane_4 Declan Aries 7

In [42]:
# Reset my index and rename the original index column if I want to keep it.

fam_df.reset_index().rename(columns={'index': 'id'})
Out[42]:
id name signs age
0 kane_1 Milla Virgo 14
1 kane_2 Steve Gemini 44
2 kane_3 Faith Aquarius 34
3 kane_4 Declan Aries 7

In [43]:
# If I don't want the original index as a column, I can set `drop=True`.

fam_df.reset_index(drop=True)
Out[43]:
name signs age
0 Milla Virgo 14
1 Steve Gemini 44
2 Faith Aquarius 34
3 Declan Aries 7

Aggregating

.groupby()

  • This powerful method allows you to group your data by one or more columns and apply any type of function to each group, returning the calculations in a Series or DataFrame.
  • I can use a single grouping column, a single aggregating column, and a single aggregating function.
df.groupby('grouping_column').agg_column.agg_func()
In [44]:
student_df.head(1)
Out[44]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment

1 rows × 60 columns

In [45]:
# Use a `groupby()` to calculate the average age by Gender; return a df by using [['double_brackets']].

student_df.groupby('Gender')[['Ageyears']].mean()
Out[45]:
Ageyears
Gender
Female 17.021739
Male 17.943396
  • I can use a list of grouping columns and a single aggregating function.
df.groupby(['list', 'of', 'grouping', 'columns']).agg_func()
In [46]:
# Perform a multi-level groupby aggregation.

student_df.groupby(['Gender', 'Handed']).size()
Out[46]:
Gender  Handed      
Female  Ambidextrous     4
        Left-Handed      4
        Right-Handed    38
Male    Ambidextrous     1
        Left-Handed      8
        Right-Handed    42
dtype: int64

.agg()

Chaining the .agg() method with a .groupby() provides more flexibility when I'm aggregating.

I can use a list of grouping columns and perform more than one function on my aggregating column.

df.groupby(['list', 'of', 'grouping', 'columns']).agg_column.agg_func(['func', 'other_func'])
In [47]:
student_df.groupby(['Gender', 'Handed']).Ageyears.agg(['mean', 'median'])
Out[47]:
mean median
Gender Handed
Female Ambidextrous 17.000000 17.0
Left-Handed 17.000000 17.0
Right-Handed 17.026316 17.0
Male Ambidextrous 17.000000 17.0
Left-Handed 17.250000 17.0
Right-Handed 18.119048 17.0

I can use a list of grouping columns and pass a dictionary to the .agg() method to use different aggregating functions on different columns.

In [48]:
student_df.groupby(['Gender', 'Handed']).agg({'Ageyears': 'mean', 'Text_Messages_Sent_Yesterday': 'median'})
Out[48]:
Ageyears Text_Messages_Sent_Yesterday
Gender Handed
Female Ambidextrous 17.000000 15.0
Left-Handed 17.000000 200.0
Right-Handed 17.026316 17.0
Male Ambidextrous 17.000000 30.0
Left-Handed 17.250000 20.0
Right-Handed 18.119048 15.0

Joining

.concat()

  • This function takes in a list or dictionary of Series or DataFrame objects and joins them along a particular axis: row-wise (axis=0) or column-wise (axis=1).
# For example, concat with a list of two DataFrames
pd.concat([df1, df2], axis=0)
  • When your list contains at least one DataFrame, a DataFrame is returned.
  • When concatenating only Series objects row-wise, axis=0, a Series is returned.
  • When concatenating Series or DataFrames column-wise, axis=1, a DataFrame is returned.
# Default is set to row-wise concatenation using an outer join.
pd.concat(objs, axis=0, join='outer')
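A quick sketch (with two throwaway Series) illustrating those return types:

s1 = pd.Series([1, 2], name='a')
s2 = pd.Series([3, 4], name='b')

type(pd.concat([s1, s2], axis=0))  # Series: rows are stacked
type(pd.concat([s1, s2], axis=1))  # DataFrame: columns 'a' and 'b' side by side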
Row-wise Concat

Combine two DataFrame objects with identical columns:

In [49]:
fam_df
Out[49]:
name signs age
kane_1 Milla Virgo 14
kane_2 Steve Gemini 44
kane_3 Faith Aquarius 34
kane_4 Declan Aries 7
In [50]:
# Create a list of dictionaries to be new rows in a DataFrame concatenated to my original fam_df.

new = [{'name': 'Penny', 'signs': 'Libra', 'age': '0'},
       {'name': 'Betty', 'signs': 'Libra', 'age': '1'},
       {'name': 'Pris', 'signs': 'Scorpio', 'age': '2'}]
In [51]:
# Create new_df using my list of dictionaries above.

new_df = pd.DataFrame(new, index=['kane_5', 'kane_6', 'kane_7'])
new_df
Out[51]:
name signs age
kane_5 Penny Libra 0
kane_6 Betty Libra 1
kane_7 Pris Scorpio 2
In [52]:
# Concatenate my new_df to my original fam_df; the default, `axis=0`, will stack these dfs.

fam_df = pd.concat([fam_df, new_df])
fam_df
Out[52]:
name signs age
kane_1 Milla Virgo 14
kane_2 Steve Gemini 44
kane_3 Faith Aquarius 34
kane_4 Declan Aries 7
kane_5 Penny Libra 0
kane_6 Betty Libra 1
kane_7 Pris Scorpio 2
Column-wise Concat

Combine two DataFrame objects with identical index labels:

In [53]:
new_cols_df = pd.DataFrame({'eyes': ['brown', 'brown', 'blue', 'brown', 'amber', 'brown', 'hazel'],
                           'hair': ['brown', 'black', 'blonde', 'red', 'red', 'black', 'red']},
                           index=['kane_1', 'kane_2', 'kane_3','kane_4', 'kane_5', 'kane_6', 'kane_7']) 

new_cols_df
Out[53]:
eyes hair
kane_1 brown brown
kane_2 brown black
kane_3 blue blonde
kane_4 brown red
kane_5 amber red
kane_6 brown black
kane_7 hazel red
In [54]:
fam_df = pd.concat([fam_df, new_cols_df], axis=1)
fam_df
Out[54]:
name signs age eyes hair
kane_1 Milla Virgo 14 brown brown
kane_2 Steve Gemini 44 brown black
kane_3 Faith Aquarius 34 blue blonde
kane_4 Declan Aries 7 brown red
kane_5 Penny Libra 0 amber red
kane_6 Betty Libra 1 brown black
kane_7 Pris Scorpio 2 hazel red

df.merge()

  • This method is similar to a SQL join. Here's a cool read making a comparison between the two, if you're interested.
left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • How does changing the default argument of the how parameter change my resulting DataFrame?
how
  • Type of merge to be performed.

how=left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

how=right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

how=outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

how=inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
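To make the how options concrete, here's a minimal sketch with two toy DataFrames (not the employees data used later):

left = pd.DataFrame({'key': [1, 2], 'l': ['a', 'b']})
right = pd.DataFrame({'key': [2, 3], 'r': ['c', 'd']})

left.merge(right, on='key', how='inner')  # key 2 only
left.merge(right, on='key', how='left')   # keys 1 and 2; r is NaN for key 1
left.merge(right, on='key', how='outer')  # keys 1, 2, and 3, with NaNs filled in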

on
  • Merge on Label or List
  • The default argument for the on parameter is None, so if I am not merging on the indexes of the Series or DataFrame objects I'm joining, this defaults to the intersection of the columns in both objects. Otherwise, I can pass the column name(s) or index level to join on.
In [55]:
# Read in some data from a CSV file to create my `titles` DataFrame.

titles = pd.read_csv('data/titles.csv', index_col=0)
titles.head(2)
Out[55]:
emp_no title from_date to_date
0 10001 Senior Engineer 1986-06-26 9999-01-01
1 10002 Staff 1996-08-03 9999-01-01
In [56]:
# Peek at columns in table I want to merge with.

employees.head(2)
Out[56]:
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
1 10002 1964-06-02 Bezalel Simmel F 1985-11-21
In [57]:
# Merge employees and titles DataFrames on `emp_no` column.

all_emp_titles = employees.merge(titles, on='emp_no')
all_emp_titles.head()
Out[57]:
emp_no birth_date first_name last_name gender hire_date title from_date to_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26 Senior Engineer 1986-06-26 9999-01-01
1 10002 1964-06-02 Bezalel Simmel F 1985-11-21 Staff 1996-08-03 9999-01-01
2 10003 1959-12-03 Parto Bamford M 1986-08-28 Senior Engineer 1995-12-03 9999-01-01
3 10004 1954-05-01 Chirstian Koblick M 1986-12-01 Engineer 1986-12-01 1995-12-01
4 10004 1954-05-01 Chirstian Koblick M 1986-12-01 Senior Engineer 1995-12-01 9999-01-01

Creating and Deleting Columns

.drop()

  • The .drop() method allows me to drop rows or columns from a DataFrame. I can specify a single label or list of labels to drop; the default for the axis parameter is axis=0 (rows).
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
  • If I want to drop columns, I can adjust the axis parameter or, simpler, just pass a single label or list of labels as an argument to columns. Again, inplace=False by default, so I have to either save the resulting DataFrame to a new variable or reassign it to my original variable; reassigning replaces my original data, so be careful.
  • If I reassign my new DataFrame, thus changing my original, and I try to rerun that code, I WILL get an error because that column(s) is gone. Don't be surprised when you hit this error; you will.
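If rerunning cells is a concern, the errors parameter from the signature above can help; a quick sketch:

# With errors='ignore', dropping columns that are already gone is a no-op
# instead of a KeyError, so this cell can be rerun safely.
student_df.drop(columns=['Country', 'Region'], errors='ignore')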
In [58]:
student_df.head(1)
Out[58]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment

1 rows × 60 columns

In [59]:
# Country and Region are gone in the resulting df copy; I did not reassign, so temporarily.

student_df.drop(columns=['Country', 'Region']).head(3)
Out[59]:
DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm Languages_spoken Travel_to_School ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 2018 12 Female 17.0 Right-Handed 152 22.5 24 1.0 Car ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment
1 2018 12 Male 18.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2018 12 Female 17.0 Ambidextrous 164cm 22cm 160cm 2.0 Car ... 6.0 2.0 2.0 Some Graduate degree Rap/Hip hop Invisibility Happy Friend International aid

3 rows × 58 columns


In [60]:
# Drop a row by index position or list of positions.

student_df.drop(index = student_df.index[[2, 5]]).head()
Out[60]:
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment
1 USA TN 2018 12 Male 18.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 USA ID 2018 12 Male 23.0 Right-Handed 171.5 21.6 162.5 ... 2.0 0.0 8.0 A lot Graduate degree Gospel Freeze time Happy Relative Religious
4 USA NC 2018 12 Female 17.0 Right-Handed 136 26 54 ... 0.0 2.0 2.0 A lot Undergraduate degree Rap/Hip hop Freeze time Happy Relative International aid
6 USA ID 2018 12 Male 24.0 Right-Handed 72 24 70 ... 3.0 0.0 2.0 Some Graduate degree Rap/Hip hop NaN Happy Relative Religious

5 rows × 60 columns


.rename()

  • I can rename my columns using this method, but unless I'm changing column names as a part of method chaining or I don't want to change my original DataFrame, I will usually just rename my columns using the .columns attribute. It's good to know how to use both, so you can use the best one for your context.
df.rename(columns={'original_name': 'new_name', 'original_name': 'new_name'})

OR

df.columns = ['new_name_1', 'new_name_2', 'new_name_3']
In [61]:
# If I want to keep these changes, I need to reassign.

student_df.rename(columns={'DataYear': 'year', 'ClassGrade': 'grade'}).head(1)
Out[61]:
Country Region year grade Gender Ageyears Handed Height_cm Footlength_cm Armspan_cm ... Watching_TV_Hours Paid_Work_Hours Work_At_Home_Hours Schoolwork_Pressure Planned_Education_Level Favorite_Music Superpower Preferred_Status Role_Model_Type Charity_Donation
0 USA TX 2018 12 Female 17.0 Right-Handed 152 22.5 24 ... 0.0 35.0 1.0 Very little Graduate degree Rap/Hip hop Invisibility Happy Business person Environment

1 rows × 60 columns


In [62]:
fam_df
Out[62]:
name signs age eyes hair
kane_1 Milla Virgo 14 brown brown
kane_2 Steve Gemini 44 brown black
kane_3 Faith Aquarius 34 blue blonde
kane_4 Declan Aries 7 brown red
kane_5 Penny Libra 0 amber red
kane_6 Betty Libra 1 brown black
kane_7 Pris Scorpio 2 hazel red
In [63]:
# Use `.rename()` method as a part of method chaining. It's a good thing.

rename_fam_df = fam_df.drop(columns='signs').rename(columns={'name': 'family_member'})
rename_fam_df
Out[63]:
family_member age eyes hair
kane_1 Milla 14 brown brown
kane_2 Steve 44 brown black
kane_3 Faith 34 blue blonde
kane_4 Declan 7 brown red
kane_5 Penny 0 amber red
kane_6 Betty 1 brown black
kane_7 Pris 2 hazel red

.columns

Here's a fast way to rename a lot of columns by accessing and reassigning to the .columns attribute.

df.columns = ['new_col_name', 'new_col_name', 'new_col_name'...]
In [64]:
employees.head(1)
Out[64]:
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
In [65]:
# Print out a list of my original column names to steal for my code below.

employees.columns.to_list()
Out[65]:
['emp_no', 'birth_date', 'first_name', 'last_name', 'gender', 'hire_date']
In [66]:
# Copy and paste list and change what I want. Then, reassign new list to df.columns

employees.columns = ['emp_no', 'birth', 'first', 'last', 'gender',
       'hire_date']

employees.head(1)
Out[66]:
emp_no birth first last gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26

.assign()

  • The .assign() method is used to create new columns from existing columns in my DataFrame. I can also create new columns using bracket notation, so I'll demonstrate both ways here, and as always, my specific context will most likely dictate which method I employ in the wild.
  • If I'm making one new column, and I want to add it to my original DataFrame, I will usually just use bracket notation, but when I want to create multiple columns, I'm method chaining, or I don't want to change my existing DataFrame, .assign() is a great tool.
df.assign(new_column = some_expression_or_calculation,
          another_new_column = some_other_expression_if_I_want)
In [67]:
fam_df
Out[67]:
name signs age eyes hair
kane_1 Milla Virgo 14 brown brown
kane_2 Steve Gemini 44 brown black
kane_3 Faith Aquarius 34 blue blonde
kane_4 Declan Aries 7 brown red
kane_5 Penny Libra 0 amber red
kane_6 Betty Libra 1 brown black
kane_7 Pris Scorpio 2 hazel red
In [68]:
# I can make more than one new column at once! I have to assign or reassign this to keep it.

fam_df.assign(shout = fam_df.name.str.upper(),
              name_letter_count = fam_df.name.apply(len), 
              sign_letter_count = fam_df.signs.apply(len))
Out[68]:
name signs age eyes hair shout name_letter_count sign_letter_count
kane_1 Milla Virgo 14 brown brown MILLA 5 5
kane_2 Steve Gemini 44 brown black STEVE 5 6
kane_3 Faith Aquarius 34 blue blonde FAITH 5 8
kane_4 Declan Aries 7 brown red DECLAN 6 5
kane_5 Penny Libra 0 amber red PENNY 5 5
kane_6 Betty Libra 1 brown black BETTY 5 5
kane_7 Pris Scorpio 2 hazel red PRIS 4 7
In [69]:
# My original DataFrame is unchanged because I did not reassign above.

fam_df.head(2)
Out[69]:
name signs age eyes hair
kane_1 Milla Virgo 14 brown brown
kane_2 Steve Gemini 44 brown black

In [70]:
# I can use bracket notation to add one column at a time to my original DataFrame.

fam_df['shout'] = fam_df.name.str.upper()
fam_df
Out[70]:
name signs age eyes hair shout
kane_1 Milla Virgo 14 brown brown MILLA
kane_2 Steve Gemini 44 brown black STEVE
kane_3 Faith Aquarius 34 blue blonde FAITH
kane_4 Declan Aries 7 brown red DECLAN
kane_5 Penny Libra 0 amber red PENNY
kane_6 Betty Libra 1 brown black BETTY
kane_7 Pris Scorpio 2 hazel red PRIS
In [71]:
# I'm changing my original DataFrame each time I create a new column using bracket notation.

fam_df['name_letter_count'] = fam_df.name.apply(len)
fam_df
Out[71]:
name signs age eyes hair shout name_letter_count
kane_1 Milla Virgo 14 brown brown MILLA 5
kane_2 Steve Gemini 44 brown black STEVE 5
kane_3 Faith Aquarius 34 blue blonde FAITH 5
kane_4 Declan Aries 7 brown red DECLAN 6
kane_5 Penny Libra 0 amber red PENNY 5
kane_6 Betty Libra 1 brown black BETTY 5
kane_7 Pris Scorpio 2 hazel red PRIS 4

np.where()

  • This is a really useful NumPy function that I find myself reaching for all the time! I like to use it to create a new column in my DataFrame with values based on another column in my DataFrame.
  • I can base my new column values on whether the values in the existing column meet my condition.
  • There is a df.where() pandas method that is very similar, but it does not function in the same way; don't get confused by this.
np.where(condition, this_where_True, this_where_False)
In [72]:
fam_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, kane_1 to kane_7
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   name               7 non-null      object
 1   signs              7 non-null      object
 2   age                7 non-null      object
 3   eyes               7 non-null      object
 4   hair               7 non-null      object
 5   shout              7 non-null      object
 6   name_letter_count  7 non-null      int64 
dtypes: int64(1), object(6)
memory usage: 448.0+ bytes
In [73]:
# I need to convert my age column to be integers if I want to use it as such.

fam_df['age'] = fam_df.age.astype(int)
In [74]:
# Create a new column with values based on a conditional test on another column.

fam_df['age_group'] = np.where(fam_df.age > 18, 'adult', 'minor')
fam_df.head()
Out[74]:
name signs age eyes hair shout name_letter_count age_group
kane_1 Milla Virgo 14 brown brown MILLA 5 minor
kane_2 Steve Gemini 44 brown black STEVE 5 adult
kane_3 Faith Aquarius 34 blue blonde FAITH 5 adult
kane_4 Declan Aries 7 brown red DECLAN 6 minor
kane_5 Penny Libra 0 amber red PENNY 5 minor

Sorting

.sort_values()

  • This is a very useful method of both pandas Series and DataFrames. When I call this method on a DataFrame, I have to pass an argument to the by parameter to specify which column(s) to sort my DataFrame by.
  • I can pass a string value (column name) or a list (column_names) as the argument to the by parameter, and I can also pass a single boolean value or a list of boolean values to the ascending parameter to curate my sort.
# Defaults for sort_values method.

df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last', ignore_index=False)
In [75]:
# Here I am sorting by first name in ascending order and last name in descending order!

employees.sort_values(by=['first', 'last'], ascending=[True, False]).head()
Out[75]:
emp_no birth first last gender hire_date
35 10036 1959-08-10 Adamantios Portugali M 1992-01-03
34 10035 1953-02-08 Alain Chappelet M 1988-09-05
58 10059 1953-09-19 Alejandro McAlpine F 1991-06-26
38 10039 1959-10-01 Alejandro Brender M 1988-01-19
90 10091 1955-10-04 Amabile Gomatam M 1992-11-18

.sort_index()

  • Just as I can sort by the values in my Series or DataFrame, I can also sort by the index. I also use this method all the time; keep it in mind.
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False)
In [76]:
fam_df.head(3)
Out[76]:
name signs age eyes hair shout name_letter_count age_group
kane_1 Milla Virgo 14 brown brown MILLA 5 minor
kane_2 Steve Gemini 44 brown black STEVE 5 adult
kane_3 Faith Aquarius 34 blue blonde FAITH 5 adult
In [77]:
# I can reverse the order of my fam_df by the index if I want.

fam_df.sort_index(ascending=False)
Out[77]:
name signs age eyes hair shout name_letter_count age_group
kane_7 Pris Scorpio 2 hazel red PRIS 4 minor
kane_6 Betty Libra 1 brown black BETTY 5 minor
kane_5 Penny Libra 0 amber red PENNY 5 minor
kane_4 Declan Aries 7 brown red DECLAN 6 minor
kane_3 Faith Aquarius 34 blue blonde FAITH 5 adult
kane_2 Steve Gemini 44 brown black STEVE 5 adult
kane_1 Milla Virgo 14 brown brown MILLA 5 minor

In [78]:
# I can also sort my DataFrame columns by setting `axis=1`.

fam_df.sort_index(axis=1)
Out[78]:
age age_group eyes hair name name_letter_count shout signs
kane_1 14 minor brown brown Milla 5 MILLA Virgo
kane_2 44 adult brown black Steve 5 STEVE Gemini
kane_3 34 adult blue blonde Faith 5 FAITH Aquarius
kane_4 7 minor brown red Declan 6 DECLAN Aries
kane_5 0 minor amber red Penny 5 PENNY Libra
kane_6 1 minor brown black Betty 5 BETTY Libra
kane_7 2 minor hazel red Pris 4 PRIS Scorpio

Reshaping Data

.T

  • I can access this property of my DataFrame to transpose its index and columns.
In [79]:
fam_df.T
Out[79]:
kane_1 kane_2 kane_3 kane_4 kane_5 kane_6 kane_7
name Milla Steve Faith Declan Penny Betty Pris
signs Virgo Gemini Aquarius Aries Libra Libra Scorpio
age 14 44 34 7 0 1 2
eyes brown brown blue brown amber brown hazel
hair brown black blonde red red black red
shout MILLA STEVE FAITH DECLAN PENNY BETTY PRIS
name_letter_count 5 5 5 6 5 5 4
age_group minor adult adult minor minor minor minor

.pivot_table()

  • This pandas function allows me to create a spreadsheet-style pivot table as a DataFrame. I'll demonstrate a very simple pivot here, but as we deal with more complex data, pivots can do a lot more.
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean')
In [85]:
# View original DataFrame.

cart_items
Out[85]:
title price quantity
0 orange juice 3.99 1
1 rice 1.99 3
2 beans 0.99 3
3 chili sauce 2.99 1
4 chocolate 0.75 9
In [86]:
# Create a new column, total, multiplying price by quantity for use below.

cart_items['total'] = cart_items.price * cart_items.quantity
cart_items
Out[86]:
title price quantity total
0 orange juice 3.99 1 3.99
1 rice 1.99 3 5.97
2 beans 0.99 3 2.97
3 chili sauce 2.99 1 2.99
4 chocolate 0.75 9 6.75
In [87]:
# A simple pivot table setting only `values` and `columns`.

pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title')
Out[87]:
title beans chili sauce chocolate orange juice rice
price 0.99 2.99 0.75 3.99 1.99
quantity 3.00 1.00 9.00 1.00 3.00
total 2.97 2.99 6.75 3.99 5.97
In [88]:
# I can choose different metrics for each of my values/rows.

pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title', aggfunc={'price':'mean', 'quantity':'sum', 'total':'sum'} )
Out[88]:
title beans chili sauce chocolate orange juice rice
price 0.99 2.99 0.75 3.99 1.99
quantity 3.00 1.00 9.00 1.00 3.00
total 2.97 2.99 6.75 3.99 5.97

.crosstab()

  • This function basically creates a frequency table, counting observations for each combination of the factors you pass in.
pd.crosstab(index_series, col_series)
In [89]:
# Read in a CSV file I created earlier to build my DataFrame.

dept_titles = pd.read_csv('data/dept_titles.csv', index_col=0)
dept_titles.head()
Out[89]:
emp_no title from_date to_date dept_name
0 10011 Staff 1990-01-22 1996-11-09 Customer Service
1 10038 Senior Staff 1996-09-20 9999-01-01 Customer Service
2 10038 Staff 1989-09-20 1996-09-20 Customer Service
3 10049 Senior Staff 2000-05-04 9999-01-01 Customer Service
4 10049 Staff 1992-05-04 2000-05-04 Customer Service
In [90]:
# Create a frequency table of titles by department

all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title)
all_titles_crosstab
Out[90]:
title Assistant Engineer Engineer Manager Senior Engineer Senior Staff Staff Technique Leader
dept_name
Customer Service 298 2362 4 2027 13925 16150 309
Development 7769 58135 2 49326 1247 1424 7683
Finance 0 0 2 0 12139 13929 0
Human Resources 0 0 2 0 12274 14342 0
Marketing 0 0 2 0 13940 16196 0
Production 6445 49649 4 42205 1270 1478 6557
Quality Management 1831 13852 4 11864 0 0 1795
Research 378 2986 2 2570 11637 13495 393
Sales 0 0 2 0 36191 41808 0

Styling

  • Make it easier to read your DataFrame by using the style.background_gradient() method of the pandas DataFrame! You can find great advice about choosing the right colormap for your purpose here.
  • By default (axis=0), colors are computed column by column, comparing the values down each column.

df.style.background_gradient(cmap='PuBu' , axis=0)

  • If you want to compute the colors based on the entire matrix, axis=None.
  • I'll show a couple of useful built-in styling methods here, and if you want to read more about styling, check out this styling guide.

Here, I want colors computed based on the column values in each row (departments), so I set axis=1.

In [91]:
all_titles_crosstab.style.background_gradient(cmap='PuBuGn', axis=1)
Out[91]:
title Assistant Engineer Engineer Manager Senior Engineer Senior Staff Staff Technique Leader
dept_name
Customer Service 298 2362 4 2027 13925 16150 309
Development 7769 58135 2 49326 1247 1424 7683
Finance 0 0 2 0 12139 13929 0
Human Resources 0 0 2 0 12274 14342 0
Marketing 0 0 2 0 13940 16196 0
Production 6445 49649 4 42205 1270 1478 6557
Quality Management 1831 13852 4 11864 0 0 1795
Research 378 2986 2 2570 11637 13495 393
Sales 0 0 2 0 36191 41808 0

Maybe you just want to find the max number in each row (department name) quickly.

In [92]:
all_titles_crosstab.style.highlight_max(axis=1)
Out[92]:
title Assistant Engineer Engineer Manager Senior Engineer Senior Staff Staff Technique Leader
dept_name
Customer Service 298 2362 4 2027 13925 16150 309
Development 7769 58135 2 49326 1247 1424 7683
Finance 0 0 2 0 12139 13929 0
Human Resources 0 0 2 0 12274 14342 0
Marketing 0 0 2 0 13940 16196 0
Production 6445 49649 4 42205 1270 1478 6557
Quality Management 1831 13852 4 11864 0 0 1795
Research 378 2986 2 2570 11637 13495 393
Sales 0 0 2 0 36191 41808 0

Want more on reshaping using the pandas crosstab function? This article is a lot of fun!


Chaining

I can chain methods onto my Series or DataFrame as long as my method returns a Series or DataFrame object.

series.method().method().method()

dataframe.method().method().method()
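For instance, here's a hedged sketch chaining several of the methods covered above on fam_df (each call returns a DataFrame, so the chain keeps going):

(fam_df
 .drop(columns='signs')
 .rename(columns={'name': 'family_member'})
 .assign(is_adult=lambda df: df.age > 18)
 .sort_values(by='age', ascending=False))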

.pipe()

  • I can also chain functions using the .pipe() method; it is like .apply() for an entire DataFrame.
  • I can chain functions together to make a pipeline where the output from running one function serves as the input for the next function.
df.pipe(function).pipe(function)
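Here's a minimal sketch (my own throwaway functions, not pandas built-ins) of a two-step pipeline on fam_df:

def drop_minors(df):
    # Keep only the rows where age is over 18.
    return df[df.age > 18]

def shout_names(df):
    # Upper-case the name column.
    return df.assign(name=df.name.str.upper())

# Each .pipe() call passes the resulting DataFrame to the next function.
fam_df.pipe(drop_minors).pipe(shout_names)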