import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data
The pandas library is used to deal with structured data stored in tables. You might acquire the structured data from CSV files, TSV files, SQL database tables, or spreadsheets. You can also create pandas DataFrames yourself. A DataFrame can be thought of as a dictionary-like container for Series. Below, I will go over ways to both create DataFrame objects and read data into a pandas DataFrame.
We saw that the pandas Series object is a one-dimensional, labeled array; the pandas DataFrame object is a two-dimensional, labeled data structure with columns of the same or different data types. A DataFrame is like a sequence of Series objects aligned by the same index. There are three main components that make up a pandas DataFrame: the index, the columns, and the data.
Here are the defaults for the pandas DataFrame constructor function:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Does this look familiar?
shopping_cart = {
"tax": .08,
"items": [
{
"title": "orange juice",
"price": 3.99,
"quantity": 1
},
{
"title": "rice",
"price": 1.99,
"quantity": 3
},
{
"title": "beans",
"price": 0.99,
"quantity": 3
},
{
"title": "chili sauce",
"price": 2.99,
"quantity": 1
},
{
"title": "chocolate",
"price": 0.75,
"quantity": 9
}
]
}
shopping_cart is a dictionary with two keys, tax and items, but the value for items happens to be a list of dictionaries, so depending on what I want from my data, I may need to unpack the items column. I can do that!

df = pd.DataFrame(shopping_cart)
df
I can unpack the items column by applying pd.Series() to the items column. This returns a new DataFrame.

df['items'].apply(pd.Series)
I can unpack the items column and combine the original and new DataFrames using the pd.concat() function if I want the tax column and the new unpacked columns in the same DataFrame. I don't need the original items column anymore, so I can drop that column at the same time using the .drop() method.

# Default settings for the pandas concat function.
pd.concat([df1, df2], axis=0, join='outer')
# I can transform my two DataFrames and then concatenate or do it all at once like below.
big_df = pd.concat([df.drop(['items'], axis=1), df['items'].apply(pd.Series)], axis=1)
big_df
The value of items IS already a list of dictionaries, so if I don't need tax, I can do this.

items = shopping_cart['items']
items

Now I can pass my list of dictionaries straight to pd.DataFrame.

cart_items = pd.DataFrame(items)
cart_items

type(cart_items)
fam = {'name':['Milla', 'Steve', 'Faith', 'Declan'],
'signs':['Virgo', 'Gemini', 'Aquarius', 'Aries'],
'age': [14, 44, 34, 7]}
fam
type(fam)
fam_df = pd.DataFrame(fam)
fam_df
fam_df = pd.DataFrame(fam, index=['kane_1', 'kane_2', 'kane_3', 'kane_4'])
fam_df
fam_df.info()
I can use the pd.read_sql() function to read the results of a SQL query into a pandas DataFrame.

df = pd.read_sql(sql_query, connection_url)
To do this, I will need to import host, password, and user from my env file.
from env import host, password, user
I will write a helper function to build the connection_url argument for my function. My function will take in a string for the name of the db, and the user, host, and password parameters have default arguments set to match the imports from my env file.

def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
sql_query = 'SELECT * FROM employees'
def get_employees_data(db):
    return pd.read_sql(sql_query, get_connection(db))

Now I can call my function to create a DataFrame named employees.

employees = get_employees_data('employees')
When I import a large dataset using a SQL query, it only takes about three or four rounds of restarting my kernel and waiting minutes for the query to run before I decide I need to stop and write my new DataFrame to a CSV file that I can access instantly.
I only need to run this code once to create a new CSV file in my current directory, and then I can read in my data as shown in the next section. Once I have my CSV file, I can comment out the code I was using to read in the data and write it to a CSV file.
employees.to_csv('employees_df.csv')
# Create my DataFrame reading from my own CSV file; way faster now.
employees = pd.read_csv('employees_df.csv', index_col=0)
employees.head()
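One way to wrap this caching pattern in a reusable function (just a sketch, assuming the get_employees_data function defined above and a CSV file name of my choosing):

import os

def get_employees_cached(csv_path='employees_df.csv'):
    # Use the local CSV cache if it exists; otherwise run the SQL query once
    # and write the result so future reads are instant.
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    df = get_employees_data('employees')
    df.to_csv(csv_path)
    return df

employees = get_employees_cached()
employees.head()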
I can use the pd.read_csv() function to read the data from a CSV file into a pandas DataFrame.

# If my csv file is in the same directory as my notebook, I can do this.
df = pd.read_csv('file_name.csv', index_col=0)
# If my csv file is not in the same directory as my notebook, I have to include the file path.
df = pd.read_csv('file_path/file_name.csv', index_col=0)
student_df = pd.read_csv('data/interesting_data.csv')
student_df.head()
from pydataset import data
# Show the documentation for a pydataset.
data(df_string_name, show_doc=True)

# Load a pydataset into a DataFrame.
df = data(df_string_name)

# Calling data() with no arguments lists all of the available datasets.
data()
# This is a pydataset called HairEyeColor.
colors = data('HairEyeColor')
colors.head()
I can use the pd.read_clipboard() function to read data I've copied to my clipboard into a pandas DataFrame.

pd.read_clipboard(header=None, names=desired_column_names)
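For example, here's a quick sketch that assumes I've just copied a small, headerless, tab-separated table to my clipboard; the column names here are made up for illustration:

# Read whatever is currently on the clipboard; header=None says the copied
# text has no header row, and names supplies my own column labels.
clip_df = pd.read_clipboard(header=None, names=['name', 'score'])
clip_df.head()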
A DataFrame can be composed of columns of different data types, it allows for indexing and subsetting of data, and comes with many useful attributes and methods adding an enormous array of functionality to this pandas object. One of the most impressive features of the pandas library is that it offers vectorized functions optimized to operate on Series and DataFrame objects. I explored what this means for the pandas Series object in the Series Review Notebook; this notebook will dig into what this means for a DataFrame object.
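For a quick taste of what a vectorized operation looks like on a DataFrame column, here's a tiny sketch using the cart_items DataFrame created above:

# The multiplication is applied to every row at once; no explicit Python loop needed.
cart_items.price * cart_items.quantity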
I'll explore all of these awesome features in depth throughout the rest of this notebook.
Most of the methods and attributes covered in my pandas Series review notebook also belong to pandas DataFrame objects. They may function slightly differently, have additional parameters and/or functionality, or return different objects, but you will find most of them useful whether working with Series or DataFrames. In this notebook, I will expand upon some of the methods and attributes previously covered, but for the most part my goal is to add to my existing toolbox.
# There are 433 methods and attributes of a pandas Series.
series_attribute_methods = set(dir(pd.Series))
len(series_attribute_methods)
# There are 430 methods and attributes of a pandas DataFrame.
df_attribute_methods = set(dir(pd.DataFrame))
len(df_attribute_methods)
# There are 377 methods and attributes that belong to BOTH pandas Series AND DataFrames.
len(series_attribute_methods & df_attribute_methods)
I can select subsets of my data using []. I'll demonstrate [] alone and with the .loc and .iloc attributes below.

[]

# Peek at columns in student_df
student_df.head(1)
student_df.shape
# Create a df that is a subset of the original student_df.
student_subset = student_df[['Country', 'Region']]
student_subset.head()
# I can choose columns and rows like this.
student_df[['Country', 'Region']][10:20]
# I can create a subset using a boolean Series; I will use this to subset my original df.
female_bool_series = student_df.Gender == 'Female'
female_bool_series.head()
# I use my boolean Series to select only observations where `Gender == Female`.
female_subset = student_df[female_bool_series]
female_subset.head()
.loc

I can use the .loc attribute to select specific rows AND columns by index label. My index label can be a number, but it can also be a string label. This method offers a lot of flexibility! The .loc attribute's indexing is inclusive and uses an index label, not position.

df.loc[row_indexer, column_indexer]
# I want the rows from start through 5 (inclusive) and columns from Gender through Armspan_cm (inclusive).
loc_subset = student_df.loc[:5, 'Gender': 'Armspan_cm']
loc_subset
loc_subset.shape
Example of Boolean Indexing Using .loc

Wherever my boolean Series holds a True value, the observation from the original Series is returned.

# Create a bool Series to select a subset of ambidextrous students using `.loc`
ambi_bool = student_df.Handed == 'Ambidextrous'
ambi_bool.head()
# Create subset of ambidextrous students using my bool Series
student_df.loc[ambi_bool]
# I can select only students who are ambidextrous using my bool Series AND select specific columns.
student_df.loc[ambi_bool, ['Gender', 'ClassGrade' ]]
I can even pass a callable, like a lambda, to the .loc attribute to subset my data.

student_df.loc[lambda df: df['Region'] == 'TX']
.iloc

I can use the .iloc attribute to select specific rows and columns by index position. .iloc does not accept a boolean Series as a selector like .loc does. It takes in integers representing index position and is NOT inclusive.

df.iloc[row_indexer, column_indexer]
student_df.iloc[90:, :3]
.set_index()

The .set_index() method sets the DataFrame index using one or more existing columns. It returns a new DataFrame because the default is inplace=False.

df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
# My original fam_df.
fam_df.head(1)
# I can set the `name` column to be my new index.
fam_df.set_index('name')
# Create a new index using a pandas Index object of the same length.
fam_df.set_index(pd.Index([1, 2, 3, 4]))
.reset_index()

The .reset_index() method resets my DataFrame's index to the default integer index. I can mutate my original DataFrame by setting inplace=True; otherwise, just reassign or assign the new copy.

df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
# This resets the fam_df index to default and adds my original index as a column.
fam_df.reset_index()
# I reset my index and rename the original index column if I want to keep it.
fam_df.reset_index().rename(columns={'index': 'id'})
# If I don't want the original index as a column, I can set `drop=True`.
fam_df.reset_index(drop=True)
.groupby()

The .groupby() method lets me split my data into groups using one or more columns and then aggregate a column within each group.

df.groupby('grouping_column').agg_column.agg_func()
student_df.head(1)
# Use a `groupby()` to calculate the average age by Gender; return a df by using [['double_brackets']].
student_df.groupby('Gender')[['Ageyears']].mean()
df.groupby(['list', 'of', 'grouping', 'columns']).agg_func()
# Perform a multi-level groupby aggregation.
student_df.groupby(['Gender', 'Handed']).size()
.agg()

Chaining the .agg() method with a .groupby() provides more flexibility when I'm aggregating. I can use a list of grouping columns and perform more than one function on my aggregating column.

df.groupby(['list', 'of', 'grouping', 'columns']).agg_column.agg(['func', 'other_func'])
student_df.groupby(['Gender', 'Handed']).Ageyears.agg(['mean', 'median'])
I can use a list of grouping columns and pass a dictionary to the .agg() method to use different aggregating functions on different columns.
student_df.groupby(['Gender', 'Handed']).agg({'Ageyears': 'mean', 'Text_Messages_Sent_Yesterday': 'median'})
.concat()

The pd.concat() function concatenates pandas objects along a particular axis, either row-wise with axis=0 or column-wise with axis=1.

# For example, concat with a list of two DataFrames
pd.concat([df1, df2], axis=0)

When I concatenate Series objects with axis=0, a Series is returned; with axis=1, a DataFrame is returned.

# Default is set to row-wise concatenation using an outer join.
pd.concat(objs, axis=0, join='outer')
Combine two DataFrame objects with identical columns:
fam_df
# Create a list of dictionaries to be new rows in a DataFrame concatenated to my original fam_df.
new = [{'name': 'Penny', 'signs': 'Libra', 'age': '0'},
{'name': 'Betty', 'signs': 'Libra', 'age': '1'},
{'name': 'Pris', 'signs': 'Scorpio', 'age': '2'}]
# Create new_df using my list of dictionaries above.
new_df = pd.DataFrame(new, index=['kane_5', 'kane_6', 'kane_7'])
new_df
# Concatenate my new_df to my original fam_df; the default, `axis=0`, will stack these dfs.
fam_df = pd.concat([fam_df, new_df])
fam_df
Combine two DataFrame objects with identical index labels:
new_cols_df = pd.DataFrame({'eyes': ['brown', 'brown', 'blue', 'brown', 'amber', 'brown', 'hazel'],
'hair': ['brown', 'black', 'blonde', 'red', 'red', 'black', 'red']},
index=['kane_1', 'kane_2', 'kane_3','kane_4', 'kane_5', 'kane_6', 'kane_7'])
new_cols_df
fam_df = pd.concat([fam_df, new_cols_df], axis=1)
fam_df
df.merge()

left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
How does the how parameter change my resulting DataFrame? Each option is demonstrated in the sketch after this list.

how

how=left: use only keys from the left frame, similar to a SQL left outer join; preserve key order.

how=right: use only keys from the right frame, similar to a SQL right outer join; preserve key order.

how=outer: use the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

how=inner: use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
on

The default for the on parameter is None, so if I am not merging on the indexes of the Series or DataFrame objects I'm joining, this defaults to the intersection of the columns in both objects. Otherwise, I can pass the column(s) name or index level to join on.

# Read in some data from a CSV file to create my `titles` DataFrame.
titles = pd.read_csv('data/titles.csv', index_col=0)
titles.head(2)
# Peek at columns in table I want to merge with.
employees.head(2)
# Merge employees and titles DataFrames on `emp_no` column.
all_emp_titles = employees.merge(titles, on='emp_no')
all_emp_titles.head()
.drop()

The .drop() method allows me to drop rows or columns from a DataFrame. I can specify a single label or list of labels to drop; the default for the axis parameter is axis=0 (rows).

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

I can also drop columns by passing a list of column names to the columns parameter. Again, the default is inplace=False, so I have to either save the resulting DataFrame to a new variable or reassign it to my original variable; reassigning WILL replace my original data, so be careful.

student_df.head(1)
# Country and Region are gone in the resulting df copy; I did not reassign, so temporarily.
student_df.drop(columns=['Country', 'Region']).head(3)
# Drop a row by index position or list of positions.
student_df.drop(index = student_df.index[[2, 5]]).head()
.rename()

I can rename columns using the .rename() method or by reassigning a list of new names to the .columns attribute. It's good to know how to use both, so you can use the best one for your context.

df.rename(columns={'original_name_1': 'new_name_1', 'original_name_2': 'new_name_2'})

OR

df.columns = ['new_name_1', 'new_name_2', 'new_name_3']
# If I want to keep these changes, I need to reassign.
student_df.rename(columns={'DataYear': 'year', 'ClassGrade': 'grade'}).head(1)
fam_df
# Use `.rename()` method as a part of method chaining. It's a good thing.
rename_fam_df = fam_df.drop(columns='signs').rename(columns={'name': 'family_member'})
rename_fam_df
.columns

Here's a fast way to rename a lot of columns by accessing and reassigning to the .columns attribute.

df.columns = ['new_col_name', 'new_col_name', 'new_col_name'...]
employees.head(1)
# Print out a list of my original column names to steal for my code below.
employees.columns.to_list()
# Copy and paste list and change what I want. Then, reassign new list to df.columns
employees.columns = ['emp_no', 'birth', 'first', 'last', 'gender',
'hire_date']
employees.head(1)
.assign()

The .assign() method is used to create new columns from existing columns in my DataFrame. I can also create new columns using bracket notation, so I'll demonstrate both ways here, and as always, my specific context will most likely dictate which method I employ in the wild. .assign() is a great tool.

df.assign(new_column = some_expression_or_calculation,
          another_new_column = some_other_expression_if_I_want)
fam_df
# I can make more than one new column at once! I have to assign or reassign this to keep it.
fam_df.assign(shout = fam_df.name.str.upper(),
name_letter_count = fam_df.name.apply(len),
sign_letter_count = fam_df.signs.apply(len))
# My original DataFrame is unchanged because I did not reassign above.
fam_df.head(2)
# I can use bracket notation to add one column at a time to my original DataFrame.
fam_df['shout'] = fam_df.name.str.upper()
fam_df
# I'm changing my original DataFrame each time I create a new column using bracket notation.
fam_df['name_letter_count'] = fam_df.name.apply(len)
fam_df
np.where()

There is a df.where() pandas method that is very similar, but it does not function in the same way; don't get confused by this.

np.where(condition, this_where_True, this_where_False)
fam_df.info()
# I need to convert my age column to be integers if I want to use it as such.
fam_df['age'] = fam_df.age.astype(int)
# Create a new column with values based on a conditional test on another column.
fam_df['age_group'] = np.where(fam_df.age > 18, 'adult', 'minor')
fam_df.head()
.sort_values()

The .sort_values() method uses the by parameter to specify which column(s) to sort my DataFrame by. I can pass a single column name or a list of column names to the by parameter, and I can also pass a single boolean value or a list of boolean values to the ascending parameter to curate my sort.

# Defaults for sort_values method.
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort',
na_position='last', ignore_index=False)
# Here I am sorting by first name in ascending order and last name in descending order!
employees.sort_values(by=['first', 'last'], ascending=[True, False]).head()
.sort_index()

The .sort_index() method sorts my DataFrame by its index labels instead of by column values.

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False)
fam_df.head(3)
# I can reverse the order of my fam_df by the index if I want.
fam_df.sort_index(ascending=False)
# I can also sort my DataFrame columns by setting `axis=1`.
fam_df.sort_index(axis=1)
.T

The .T attribute returns the transpose of my DataFrame, swapping the rows and columns.

fam_df.T
.pivot_table()

The pd.pivot_table() function creates a spreadsheet-style pivot table that aggregates my data.

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean')
# View original DataFrame.
cart_items
# Create a new column calculating price and quantity for use below.
cart_items['total'] = cart_items.price * cart_items.quantity
cart_items
# A simple pivot table setting only `values` and `columns`.
pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title')
# I can choose different metrics for each of my values/rows.
pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title', aggfunc={'price':'mean', 'quantity':'sum', 'total':'sum'} )
.crosstab()

The pd.crosstab() function computes a frequency table of two or more factors, like a cross-tabulation of my categorical columns.

pd.crosstab(index_series, col_series)
# Here I'm reading in a CSV file I created to create my DataFrame.
dept_titles = pd.read_csv('data/dept_titles.csv', index_col=0)
dept_titles.head()
# Create a frequency table of titles by department
all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title)
all_titles_crosstab
I can add some color to my table using the style.background_gradient() method of the pandas DataFrame! You can find great advice about choosing the right colormap for your purpose here.

df.style.background_gradient(cmap='PuBu', axis=0)

If I want the gradient computed over the values of the entire table, I can set axis=None. Here, I want colors computed based on the column values in each row (departments), so I set axis=1.

all_titles_crosstab.style.background_gradient(cmap='PuBuGn', axis=1)
Maybe you just want to find the max number in each row (department name) quickly.
all_titles_crosstab.style.highlight_max(axis=1)
Want more on reshaping using the pandas crosstab function? This article is a lot of fun!
I can chain methods onto my Series or DataFrame as long as my method returns a Series or DataFrame object.
series.method().method().method()
dataframe.method().method().method()
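For instance, here's a short chain on student_df; it assumes the column names we've already seen above:

# Each method returns a DataFrame, so the next method can be called right on the result.
(student_df
 .drop(columns=['Country', 'Region'])
 .rename(columns={'ClassGrade': 'grade'})
 .sort_values(by='Ageyears')
 .head())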
.pipe()

I can use the .pipe() method; it is like .apply() for an entire DataFrame.

df.pipe(function).pipe(function)
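A minimal sketch of chaining .pipe() calls, assuming the fam_df built above; both helper functions are made up for illustration:

# Each function takes a DataFrame and returns a DataFrame, so .pipe() can
# chain them just like built-in methods.
def shout_names(df):
    return df.assign(name=df.name.str.upper())

def add_is_adult(df):
    return df.assign(is_adult=df.age >= 18)

fam_df.pipe(shout_names).pipe(add_is_adult).head()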