import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data
The pandas library is used to deal with structured data stored in tables. You might acquire the structured data from CSV files, TSV files, SQL database tables, or spreadsheets. You can also create pandas DataFrames yourself. A DataFrame can be thought of as a dictionary-like container for Series. Below, I will go over ways to both create DataFrame objects and read data into a pandas DataFrame.
We saw that the pandas Series object is a one-dimensional, labeled array; the pandas DataFrame object is a two-dimensional, labeled data structure with columns of the same or different data types. A DataFrame is like a sequence of Series objects aligned by the same index. There are three main components that make up a pandas DataFrame: the index, the columns, and the data.
Here are the defaults for the pandas DataFrame constructor function:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Does this look familiar?
shopping_cart = {
"tax": .08,
"items": [
{
"title": "orange juice",
"price": 3.99,
"quantity": 1
},
{
"title": "rice",
"price": 1.99,
"quantity": 3
},
{
"title": "beans",
"price": 0.99,
"quantity": 3
},
{
"title": "chili sauce",
"price": 2.99,
"quantity": 1
},
{
"title": "chocolate",
"price": 0.75,
"quantity": 9
}
]
}
shopping_cart is a dictionary with two keys, tax and items, but the value for items happens to be a list of dictionaries, so depending on what I want from my data, I may need to unpack the items column. I can do that!

df = pd.DataFrame(shopping_cart)
df
I can unpack the items column by applying pd.Series() to the items column. This returns a new DataFrame.

df['items'].apply(pd.Series)
I can unpack the items column and combine the original and new DataFrames using the pd.concat() function if I want the tax column and the new unpacked columns in the same DataFrame. I don't need the original items column anymore, so I can drop that column at the same time using the .drop() method.

# Default settings for the pandas concat function.
pd.concat([df1, df2], axis=0, join='outer')
# I can transform my two DataFrames and then concatenate or do it all at once like below.
big_df = pd.concat([df.drop(['items'], axis=1), df['items'].apply(pd.Series)], axis=1)
big_df
The value of items IS already a list of dictionaries, so if I don't need tax, I can do this.

items = shopping_cart['items']
items

Now I can pass my list of dictionaries straight to pd.DataFrame.

cart_items = pd.DataFrame(items)
cart_items

type(cart_items)
fam = {'name':['Milla', 'Steve', 'Faith', 'Declan'],
'signs':['Virgo', 'Gemini', 'Aquarius', 'Aries'],
'age': [14, 44, 34, 7]}
fam
type(fam)
fam_df = pd.DataFrame(fam)
fam_df
fam_df = pd.DataFrame(fam, index=['kane_1', 'kane_2', 'kane_3', 'kane_4'])
fam_df
fam_df.info()
I can use the pd.read_sql() function to read the results of a SQL query into a pandas DataFrame.

df = pd.read_sql(sql_query, connection_url)
To do this, I will need to import host, password, and user from my env file.
from env import host, password, user
I will write a helper function to build the connection_url argument for my function. My function will take in a string for the name of the db, and the user, host, and password parameters have default arguments set to match the imports from my env file.

def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
sql_query = 'SELECT * FROM employees'
def get_employees_data(db):
    return pd.read_sql(sql_query, get_connection(db))

Now I can call my function to create a DataFrame named employees.

employees = get_employees_data('employees')
When I import a large dataset using a SQL query, it only takes about three or four rounds of restarting my kernel and waiting minutes for the query to run before I decide I need to stop and write my new DataFrame to a CSV file that I can access instantly.
I only need to run this code once to create a new CSV file in my current directory, and then I can read in my data as shown in the next section. Once I have my CSV file, I can comment out the code I was using to read in the data and write it to a CSV file.
employees.to_csv('employees_df.csv')
# Create my DataFrame reading from my own CSV file; way faster now.
employees = pd.read_csv('employees_df.csv', index_col=0)
employees.head()
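One way to wrap this caching pattern in a reusable function (just a sketch, assuming the get_employees_data function defined above and a CSV file name of my choosing):

import os

def get_employees_cached(csv_path='employees_df.csv'):
    # Use the local CSV cache if it exists; otherwise run the SQL query once
    # and write the result so future reads are instant.
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    df = get_employees_data('employees')
    df.to_csv(csv_path)
    return df

employees = get_employees_cached()
employees.head()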
I can use the pd.read_csv() function to read the data from a CSV file into a pandas DataFrame.

# If my csv file is in the same directory as my notebook, I can do this.
df = pd.read_csv('file_name.csv', index_col=0)
# If my csv file is not in the same directory as my notebook, I have to include the file path.
df = pd.read_csv('file_path/file_name.csv', index_col=0)
student_df = pd.read_csv('data/interesting_data.csv')
student_df.head()
from pydataset import data
# Show the documentation for a pydataset.
data(df_string_name, show_doc=True)

# Load a pydataset into a DataFrame.
df = data(df_string_name)

# Calling data() with no arguments lists all of the available datasets.
data()
# This is a pydataset called HairEyeColor.
colors = data('HairEyeColor')
colors.head()
I can use the pd.read_clipboard() function to read data I've copied to my clipboard into a pandas DataFrame.

pd.read_clipboard(header=None, names=desired_column_names)
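For example, here's a quick sketch that assumes I've just copied a small, headerless, tab-separated table to my clipboard; the column names here are made up for illustration:

# Read whatever is currently on the clipboard; header=None says the copied
# text has no header row, and names supplies my own column labels.
clip_df = pd.read_clipboard(header=None, names=['name', 'score'])
clip_df.head()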
A DataFrame can be composed of columns of different data types, it allows for indexing and subsetting of data, and comes with many useful attributes and methods adding an enormous array of functionality to this pandas object. One of the most impressive features of the pandas library is that it offers vectorized functions optimized to operate on Series and DataFrame objects. I explored what this means for the pandas Series object in the Series Review Notebook; this notebook will dig into what this means for a DataFrame object.
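For a quick taste of what a vectorized operation looks like on a DataFrame column, here's a tiny sketch using the cart_items DataFrame created above:

# The multiplication is applied to every row at once; no explicit Python loop needed.
cart_items.price * cart_items.quantity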
I'll explore all of these awesome features in depth throughout the rest of this notebook.
Most of the methods and attributes covered in my pandas Series review notebook also belong to pandas DataFrame objects. They may function slightly differently, have additional parameters and/or functionality, or return different objects, but you will find most of them useful whether working with Series or DataFrames. In this notebook, I will expand upon some of the methods and attributes previously covered, but for the most part my goal is to add to my existing toolbox.
# There are 433 methods and attributes of a pandas Series.
series_attribute_methods = set(dir(pd.Series))
len(series_attribute_methods)
# There are 430 methods and attributes of a pandas DataFrame.
df_attribute_methods = set(dir(pd.DataFrame))
len(df_attribute_methods)
# There are 377 methods and attributes that belong to BOTH pandas Series AND DataFrames.
len(series_attribute_methods & df_attribute_methods)
I can select subsets of my data using []. I'll demonstrate [] alone and with the .loc and .iloc attributes below.

[]

# Peek at columns in student_df
student_df.head(1)
student_df.shape
# Create a df that is a subset of the original student_df.
student_subset = student_df[['Country', 'Region']]
student_subset.head()
# I can choose columns and rows like this.
student_df[['Country', 'Region']][10:20]
# I can create a subset using a boolean Series; I will use this to subset my original df.
female_bool_series = student_df.Gender == 'Female'
female_bool_series.head()
# I use my boolean Series to select only observations where `Gender == Female`.
female_subset = student_df[female_bool_series]
female_subset.head()
.loc

I can use the .loc attribute to select specific rows AND columns by index label. My index label can be a number, but it can also be a string label. This method offers a lot of flexibility! The .loc attribute's indexing is inclusive and uses an index label, not position.

df.loc[row_indexer, column_indexer]
# I want the rows from start through 5 (inclusive) and columns from Gender through Armspan_cm (inclusive).
loc_subset = student_df.loc[:5, 'Gender': 'Armspan_cm']
loc_subset
loc_subset.shape
Example of Boolean Indexing Using .loc

Wherever my boolean Series holds a True value, the observation from the original Series is returned.

# Create a bool Series to select a subset of ambidextrous students using `.loc`
ambi_bool = student_df.Handed == 'Ambidextrous'
ambi_bool.head()
# Create subset of ambidextrous students using my bool Series
student_df.loc[ambi_bool]
# I can select only students who are ambidextrous using my bool Series AND select specific columns.
student_df.loc[ambi_bool, ['Gender', 'ClassGrade' ]]
I can even pass a callable, like a lambda, to the .loc attribute to subset my data.

student_df.loc[lambda df: df['Region'] == 'TX']
.iloc

I can use the .iloc attribute to select specific rows and columns by index position. .iloc does not accept a boolean Series as a selector like .loc does. It takes in integers representing index position and is NOT inclusive.

df.iloc[row_indexer, column_indexer]
student_df.iloc[90:, :3]
.set_index()

The .set_index() method sets the DataFrame index using one or more existing columns. It returns a new DataFrame because the default is inplace=False.

df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
# My original fam_df.
fam_df.head(1)
# I can set the `name` column to be my new index.
fam_df.set_index('name')
# Create a new index using a pandas Index object of the same length.
fam_df.set_index(pd.Index([1, 2, 3, 4]))
.reset_index()

The .reset_index() method resets my DataFrame's index to the default integer index. I can mutate my original DataFrame by setting inplace=True; otherwise, just reassign or assign the new copy.

df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
# This resets the fam_df index to default and adds my original index as a column.
fam_df.reset_index()
# I reset my index and rename the original index column if I want to keep it.
fam_df.reset_index().rename(columns={'index': 'id'})
# If I don't want the original index as a column, I can set `drop=True`.
fam_df.reset_index(drop=True)
.groupby()

The .groupby() method lets me split my data into groups using one or more columns and then aggregate a column within each group.

df.groupby('grouping_column').agg_column.agg_func()
student_df.head(1)
# Use a `groupby()` to calculate the average age by Gender; return a df by using [['double_brackets']].
student_df.groupby('Gender')[['Ageyears']].mean()
df.groupby(['list', 'of', 'grouping', 'columns']).agg_func()
# Perform a multi-level groupby aggregation.
student_df.groupby(['Gender', 'Handed']).size()
.agg()

Chaining the .agg() method with a .groupby() provides more flexibility when I'm aggregating. I can use a list of grouping columns and perform more than one function on my aggregating column.

df.groupby(['list', 'of', 'grouping', 'columns']).agg_column.agg(['func', 'other_func'])
student_df.groupby(['Gender', 'Handed']).Ageyears.agg(['mean', 'median'])
I can use a list of grouping columns and pass a dictionary to the .agg() method to use different aggregating functions on different columns.
student_df.groupby(['Gender', 'Handed']).agg({'Ageyears': 'mean', 'Text_Messages_Sent_Yesterday': 'median'})
.concat()

The pd.concat() function concatenates pandas objects along a particular axis, either row-wise with axis=0 or column-wise with axis=1.

# For example, concat with a list of two DataFrames
pd.concat([df1, df2], axis=0)

When I concatenate Series objects with axis=0, a Series is returned; with axis=1, a DataFrame is returned.

# Default is set to row-wise concatenation using an outer join.
pd.concat(objs, axis=0, join='outer')
Combine two DataFrame objects with identical columns:
fam_df
# Create a list of dictionaries to be new rows in a DataFrame concatenated to my original fam_df.
new = [{'name': 'Penny', 'signs': 'Libra', 'age': '0'},
{'name': 'Betty', 'signs': 'Libra', 'age': '1'},
{'name': 'Pris', 'signs': 'Scorpio', 'age': '2'}]
# Create new_df using my list of dictionaries above.
new_df = pd.DataFrame(new, index=['kane_5', 'kane_6', 'kane_7'])
new_df
# Concatenate my new_df to my original fam_df; the default, `axis=0`, will stack these dfs.
fam_df = pd.concat([fam_df, new_df])
fam_df
Combine two DataFrame objects with identical index labels:
new_cols_df = pd.DataFrame({'eyes': ['brown', 'brown', 'blue', 'brown', 'amber', 'brown', 'hazel'],
'hair': ['brown', 'black', 'blonde', 'red', 'red', 'black', 'red']},
index=['kane_1', 'kane_2', 'kane_3','kane_4', 'kane_5', 'kane_6', 'kane_7'])
new_cols_df
fam_df = pd.concat([fam_df, new_cols_df], axis=1)
fam_df
df.merge()

left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
How does the how parameter change my resulting DataFrame? Each option is demonstrated in the sketch after this list.

how

how=left: use only keys from the left frame, similar to a SQL left outer join; preserve key order.

how=right: use only keys from the right frame, similar to a SQL right outer join; preserve key order.

how=outer: use the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

how=inner: use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
on

The default for the on parameter is None, so if I am not merging on the indexes of the Series or DataFrame objects I'm joining, this defaults to the intersection of the columns in both objects. Otherwise, I can pass the column(s) name or index level to join on.

# Read in some data from a CSV file to create my `titles` DataFrame.
titles = pd.read_csv('data/titles.csv', index_col=0)
titles.head(2)
# Peek at columns in table I want to merge with.
employees.head(2)
# Merge employees and titles DataFrames on `emp_no` column.
all_emp_titles = employees.merge(titles, on='emp_no')
all_emp_titles.head()
.drop()

The .drop() method allows me to drop rows or columns from a DataFrame. I can specify a single label or list of labels to drop; the default for the axis parameter is axis=0 (rows).

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

I can also drop columns by passing a list of column names to the columns parameter. Again, the default is inplace=False, so I have to either save the resulting DataFrame to a new variable or reassign it to my original variable; reassigning WILL replace my original data, so be careful.

student_df.head(1)
# Country and Region are gone in the resulting df copy; I did not reassign, so temporarily.
student_df.drop(columns=['Country', 'Region']).head(3)
# Drop a row by index position or list of positions.
student_df.drop(index = student_df.index[[2, 5]]).head()
.rename()

I can rename columns using the .rename() method or by reassigning a list of new names to the .columns attribute. It's good to know how to use both, so you can use the best one for your context.

df.rename(columns={'original_name_1': 'new_name_1', 'original_name_2': 'new_name_2'})

OR

df.columns = ['new_name_1', 'new_name_2', 'new_name_3']
# If I want to keep these changes, I need to reassign.
student_df.rename(columns={'DataYear': 'year', 'ClassGrade': 'grade'}).head(1)
fam_df
# Use `.rename()` method as a part of method chaining. It's a good thing.
rename_fam_df = fam_df.drop(columns='signs').rename(columns={'name': 'family_member'})
rename_fam_df
.columns

Here's a fast way to rename a lot of columns by accessing and reassigning to the .columns attribute.

df.columns = ['new_col_name', 'new_col_name', 'new_col_name'...]
employees.head(1)
# Print out a list of my original column names to steal for my code below.
employees.columns.to_list()
# Copy and paste list and change what I want. Then, reassign new list to df.columns
employees.columns = ['emp_no', 'birth', 'first', 'last', 'gender',
'hire_date']
employees.head(1)
.assign()

The .assign() method is used to create new columns from existing columns in my DataFrame. I can also create new columns using bracket notation, so I'll demonstrate both ways here, and as always, my specific context will most likely dictate which method I employ in the wild. .assign() is a great tool.

df.assign(new_column = some_expression_or_calculation,
          another_new_column = some_other_expression_if_I_want)
fam_df
# I can make more than one new column at once! I have to assign or reassign this to keep it.
fam_df.assign(shout = fam_df.name.str.upper(),
name_letter_count = fam_df.name.apply(len),
sign_letter_count = fam_df.signs.apply(len))
# My original DataFrame is unchanged because I did not reassign above.
fam_df.head(2)
# I can use bracket notation to add one column at a time to my original DataFrame.
fam_df['shout'] = fam_df.name.str.upper()
fam_df
# I'm changing my original DataFrame each time I create a new column using bracket notation.
fam_df['name_letter_count'] = fam_df.name.apply(len)
fam_df
np.where()

There is a df.where() pandas method that is very similar, but it does not function in the same way; don't get confused by this.

np.where(condition, this_where_True, this_where_False)
fam_df.info()
# I need to convert my age column to be integers if I want to use it as such.
fam_df['age'] = fam_df.age.astype(int)
# Create a new column with values based on a conditional test on another column.
fam_df['age_group'] = np.where(fam_df.age > 18, 'adult', 'minor')
fam_df.head()
.sort_values()

The .sort_values() method uses the by parameter to specify which column(s) to sort my DataFrame by. I can pass a single column name or a list of column names to the by parameter, and I can also pass a single boolean value or a list of boolean values to the ascending parameter to curate my sort.

# Defaults for sort_values method.
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort',
na_position='last', ignore_index=False)
# Here I am sorting by first name in ascending order and last name in descending order!
employees.sort_values(by=['first', 'last'], ascending=[True, False]).head()
.sort_index()

The .sort_index() method sorts my DataFrame by its index labels instead of by column values.

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False)
fam_df.head(3)
# I can reverse the order of my fam_df by the index if I want.
fam_df.sort_index(ascending=False)
# I can also sort my DataFrame columns by setting `axis=1`.
fam_df.sort_index(axis=1)
.T

The .T attribute returns the transpose of my DataFrame, swapping the rows and columns.

fam_df.T
.pivot_table()

The pd.pivot_table() function creates a spreadsheet-style pivot table that aggregates my data.

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean')
# View original DataFrame.
cart_items
# Create a new column calculating price and quantity for use below.
cart_items['total'] = cart_items.price * cart_items.quantity
cart_items
# A simple pivot table setting only `values` and `columns`.
pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title')
# I can choose different metrics for each of my values/rows.
pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title', aggfunc={'price':'mean', 'quantity':'sum', 'total':'sum'} )
.crosstab()

The pd.crosstab() function computes a frequency table of two or more factors, like a cross-tabulation of my categorical columns.

pd.crosstab(index_series, col_series)
# Here I'm reading in a CSV file I created to create my DataFrame.
dept_titles = pd.read_csv('data/dept_titles.csv', index_col=0)
dept_titles.head()
# Create a frequency table of titles by department
all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title)
all_titles_crosstab
I can add some color to my table using the style.background_gradient() method of the pandas DataFrame! You can find great advice about choosing the right colormap for your purpose here.

df.style.background_gradient(cmap='PuBu', axis=0)

If I want the gradient computed over the values of the entire table, I can set axis=None. Here, I want colors computed based on the column values in each row (departments), so I set axis=1.

all_titles_crosstab.style.background_gradient(cmap='PuBuGn', axis=1)
Maybe you just want to find the max number in each row (department name) quickly.
all_titles_crosstab.style.highlight_max(axis=1)
Want more on reshaping using the pandas crosstab function? This article is a lot of fun!
I can chain methods onto my Series or DataFrame as long as my method returns a Series or DataFrame object.
series.method().method().method()
dataframe.method().method().method()
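For instance, here's a short chain on student_df; it assumes the column names we've already seen above:

# Each method returns a DataFrame, so the next method can be called right on the result.
(student_df
 .drop(columns=['Country', 'Region'])
 .rename(columns={'ClassGrade': 'grade'})
 .sort_values(by='Ageyears')
 .head())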
.pipe()

I can use the .pipe() method; it is like .apply() for an entire DataFrame.

df.pipe(function).pipe(function)
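A minimal sketch of chaining .pipe() calls, assuming the fam_df built above; both helper functions are made up for illustration:

# Each function takes a DataFrame and returns a DataFrame, so .pipe() can
# chain them just like built-in methods.
def shout_names(df):
    return df.assign(name=df.name.str.upper())

def add_is_adult(df):
    return df.assign(is_adult=df.age >= 18)

fam_df.pipe(shout_names).pipe(add_is_adult).head()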