import pandas as pd
import numpy as np
from requests import get
import re
from bs4 import BeautifulSoup
import os
Beautiful Soup is a Python library for scraping information from web pages. When you web scrape data from a site, you have to be careful to make sure the site allows the practice. You can check by typing /robots.txt
after the base url of a site. Using the headers
parameter in your request is also part of web scraping best practices.
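For example, a quick sketch of checking a site's robots.txt and identifying yourself with a User-Agent header might look like this (the urls and header value below are just placeholders):
from requests import get
# Check the site's robots.txt to see which paths crawlers are allowed to request.
robots = get('https://codeup.com/robots.txt')
print(robots.text)
# Identify your request with a User-Agent header.
headers = {'User-Agent': 'Codeup Data Science'}
response = get('https://codeup.com/resources/#blog', headers=headers)
print(response.ok)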
Check out the docs for BeautifulSoup here. I also found it very useful to code along with some articles and tutorials I found online to get a feel for scraping.
Here, we are looking to retrieve content from a web page, but the web page is written in HTML (HyperText Markup Language), so we will use the requests
library to get a response with the HTML from our desired page and BeautifulSoup
to parse the HTML response. As you begin scraping, it would be helpful to have a basic understanding of the different HTML elements and attributes used to create web pages.
An HTML element consists of a start tag and an end tag along with the content between the tags. For example:
<div>content...content</div>
HTML elements can be nested or contain other elements. For example:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
<html>
tags define the html element --- the whole document.
<body>
tags define the body element --- the document body.
<h1>
to <h6>
tags define a heading element --- a heading (<h1> to <h6>, largest to smallest heading size).
<p>
tags define a paragraph element --- a new paragraph of text.
<a>
tags define an anchor element, which tells the browser to render a hyperlink to a web page, file, email address, etc. Anchor elements use the href
attribute to tell the link where to go.
<a href='url_of_link'>Text of the link</a>
<div>
tags define a division element, like a container; it is used to group and style block-level content using the class
or id
attributes (defined below).
<span>
element is also like a container, like the <div>
element above, but for styling inline elements instead of block-level.
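For example, a span can be used to style just part of a line of text (the class name here is only a placeholder):
<p>Some text with <span class='fancy'>inline styled text</span> inside.</p>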
<img>
element defines an image and uses the src
attribute to hold the image address. The <img>
tag is self-closing, which means it doesn't need a closing tag.
HTML attributes are optional and appear inside of the opening tag, usually as name/value pairs like name='value'
, but they make the HTML elements easier to work with because they give the elements names. You will have to examine a web page to find out if it uses these attributes. For example, let's add a class attribute to our <div>
element from above.
<div class='descriptive_class_name'>content...content</div>
class
is an attribute of an HTML element that defines equal styles for tags with the same class. One element can have multiple classes and different elements can share the same classes, so classes cannot be used as unique identifiers.
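For example, one element can carry two classes at once (these class names are just placeholders):
<div class='first_class second_class'>content...content</div>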
id
is an attribute of an HTML element. Each element can only have one id, so they can be used as unique identifiers.
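For example:
<div id='unique_id_name'>content...content</div>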
itemprop
is an attribute that consists of a name-value pair and is used to add properties to an element.
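For example, the Codeup blog pages scraped below mark their titles with something like:
<h1 itemprop='headline'>Title of the blog post</h1>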
href
is an attribute of an <a>
element that contains the link address.
<a href="destination.com"></a>
src
is an attribute of an <img>
element that contains the address for an image. I can size my image using the width=
and height=
attributes, as well, if I like.
<img src="img_name.jpg" width="500" height="600">
We will need to use the requests
library to retrieve the HTML from a web page we want to scrape. You can review how to use the requests
library in my notebook here.
Next, we will inspect the structure of the web page by right-clicking on the page we want to scrape and clicking inspect
. By clicking the icon in the far upper left of the new window, we can move our cursor over the part of the web page we want to scrape and see the responsible HTML code for that section highlighted on the right.
We can use HTML tags, CSS class (class_=''
), Regex patterns, CSS selectors, and more with BeautifulSoup
search methods to retrieve the information we want. For example:
# Create our soup object using BeautifulSoup and our response string using get() method from requests library.
from requests import get
from bs4 import BeautifulSoup
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the first instance of the specific tag_name.
# param name -> A filter on tag name. Default: name=None
# param attrs -> A dictionary of filters on attribute values. Default: attrs={}
soup.find(name, attrs)
# Extract all of the instances of the specific tag_name.
soup.find_all(name, attrs)
# Return a dictionary of all attributes of this tag.
tag.attrs
# Return all the text in this tag.
tag.text
# Return a list of all children elements of this tag.
tag.contents
You can find more about filtering your HTML requests with BeautifulSoup
search methods here.
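As a quick sketch of a couple of the other filter styles mentioned above, we could pass a compiled Regex pattern to find_all or use a CSS selector with the select method (this assumes the same soup object created above):
import re
# Filter on tag name with a regex pattern: match <h1> through <h6> tags.
headings = soup.find_all(re.compile('^h[1-6]$'))
# Filter on an attribute value with a regex: anchor tags whose href contains 'codeup'.
links = soup.find_all('a', href=re.compile('codeup'))
# Use a CSS selector: anchor tags with a specific class.
selected = soup.select('a.jet-listing-dynamic-link__link')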
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
response.ok
# Here's our long string; we'll use this to make our soup object
print(type(response.text))
# Use BeautifulSoup using our response string
soup = BeautifulSoup(response.text, 'html.parser')
# Now we have our BeautifulSoup object, we can use its built-in methods and properties
print(type(soup))
Goals: Write a function to scrape urls from the main Codeup blog web page and write a function that returns a dictionary of blog titles and text for each blog page.
Here I use the .find()
method on my soup with the <h1>
tag and its itemprop
attribute equal to headline
. As always, there is no one way to accomplish our task, so I'm demonstrating one way to scrape the headline, not THE way to scrape the headline.
# This is what the h1 element contains. I want to access the itemprop headline
soup.find('h1')
# I will use the find method on my soup passing in h1 element and itemprop attribute
title = soup.find('h1', itemprop='headline').text
print(title)
print(type(title))
# I will use the find method on my soup passing in the div element and itemprop attribute
text = soup.find('div', itemprop='text').text
print(text[:250])
print(type(text))
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
'https://codeup.com/data-science-myths/',
'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']
def get_blog_articles(urls, cache=False):
'''
This function takes in a list of Codeup Blog urls and a parameter
with default cache == False which returns a df from a csv file.
If cache == True, the function scrapes the title and text for each url,
creates a list of dictionaries with the title and text for each blog,
converts list to df, and returns df.
'''
if cache == False:
df = pd.read_csv('big_blogs.csv', index_col=0)
else:
headers = {'User-Agent': 'Codeup Bayes Data Science'}
# Create an empty list to hold dictionaries
articles = []
# Loop through each url in our list of urls
for url in urls:
# get request to each url saved in response
response = get(url, headers=headers)
# Create soup object from response text and parse
soup = BeautifulSoup(response.text, 'html.parser')
# Save the title of each blog in variable title
title = soup.find('h1', itemprop='headline').text
# Save the text in each blog to variable text
text = soup.find('div', itemprop='text').text
# Create a dictionary holding the title and text for each blog
article = {'title': title, 'content': text}
# Add each dictionary to the articles list of dictionaries
articles.append(article)
# convert our list of dictionaries to a df
df = pd.DataFrame(articles)
# Write df to csv file for faster access
df.to_csv('big_blogs.csv')
return df
# Here cache == True, so the function will do a fresh scrape of the urls
blogs = get_blog_articles(urls=urls, cache=True)
blogs
# I'm going to hit Codeup's main blog page to scrape the urls
url = 'https://codeup.com/resources/#blog'
headers = {'User-Agent': 'Codeup Data Science'}
# Request the HTML
response = get(url, headers=headers)
# Create the soup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# I'm using the `a` element with class_ to get a list of tag elements from my soup object
link_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
link_list[:2]
# Using find_all has returned a bs ResultSet with 99 bs tags inside
print(f'Our variable link_list is a {type(link_list)}.')
print(f'Our element ResultSet is made up of {type(link_list[0])}.')
print(f'Our ResultSet contains {len(link_list)} element tags.')
# Create empty urls list and for each tag above, grab the href/link
# Add each link to the urls list
urls = []
for link in link_list:
urls.append(link['href'])
# Wow, 99 links! Ready to scrape titles and text from each
print(len(urls))
urls[:10]
def get_all_urls():
'''
This function scrapes all of the Codeup blog urls from
the main Codeup blog page and returns a list of urls.
'''
# The main Codeup blog page with all the urls
url = 'https://codeup.com/resources/#blog'
headers = {'User-Agent': 'Codeup Data Science'}
# Send request to main page and get response
response = get(url, headers=headers)
# Create soup object using response
soup = BeautifulSoup(response.text, 'html.parser')
# Create empty list to hold the urls for all blogs
urls = []
# Create a list of the element tags that hold the href/links
link_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
# get the href/link from each element tag in my list
for link in link_list:
# Add the link to my urls list
urls.append(link['href'])
return urls
# Now I can use my same function with my new urls list function!
# cache == True does a fresh scrape.
big_blogs = get_blog_articles(urls=get_all_urls(), cache=True)
big_blogs.head()
big_blogs.info()
# cache == False reads in a df from the `big_blogs.csv`.
big_blogs = get_blog_articles(urls=get_all_urls(), cache=False)
big_blogs.head()
Goal: Write a function that scrapes the news articles from inshorts.com for the following topics: business, sports, technology, and entertainment.
url = 'https://inshorts.com/en/read/entertainment'
response = get(url)
response.ok
soup = BeautifulSoup(response.text, 'html.parser')
# Scrape a ResultSet of all the news cards on the page and look at first card
cards = soup.find_all('div', class_='news-card')
print(type(cards))
cards[0]
# Save the title of each news card to list titles
titles = []
for card in cards:
title = card.find('span', itemprop='headline').text
titles.append(title)
titles[:5]
# Save the author of the news card to list authors
authors = []
for card in cards:
author = card.find('span', class_='author').text
authors.append(author)
authors[:5]
# Save the text of each article to a list of texts
texts = []
for card in cards:
text = card.find('div', itemprop='articleBody').text
texts.append(text)
texts[:2]
# Create an empty list, articles, to hold the dictionaries for each article
articles = []
# Loop through each news card on the page and get what we want
for card in cards:
title = card.find('span', itemprop='headline' ).text
author = card.find('span', class_='author').text
content = card.find('div', itemprop='articleBody').text
# Create a dictionary, article, for each news card
article = {'title': title, 'author': author, 'content': content}
# Add the dictionary, article, to our list of dictionaries, articles.
articles.append(article)
# Here we see our list contains 24-25 dictionaries for news cards
print(len(articles))
articles[:2]
def get_news_articles(cache=False):
'''
This function uses a cache parameter with default cache == False to give the option of
returning a df of inshorts topics and info by reading a csv file or
of doing a fresh scrape of inshorts pages with topics business, sports, technology,
and entertainment and writing the returned df to a csv file.
'''
# default to read in a csv instead of scrape for df
if cache == False:
df = pd.read_csv('articles.csv', index_col=0)
# cache == True completes a fresh scrape for df
else:
# Set base_url and headers that will be used in get request
base_url = 'https://inshorts.com/en/read/'
headers = {'User-Agent': 'Codeup Data Science'}
# List of topics to scrape
topics = ['business', 'sports', 'technology', 'entertainment']
# Create an empty list, articles, to hold our dictionaries
articles = []
for topic in topics:
# Get a response object from the main inshorts page
response = get(base_url + topic, headers=headers)
# Create soup object using response from inshort
soup = BeautifulSoup(response.text, 'html.parser')
# Scrape a ResultSet of all the news cards on the page
cards = soup.find_all('div', class_='news-card')
# Loop through each news card on the page and get what we want
for card in cards:
title = card.find('span', itemprop='headline' ).text
author = card.find('span', class_='author').text
content = card.find('div', itemprop='articleBody').text
# Create a dictionary, article, for each news card
article = ({'topic': topic,
'title': title,
'author': author,
'content': content})
# Add the dictionary, article, to our list of dictionaries, articles.
articles.append(article)
# Why not return it as a DataFrame?!
df = pd.DataFrame(articles)
# Write df to csv for future use
df.to_csv('articles.csv')
return df
# Test our function with cache == True to do a fresh scrape and write to `articles.csv`
df = get_news_articles(cache=True)
df.head()
df.topic.value_counts()
df.info()
# Test our function to read in the df from `articles.csv`
df = get_news_articles(cache=False)
df.head()
df.info()