In [1]:
import pandas as pd
import numpy as np

from requests import get
import re
from bs4 import BeautifulSoup
import os

Beautiful Soup - Web Scraping

What is Beautiful Soup?

Beautiful Soup is a Python library for scraping information from web pages. When web scraping, you have to be careful to make sure the site allows the practice. You can check by appending /robots.txt to the base url of a site. Using the headers parameter in your request to identify yourself is also part of web scraping best practice.
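
For example, here is a quick way to check a site's policy and to identify your request (a minimal sketch; example.com is a placeholder domain):

from requests import get

# Check the site's scraping rules before you scrape
robots = get('https://example.com/robots.txt')
print(robots.text)

# Identify yourself with a User-Agent header
headers = {'User-Agent': 'my-scraping-project'}
response = get('https://example.com', headers=headers)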

Check out the docs for BeautifulSoup here. I also found it very useful to code along with some of the articles and tutorials online to get a feel for scraping.

  • Simple but useful web scraping with Beautiful Soup article.

  • Dataquest tutorial from curriculum and intro Dataquest tutorial.


So What Do I Do With Beautiful Soup?

Here, we are looking to retrieve content from a web page, but the web page is written in HTML (HyperText Markup Language), so we will use the requests library to get a response with the HTML from our desired page and BeautifulSoup to parse the HTML response. As you begin scraping, it would be helpful to have a basic understanding of the different HTML elements and attributes used to create web pages.

HTML Tree Diagram

[Image: HTML tree diagram]

HTML Elements

An HTML element consists of a start tag and an end tag along with the content between the tags. For example:

<div>content...content</div>

HTML elements can be nested or contain other elements. For example:

<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

<html> tags define the html element --- the whole document.

<body> tags define the body element --- the document body.

<h1> to <h6> tags define a heading element --- a heading.

  • (<h1> - <h6>, largest to smallest heading size)

<p> tags define a paragraph element --- a new paragraph of text.

<a> tags define an anchor element, which tells the browser to render a hyperlink to a web page, file, email address, etc. Anchor elements use the href attribute to tell the link where to go.

<a href='url_of_link'>Text of the link</a>

<div> tags define a division element, like a container; it is used to group and style block-level content using the class or id attributes (defined below).

<span> element is also a container, like the <div> element above, but it is used for styling inline content instead of block-level content.
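
For example:

<span class='highlight'>some inline text</span>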

<img> element defines an image and uses the src attribute to hold the image address. The <img> tag is self-closing, which means it doesn't need a closing tag.

HTML or Tag Attributes

These are optional and appear inside of the opening tag, usually as name/value pairs (name='value'), but they make HTML elements easier to work with because they give the elements names. You will have to examine a web page to find out if it uses these attributes. For example, let's add a class attribute to our <div> element from above.

<div class='descriptive_class_name'>content...content</div>

class is an attribute of an HTML element that applies the same styles to all tags with the same class. One element can have multiple classes, and different elements can share the same classes, so classes cannot be used as unique identifiers.

id is an attribute of an HTML element. Each element can only have one id, so they can be used as unique identifiers.
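
For example:

<div id='unique_div_name'>content...content</div>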

itemprop is an attribute that consists of a name-value pair and is used to add properties to an element.
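
For example, the Codeup blog page we scrape below marks its headline this way:

<h1 itemprop='headline'>Title of the Post</h1>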

href is an attribute of an <a> element that contains the link address.

<a href="destination.com"></a>

src is an attribute of an <img> element that contains the address for an image. I can size my image using the width= and height= attributes, as well, if I like.

<img src="img_name.jpg" width="500" height="600">

CSS Selectors
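
BeautifulSoup also accepts CSS selectors through its select() and select_one() methods, which return all matching elements or just the first match, respectively. A minimal sketch, assuming a soup object already exists:

# All <a> elements with the class 'clickable'
links = soup.select('a.clickable')

# The first <h1> element whose itemprop attribute equals 'headline'
headline = soup.select_one("h1[itemprop='headline']")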

Now What?

We will need to use the requests library to retrieve the HTML from a web page we want to scrape. You can review how to use the requests library in my notebook here.

Next, we will inspect the structure of the web page by right-clicking on the page we want to scrape and clicking Inspect. By clicking the icon in the far upper left of the new window, we can move our cursor over the part of the web page we want to scrape and see the responsible HTML code for that section highlighted on the right.

BeautifulSoup Methods

We can use HTML tags, CSS class (class_=''), Regex patterns, CSS selectors, and more with BeautifulSoup search methods to retrieve the information we want. For example:

# Create our soup object by using BeautifulSoup to parse the response string returned by the get() method from the requests library.

from requests import get
from bs4 import BeautifulSoup

response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the first instance of the specific tag_name.
# param name -> A filter on tag name. Default: name=None
# param attrs -> A dictionary of filters on attribute values. Default: attrs={}

soup.find(name, attrs)

# Extract all of the instances of the specific tag_name.

soup.find_all(name, attrs)

# Return a dictionary of all attributes of this tag.

tag.attrs

# Return all the text in this tag.

tag.text

# Return a list of all children elements of this tag.

tag.contents
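
For example, these two calls are equivalent ways to filter on a CSS class, and a compiled regex can filter attribute values (a sketch assuming the page has <p class='intro'> elements):

import re

soup.find_all('p', class_='intro')
soup.find_all('p', attrs={'class': 'intro'})

# All <a> elements whose href value starts with https
soup.find_all('a', href=re.compile('^https'))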

You can find more about filtering HTML with BeautifulSoup search methods here.

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} 
    
response = get(url, headers=headers)
response.ok
Out[2]:
True
In [3]:
# Here's our long string; we'll use this to make our soup object

print(type(response.text))
<class 'str'>
In [5]:
# Use BeautifulSoup using our response string

soup = BeautifulSoup(response.text, 'html.parser')

# Now that we have our BeautifulSoup object, we can use its built-in methods and properties

print(type(soup))
<class 'bs4.BeautifulSoup'>

Codeup Blogs

Goals: Write a function to scrape urls from the main Codeup blog web page, and write a function that returns the title and text for each blog page.

Grab Title from Page

Here I use the .find() method on my soup with the <h1> tag and its itemprop attribute equal to headline. As always, there is no one way to accomplish our task, so I'm demonstrating one way to scrape the headline, not THE way to scrape the headline.
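
For instance, passing the attribute in an attrs dictionary grabs the same element:

title = soup.find('h1', attrs={'itemprop': 'headline'}).text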

In [6]:
# This is what the h1 element contains. I want to access the itemprop headline

soup.find('h1')
Out[6]:
<h1 class="jupiterx-post-title" itemprop="headline">Codeup’s Data Science Career Accelerator is Here!</h1>
In [31]:
# I will use the find method on my soup passing in h1 element and itemprop attribute

title = soup.find('h1', itemprop='headline').text
print(title)
Codeup’s Data Science Career Accelerator is Here!
In [32]:
print(type(title))
<class 'str'>

Grab Text from Page

In [33]:
# I will use the find method on my soup passing in the div element and itemprop attribute

text = soup.find('div', itemprop='text').text
print(text[:250])
The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Gla
In [34]:
print(type(text))
<class 'str'>

Build Blog Function

In [35]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

def get_blog_articles(urls, cache=False):
    '''
    This function takes in a list of Codeup blog urls and a cache parameter
    with default cache == False, which returns a df read from a csv file.
    If cache == True, the function scrapes the title and text for each url,
    creates a list of dictionaries with the title and text for each blog,
    converts the list to a df, and returns the df.
    '''
    if cache == False:
        df = pd.read_csv('big_blogs.csv', index_col=0)
    else:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} 

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # get request to each url saved in response
            response = get(url, headers=headers)

            # Create soup object from response text and parse
            soup = BeautifulSoup(response.text, 'html.parser')

            # Save the title of each blog in variable title
            title = soup.find('h1', itemprop='headline').text

            # Save the text in each blog to variable text
            text = soup.find('div', itemprop='text').text

            # Create a dictionary holding the title and text for each blog
            article = {'title': title, 'content': text}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
            
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to csv file for faster access
        df.to_csv('big_blogs.csv')
    
    return df

Test Function

In [37]:
# Here cache == True, so the function will do a fresh scrape of the urls

blogs = get_blog_articles(urls=urls, cache=True)
blogs
Out[37]:
title content
0 Codeup’s Data Science Career Accelerator is Here! The rumors are true! The time has arrived. Cod...
1 Data Science Myths By Dimitri Antoniou and Maggie GiustData Scien...
2 Data Science VS Data Analytics: What’s The Dif... By Dimitri AntoniouA week ago, Codeup launched...
3 10 Tips to Crush It at the SA Tech Job Fair 10 Tips to Crush It at the SA Tech Job FairSA ...
4 Competitor Bootcamps Are Closing. Is the Model... Competitor Bootcamps Are Closing. Is the Model...

Bonus URL Scrape

In [36]:
# I'm going to hit Codeup's main blog page to scrape the urls

url = 'https://codeup.com/resources/#blog'
headers = {'User-Agent': 'Codeup Data Science'} 

# Request the HTML
response = get(url, headers=headers)

# Create the soup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
In [37]:
# I'm using the `a` element with class_ to get a list of tag elements from my soup object

link_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
link_list[:2]
Out[37]:
[<a class="jet-listing-dynamic-link__link" href="https://codeup.com/bootcamp-to-bootcamp/"><span class="jet-listing-dynamic-link__label">Read More</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/how-to-get-started-on-a-programming-exercise/"><span class="jet-listing-dynamic-link__label">Read More</span></a>]
In [43]:
# Using find_all has returned a bs ResultSet with 99 bs tags inside

print(f'Our variable link_list is a {type(link_list)}.')
print(f'Our element ResultSet is made up of {type(link_list[0])}.')
print(f'Our ResultSet contains {len(link_list)} element tags.')
Our variable link_list is a <class 'bs4.element.ResultSet'>.
Our element ResultSet is made up of <class 'bs4.element.Tag'>.
Our ResultSet contains 99 element tags.
In [44]:
# Create empty urls list and for each tag above, grab the href/link
# Add each link to the urls list

urls = []
for link in link_list:
    urls.append(link['href'])
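
Equivalently, link.get('href') returns None instead of raising a KeyError when an attribute is missing, and a list comprehension does the same job in one line:

urls = [link.get('href') for link in link_list]
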
In [45]:
# Wow, 99 links! Ready to scrape titles and text from each

print(len(urls))
urls[:10]
99
Out[45]:
['https://codeup.com/bootcamp-to-bootcamp/',
 'https://codeup.com/how-to-get-started-on-a-programming-exercise/',
 'https://codeup.com/career-in-data-science/',
 'https://codeup.com/getting-hired-in-a-remote-environment/',
 'https://codeup.com/codeup-remote-students/',
 'https://codeup.com/covid-relief/',
 'https://codeup.com/discovering-my-passion-through-codeup/',
 'https://codeup.com/covid-19/',
 'https://codeup.com/15-tips-for-virtual-interview-and-meetings/',
 'https://codeup.com/setting-myself-up-for-success-at-codeup/']

Bonus URL Function

In [17]:
def get_all_urls():
    '''
    This function scrapes all of the Codeup blog urls from
    the main Codeup blog page and returns a list of urls.
    '''
    # The main Codeup blog page with all the urls
    url = 'https://codeup.com/resources/#blog'
    
    headers = {'User-Agent': 'Codeup Data Science'} 
    
    # Send request to main page and get response
    response = get(url, headers=headers)
    
    # Create soup object using response
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Create empty list to hold the urls for all blogs
    urls = []
    
    # Create a list of the element tags that hold the href/links
    link_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    
    # get the href/link from each element tag in my list
    for link in link_list:
        
        # Add the link to my urls list
        urls.append(link['href'])
        
    return urls
In [18]:
# Now I can use my same function with my new urls list function!
# cache == True does a fresh scrape.

big_blogs = get_blog_articles(urls=get_all_urls(), cache=True)
In [19]:
big_blogs.head()
Out[19]:
title content
0 From Bootcamp to Bootcamp: Two Military Vetera... Are you a veteran or active-duty military memb...
1 How to Get Started On Any Programming Exercise Programming is hard. Whether you’re just begin...
2 The Best Path to a Career in Data Science In our blog, “The Best Path To A Career In Sof...
3 Getting Hired in a Remote Environment As a career accelerator with a tuition refund ...
4 The Remote Codeup Student Experience Communities across Texas have now lived in a r...
In [20]:
big_blogs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    99 non-null     object
 1   content  99 non-null     object
dtypes: object(2)
memory usage: 1.7+ KB
In [21]:
# cache == False reads in a df from the `big_blogs.csv`.

big_blogs = get_blog_articles(urls=get_all_urls(), cache=False)
big_blogs.head()
Out[21]:
title content
0 From Bootcamp to Bootcamp: Two Military Vetera... Are you a veteran or active-duty military memb...
1 How to Get Started On Any Programming Exercise Programming is hard. Whether you’re just begin...
2 The Best Path to a Career in Data Science In our blog, “The Best Path To A Career In Sof...
3 Getting Hired in a Remote Environment As a career accelerator with a tuition refund ...
4 The Remote Codeup Student Experience Communities across Texas have now lived in a r...

Inshorts News Articles

Goal: Write a function that scrapes the news articles for the following topics:

  • Business
  • Sports
  • Technology
  • Entertainment
In [3]:
url = 'https://inshorts.com/en/read/entertainment'

response = get(url)
response.ok
Out[3]:
True
In [4]:
soup = BeautifulSoup(response.text, 'html.parser')

Scrape News Cards from Main Page

In [24]:
# Scrape a ResultSet of all the news cards on the page and look at first card

cards = soup.find_all('div', class_='news-card')
print(type(cards))
cards[0]
<class 'bs4.element.ResultSet'>
Out[24]:
<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/prithviraj-shares-pic-of-transformation-after-having-dangerously-low-fat-percentage-1590492948306" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Daisy Mowke" itemprop="name"></span>
</span>
<span content="Prithviraj shares pic of transformation after having 'dangerously low fat percentage'" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/05_may/26_tue/img_1590491282323_173.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span content="https://inshorts.com/" itemprop="url"></span>
<span content="Inshorts" itemprop="name"></span>
<span itemprop="logo" itemscope="" itemtype="https://schema.org/ImageObject">
<span content="https://assets.inshorts.com/inshorts/images/v1/variants/jpg/m/2018/11_nov/21_wed/img_1542823931298_497.jpg" itemprop="url"></span>
<meta content="400" itemprop="width"/>
<meta content="60" itemprop="height"/>
</span>
</span>
<div class="news-card-image" style="background-image: url('https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/05_may/26_tue/img_1590491282323_173.jpg?')">
</div>
<div class="news-card-title news-right-box">
<a class="clickable" href="/en/news/prithviraj-shares-pic-of-transformation-after-having-dangerously-low-fat-percentage-1590492948306" onclick="ga('send', {'hitType': 'event', 'eventCategory': 'TitleOfNews', 'eventAction': 'clicked', 'eventLabel': 'Prithviraj%20shares%20pic%20of%20transformation%20after%20having%20'dangerously%20low%20fat%20percentage')' });" style="color:#44444d!important">
<span itemprop="headline">Prithviraj shares pic of transformation after having 'dangerously low fat percentage'</span>
</a>
<div class="news-card-author-time news-card-author-time-in-title">
<a href="/prev/en/news/prithviraj-shares-pic-of-transformation-after-having-dangerously-low-fat-percentage-1590492948306"><span class="short">short</span></a> by <span class="author">Daisy Mowke</span> / 
      <span class="time" content="2020-05-26T11:35:48.000Z" itemprop="datePublished">05:05 pm</span> on <span clas="date">26 May 2020,Tuesday</span>
</div>
</div>
<div class="news-card-content news-right-box">
<div itemprop="articleBody">South Indian actor Prithviraj Sukumaran today shared a picture of his physical transformation. "One month since we finished the last of...bare body scenes for 'Aadujeevitham'. On the last day, I had dangerously low fat percentage and visceral fat levels," he wrote. Prithviraj, who was stranded in Jordan with the film crew for almost three months, returned to Kochi on Friday.</div>
<div class="news-card-author-time news-card-author-time-in-content">
<a href="/prev/en/news/prithviraj-shares-pic-of-transformation-after-having-dangerously-low-fat-percentage-1590492948306"><span class="short">short</span></a> by <span class="author">Daisy Mowke</span> / 
      <span class="time" content="2020-05-26T11:35:48.000Z" itemprop="dateModified">05:05 pm</span> on <span class="date">26 May</span>
</div>
</div>
<div class="news-card-footer news-right-box">
<div class="read-more">read more at <a class="source" href="https://www.instagram.com/p/CApLZkbA6MN/?utm_campaign=fullarticle&amp;utm_medium=referral&amp;utm_source=inshorts " onclick="ga('send', {'hitType': 'event', 'eventCategory': 'ReadMore', 'eventAction': 'clicked', 'eventLabel': 'Instagram' });" target="_blank">Instagram</a></div>
</div>
</div>

Scrape the Title from Each News Card

In [25]:
# Save the title of each news card to list titles

titles = []
for card in cards:
    title = card.find('span', itemprop='headline').text
    titles.append(title)
    
titles[:5]
Out[25]:
["Prithviraj shares pic of transformation after having 'dangerously low fat percentage'",
 'Akshay Kumar resumes outdoor shooting amid lockdown; pics from set surface online',
 'Karan Johar confirms 2 house helps tested COVID-19 +ve, says he tested -ve',
 "Actress Preksha Mehta commits suicide at 25, wrote 'Death of dreams' in Insta story",
 "Nolan crashed a real plane into a real building in 'Tenet': Actor John Washington"]

Scrape Author from News Cards

In [26]:
# Save the author of the news card to list authors

authors = []
for card in cards:
    author = card.find('span', class_='author').text
    authors.append(author)
    
authors[:5]
Out[26]:
['Daisy Mowke', 'Daisy Mowke', 'Daisy Mowke', 'Daisy Mowke', 'Daisy Mowke']

Scrape Text from News Cards

In [27]:
# Save the text of each article to a list of texts

texts = []
for card in cards:
    text = card.find('div', itemprop='articleBody').text
    texts.append(text)
    
texts[:2]
Out[27]:
['South Indian actor Prithviraj Sukumaran today shared a picture of his physical transformation. "One month since we finished the last of...bare body scenes for \'Aadujeevitham\'. On the last day, I had dangerously low fat percentage and visceral fat levels," he wrote. Prithviraj, who was stranded in Jordan with the film crew for almost three months, returned to Kochi on Friday.',
 'Akshay Kumar has become the first Bollywood actor to shoot on outdoor location amid lockdown. He shot for a project with director R Balki. Several pictures and videos from the shoot have surfaced on social media in which the team, including Akshay and Balki, are seen wearing masks. They can also be seen maintaining social distancing.']
In [28]:
# Create an empty list, articles, to hold the dictionaries for each article
articles = []

# Loop through each news card on the page and get what we want
for card in cards:
    title = card.find('span', itemprop='headline').text
    author = card.find('span', class_='author').text
    content = card.find('div', itemprop='articleBody').text
    
    # Create a dictionary, article, for each news card
    article = {'title': title, 'author': author, 'content': content}
    
    # Add the dictionary, article, to our list of dictionaries, articles.
    articles.append(article)
In [29]:
# Here we see our list contains one dictionary per news card: 25 on this page (the count can vary slightly by page)

print(len(articles))
articles[:2]
25
Out[29]:
[{'title': "Prithviraj shares pic of transformation after having 'dangerously low fat percentage'",
  'author': 'Daisy Mowke',
  'content': 'South Indian actor Prithviraj Sukumaran today shared a picture of his physical transformation. "One month since we finished the last of...bare body scenes for \'Aadujeevitham\'. On the last day, I had dangerously low fat percentage and visceral fat levels," he wrote. Prithviraj, who was stranded in Jordan with the film crew for almost three months, returned to Kochi on Friday.'},
 {'title': 'Akshay Kumar resumes outdoor shooting amid lockdown; pics from set surface online',
  'author': 'Daisy Mowke',
  'content': 'Akshay Kumar has become the first Bollywood actor to shoot on outdoor location amid lockdown. He shot for a project with director R Balki. Several pictures and videos from the shoot have surfaced on social media in which the team, including Akshay and Balki, are seen wearing masks. They can also be seen maintaining social distancing.'}]

Build Article Function

In [30]:
def get_news_articles(cache=False):
    '''
    This function uses a cache parameter with default cache == False to give
    the option of returning a df of inshorts topics and info by reading a csv
    file, or of doing a fresh scrape of the inshorts pages for the topics
    business, sports, technology, and entertainment, writing the returned df
    to a csv file.
    '''
    # default to read in a csv instead of scrape for df
    if cache == False:
        df = pd.read_csv('articles.csv', index_col=0)
        
    # cache == True completes a fresh scrape for df    
    else:
    
        # Set base_url and headers that will be used in get request

        base_url = 'https://inshorts.com/en/read/'
        headers = {'User-Agent': 'Codeup Data Science'}
        
        # List of topics to scrape
        topics = ['business', 'sports', 'technology', 'entertainment']

        # Create an empty list, articles, to hold our dictionaries
        articles = []

        for topic in topics:

            # Get a response object from the main inshorts page
            response = get(base_url + topic, headers=headers)

            # Create soup object using response from inshort
            soup = BeautifulSoup(response.text, 'html.parser')

            # Scrape a ResultSet of all the news cards on the page
            cards = soup.find_all('div', class_='news-card')

            # Loop through each news card on the page and get what we want
            for card in cards:
                title = card.find('span', itemprop='headline').text
                author = card.find('span', class_='author').text
                content = card.find('div', itemprop='articleBody').text

                # Create a dictionary, article, for each news card
                article = ({'topic': topic, 
                            'title': title, 
                            'author': author, 
                            'content': content})

                # Add the dictionary, article, to our list of dictionaries, articles.
                articles.append(article)
            
        # Why not return it as a DataFrame?!
        df = pd.DataFrame(articles)
        
        # Write df to csv for future use
        df.to_csv('articles.csv')
    
    return df
In [31]:
# Test our function with cache == True to do a fresh scrape and write to `articles.csv`

df = get_news_articles(cache=True)
df.head()
Out[31]:
topic title author content
0 business Firm whose stock surged 1000% in 2020 starts h... Krishna Veera Vanamali US biotech company Novavax said it has started...
1 business India's economic growth seen at 1.2% in Q4 FY2... Dharna India's economy is estimated to have grown at ...
2 business TVS Motor cuts employees' salaries by up to 20... Dharna TVS Motor Company has said it is cutting the s...
3 business Lockdown extensions won't help, cases will con... Anushka Dixit Mahindra Group Chairman Anand Mahindra said th...
4 business Uber India fires 600 employees reducing 25% of... Dharna Uber is firing 600 employees in India, or 25% ...
In [32]:
df.topic.value_counts()
Out[32]:
sports           25
business         25
entertainment    25
technology       24
Name: topic, dtype: int64
In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    99 non-null     object
 1   title    99 non-null     object
 2   author   99 non-null     object
 3   content  99 non-null     object
dtypes: object(4)
memory usage: 3.2+ KB
In [34]:
# Test our function to read in the df from `articles.csv`

df = get_news_articles(cache=False)
df.head()
Out[34]:
topic title author content
0 business Firm whose stock surged 1000% in 2020 starts h... Krishna Veera Vanamali US biotech company Novavax said it has started...
1 business India's economic growth seen at 1.2% in Q4 FY2... Dharna India's economy is estimated to have grown at ...
2 business TVS Motor cuts employees' salaries by up to 20... Dharna TVS Motor Company has said it is cutting the s...
3 business Lockdown extensions won't help, cases will con... Anushka Dixit Mahindra Group Chairman Anand Mahindra said th...
4 business Uber India fires 600 employees reducing 25% of... Dharna Uber is firing 600 employees in India, or 25% ...
In [35]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 98
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    99 non-null     object
 1   title    99 non-null     object
 2   author   99 non-null     object
 3   content  99 non-null     object
dtypes: object(4)
memory usage: 3.9+ KB