Scraping Gold Price using Python
Web Scraping Project End to End
What is Web Scraping?
Web Scraping is the process through which we extract data from a website and save it in a form that is easy to read, understand and work with.
When we say 'easy to work with', we mean that the extracted data can be used to gain useful insights and answer many questions, answers that would not be easy to find if we did not have the data stored in a simple, sorted manner, i.e. generally in an Excel file or a CSV file.
How does Web Scraping Work?
In order to understand web scraping, it's important to first understand that web pages are built using text-based markup languages, the most common of which is HTML. The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.
Note: Not all websites allow web scraping, especially when it comes to users' personal data, so we always have to make sure that we are not publishing any personal user data online. Websites also usually have protections in place: if they detect that we are downloading a large amount of data, they may block us from accessing the website.
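One polite way to check what a site permits is its robots.txt file. Below is a minimal sketch (not part of the original project) using Python's built-in urllib.robotparser; the URLs shown are simply the ones we scrape later in this project:
from urllib.robotparser import RobotFileParser
#Check whether the site's robots.txt allows us to fetch a given page
rp = RobotFileParser("https://www.goodreturns.in/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.goodreturns.in/gold-rates/mumbai.html"))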
Project Idea
In this project, we will parse the Good Returns website to get details about gold rates in the top cities of India.
We will retrieve information from the page 'Gold Rate in Top Cities of India' using web scraping: the process of programmatically retrieving information from the web. Web scraping is not magic; people gather information from the web by hand every day. For example, a recent graduate might copy and paste information about the companies they are applying to into a job-application spreadsheet. A scraper simply automates this kind of work.
Project Goal
The goal of the project is to create a web scraper that collects all the required information and compiles it into a CSV file.
Project steps
Here is an outline of the steps we'll follow:
- Download the webpage using requests
- Parse the HTML source code using the BeautifulSoup library and extract the desired information
- Build the scraper components
- Compile the extracted information into Python lists and dictionaries
- Convert the Python dictionaries into Pandas DataFrames
- Write the information to the final CSV file
- Future work and references
Resources
- Data Source: Gold Price
- Packages Used: Requests, BeautifulSoup4, Pandas
Let's begin!
What is requests?
Requests is a Python HTTP library that allows us to send HTTP requests to the servers of websites, instead of using a browser to communicate with the web.
We use pip, a package-management system, to install and manage software. Since the platform we selected is Binder, we have to type a line of code, !pip install, to install requests. You will see !pip often when installing other packages. When we want to use prewritten functions from a library, we use the import statement; e.g. after typing import requests following the installation, we are able to use any function from the requests library.
!pip install requests --quiet --upgrade
import requests
requests.get()
In order to download a web page, we use requests.get() to send an HTTP request to the Good Returns server. The function returns a response object, which is the HTTP response.
city_name = 'mumbai'
#The URL Address of the webpage we will scrape, i.e. Gold Rate for Mumbai City
gold_rate_url = 'https://www.goodreturns.in/gold-rates/'+city_name+'.html'
response = requests.get(gold_rate_url) #requests.get()
Status code
Now, we have to check whether we successfully sent the HTTP request and got an HTTP response back. This is necessary because we're NOT using a browser, so we can't see the feedback directly if the request failed. In general, the way to check whether the server sent an HTTP response back is the status code. In the requests library, requests.get returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status. If the request was successful, response.status_code is set to a value between 200 and 299.
response.status_code #Checking the status code: 200-299 means the request was successful
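As an aside, the requests library can also perform this check for us: response.raise_for_status() raises an exception whenever the status code indicates an error, which saves the manual comparison.
#Alternative to checking manually: raises requests.HTTPError for 4xx/5xx responses
response.raise_for_status()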
The HTTP response contains HTML that is ready to be displayed in a browser. We can use response.text to retrieve the HTML document.
page_contents = response.text
len(page_contents) #The `len` function tells us the length of the response text
HOORAY! We have downloaded ~4.17 lakh (about 417,000) characters of HTML in just a second
page_contents[:500] #This displays the first 500 characters of `page_contents`
- What we see above is the source code of the web page. It is written in a language called HTML.
- It defines the content and structure of the web page, which browsers like Chrome then render and display.
with open("gold_price_mumbai.html", 'w') as f: #Writing the html page to a file locally, i.e. a replica of the real html page
    f.write(page_contents)
Here, we save the text that we received into an HTML file using the open statement. An HTML file named gold_price_mumbai.html has now been created.
Parse the HTML source code using Beautiful Soup library
What is Beautiful Soup?
Beautiful Soup is a Python package for parsing HTML and XML documents. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It’s a handy tool when it comes to web scraping. You can read more on their documentation site. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help
To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let’s install the library and import the BeautifulSoup class from the bs4 module.
!pip install beautifulsoup4 --quiet --upgrade
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')
#Now 'doc' contains entire html in parsed format
type(doc)
Inspecting the HTML source code of a web page
In the Beautiful Soup library, we specify html.parser to ask Python to parse the page into its components, instead of reading it as one long string.
What is HTML?
Before we dive into how to inspect HTML, we should cover some basic knowledge about it.
As covered earlier, HTML, the HyperText Markup Language, is the standard markup language for documents designed to be displayed in a web browser; it can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.
An HTML tag comprises three parts:
- Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
- Attributes: (href, target, class, id, etc.) Properties of the tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
- Children: A tag can contain some text, other tags, or both between the opening and closing segments, e.g., <div>Some content</div>. An example follows below.
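To make these three parts concrete, here is a small sketch (using a made-up HTML snippet, not the Good Returns page) that parses a single tag with BeautifulSoup and prints its name, attributes and text content:
from bs4 import BeautifulSoup
snippet = '<a href="https://example.com" class="nav-link">Gold Rates</a>' #hypothetical snippet
tag = BeautifulSoup(snippet, 'html.parser').find('a')
print(tag.name)   #Name: 'a'
print(tag.attrs)  #Attributes: {'href': 'https://example.com', 'class': ['nav-link']}
print(tag.text)   #Children (text content): 'Gold Rates'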
Common tags and attributes
1. Tags in HTML
There are around 100 types of HTML tags, but on a day-to-day basis around 15 to 20 of them cover the most common uses, such as the <div>, <p>, <section>, <img> and <a> tags. Of these, the <a> tag is worth highlighting: it can carry attributes such as href (hyperlink reference), and it allows users to click through to another page. That is why the <a> tag is called an anchor.
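For instance, we can pull the href attribute out of the first few <a> tags on the page we downloaded earlier (doc is the BeautifulSoup object we created above):
#List the destinations of the first five links on the page
for link in doc.find_all('a')[:5]:
    print(link.get('href'))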
2. Attributes
Each tag supports several attributes. Following are some common attributes used to modify the behaviour of tags:
- id
- style
- class
- href (used with <a>)
- src (used with <img>)
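These attributes are also what we search by in Beautiful Soup. As a quick illustration on our page (class_ is BeautifulSoup's keyword for the HTML class attribute, since class is a reserved word in Python):
#Count the divs on the page that carry the gold price table's class
divs = doc.find_all('div', class_='gold_silver_table right-align-content')
print(len(divs))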
With a BeautifulSoup object, we can get a specific type of tag in the HTML simply by calling the tag's name, as shown in the code cell below.
Here, we use the find() function of BeautifulSoup to find the first <title> tag in the HTML document and display its content.
title = doc.find('title')
title
Inspecting HTML in the Browser
To view the source code of any webpage right within your browser, you can right-click anywhere on a page and select the "Inspect" option. This opens the "Developer Tools" mode, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.
As shown in the photo above, I hovered over "Today 22 Carat Gold Price Per Gram in Mumbai (INR)" to see how the content is presented. I found that the table sits inside a <div> tag, and I noted the specific class, class="gold_silver_table right-align-content", for this tag. Since we have downloaded a single page and turned it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to extract the pieces of information we want.
City Name
Now we will use BeautifulSoup to extract the Name, Gram, Price and Change in Price fields of the gold price table from the HTML page.
city_name = city_name.upper()
city_name
Gold Weight
table_div = doc.find("div", {"class": "gold_silver_table right-align-content"}) #Find the div that wraps the gold price table
table = table_div.find("table").find_all("tr") #Collect all the table rows
gram = table[1].find_all("td")[0].text #First data row, first column: the gold weight
gram
Today’s Price
today_price = table[1].find_all("td")[1].text.replace(",", "") #Second column: today's price, commas removed
today_price
Yesterday’s Price
yesterday_price = table[1].find_all("td")[2].text.replace(",", "") #Third column: yesterday's price, commas removed
yesterday_price
Daily Price Change
daily_price_change = table[1].find_all("td")[3].text.strip().replace(",", "") #Fourth column: price change, whitespace and commas removed
daily_price_change
Creating a DataFrame using Pandas for Lists derived till now
What is Pandas?
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
What is DataFrame?
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns. A DataFrame makes it easier for us to work with tabular data and analyse it.
!pip install pandas --quiet --upgrade #Installing Pandas Library
import pandas as pd
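Before building the real one, here is a tiny illustration (with made-up values) of how a DataFrame is constructed from a dictionary of lists, where each key becomes a column:
#Hypothetical two-row example, just to show the structure
example_df = pd.DataFrame({'City': ['MUMBAI', 'DELHI'], 'Price': [5000, 5050]})
example_df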
Now, we will create a Python dictionary with the City Name, Gold Weight, Today's Price, Yesterday's Price and Daily Price Change that we extracted above.
gold_price_dict = {'City' : [city_name],
'Gram' : [gram],
'Today Price' : [today_price],
'Yesterday Price' : [yesterday_price],
'Daily Price Change' : [daily_price_change]}
gold_price_dict
gold_price_df = pd.DataFrame(gold_price_dict) #Here we convert the dictionary into a Pandas DataFrame
Now, let us check the DataFrame that we have created, which contains the City Name, Gram, Today Price, Yesterday Price and Daily Price Change.
gold_price_df
We can see that the DataFrame contains the city_name and the other details, so we can be sure that we have extracted the required data. We have finally created a DataFrame which contains the required information.
Next Steps
Now, we will go into each individual city's page and extract the required information.
city_list = ["ahmedabad","bangalore","bhubaneswar","chandigarh","chennai",
"coimbatore","delhi","hyderabad","jaipur","kerala","kolkata",
"lucknow","madurai","mangalore","mumbai","mysore","nagpur",
"nashik","patna","pune","surat","vadodara","vijayawada","visakhapatnam"]
gold_price_list = []
for city_name in city_list:
    url = "https://www.goodreturns.in/gold-rates/" + city_name + ".html"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Url", url, "not found")
    content = response.text
    doc = BeautifulSoup(content, 'html.parser')
    div_22car = doc.find("div", {"class": "gold_silver_table right-align-content"})
    table = div_22car.find("table").find_all("tr")
    for data in range(1, len(table)): #Skip the header row at index 0
        city = city_name.upper()
        gram = table[data].find_all("td")[0].text
        today_price = table[data].find_all("td")[1].text.replace(",", "")
        yesterday_price = table[data].find_all("td")[2].text.replace(",", "")
        daily_price_change = table[data].find_all("td")[3].text.strip().replace(",", "")
        gold_price_list.append({'City': city,
                                'Gram': gram,
                                "Today's Price": today_price,
                                "Yesterday's Price": yesterday_price,
                                "Daily Price Change": daily_price_change})
Now we have all the required information for the gold price. Let us see what we have got.
gold_price_list
Now, we will write different functions to combine the details for any given day.
install_libraries() will install all the required libraries.
def install_libraries():
    !pip install jovian --upgrade --quiet
    !pip install requests --upgrade --quiet
    import requests
    !pip install beautifulsoup4 --upgrade --quiet
    from bs4 import BeautifulSoup
    from datetime import date
    import pandas as pd
    #Note: imports inside a function are local to it; the notebook still
    #relies on the top-level imports we did earlier.
    print("Successfully installed all the required Libraries")
get_doc() will take city_name as an input and provide us with the parsed HTML doc.
def get_doc(city_name):
    url = "https://www.goodreturns.in/gold-rates/" + city_name + ".html"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Url", url, "not found")
    content = response.text
    doc = BeautifulSoup(content, 'html.parser')
    return doc
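As a quick sanity check, we can call the function for a single city and look at the page title (the exact title text depends on the live site):
#FUNCTION CALL `get_doc()` for one city
doc_delhi = get_doc("delhi")
print(doc_delhi.find('title').text)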
append_carat_detail() will add the details for a particular carat of gold to the data.
def append_carat_detail(city_name, data, table, gold_price_list):
    #Read the carat type (e.g. "22 Carat") from the table's heading row
    carat = table[0].find_all("td", {"class": "heading"})[1].text[:8]
    city = city_name.upper()
    gram = table[data].find_all("td")[0].text
    today_price = table[data].find_all("td")[1].text.replace(",", "")
    yesterday_price = table[data].find_all("td")[2].text.replace(",", "")
    daily_price_change = table[data].find_all("td")[3].text.strip().replace(",", "")
    gold_price_list.append({'City': city,
                            'Carat': carat,
                            'Gram': gram,
                            "Today's Price": today_price,
                            "Yesterday's Price": yesterday_price,
                            "Daily Price Change": daily_price_change})
    return gold_price_list
list_of_dict() will return the data for all the cities.
def list_of_dict(city_list):
    gold_price_list = []
    for city_name in city_list:
        #FUNCTION CALL `get_doc()`
        doc = get_doc(city_name)
        div_car = doc.find_all("div", {"class": "gold_silver_table right-align-content"})
        table_22 = div_car[0].find("table").find_all("tr") #22 carat table
        table_24 = div_car[1].find("table").find_all("tr") #24 carat table
        for data in range(1, len(table_22)):
            gold_price_list = append_carat_detail(city_name, data, table_22, gold_price_list)
            gold_price_list = append_carat_detail(city_name, data, table_24, gold_price_list)
    return gold_price_list
list_gold_price_today() will give us a list of dictionaries containing all the required data.
def list_gold_price_today():
    city_list = ["ahmedabad", "bangalore", "bhubaneswar", "chandigarh", "chennai",
                 "coimbatore", "delhi", "hyderabad", "jaipur", "kerala", "kolkata",
                 "lucknow", "madurai", "mangalore", "mumbai", "mysore", "nagpur",
                 "nashik", "patna", "pune", "surat", "vadodara", "vijayawada", "visakhapatnam"]
    gold_price_list = list_of_dict(city_list)
    return gold_price_list
Let us test whether the function is working fine. To do this, we will call our function list_gold_price_today().
gold_price = list_gold_price_today() #To fetch the data for all the top cities in India
gold_price
Now that we have the information for all the cities, let us convert this list of dictionaries to a DataFrame, just like we did previously, to easily work with the tabular data using Pandas.
gold_price_today_df = pd.DataFrame(gold_price)
gold_price_today_df
Let us now save this DataFrame as a CSV file. For that, we will write a function write_csv().
def write_csv(items, path):
    with open(path, 'w', encoding="utf-8") as f:
        if len(items) == 0:
            return
        #Write the headers
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        #Write one row per dictionary
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + '\n')
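Writing write_csv() by hand is a good exercise, but note that Pandas already provides the same functionality: since we built gold_price_today_df above, the one-liner below would produce an equivalent file.
#Equivalent using Pandas; index=False skips the row-index column
gold_price_today_df.to_csv("todayprice.csv", index=False)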
Using the write_csv() function, we will write our gold_price data into a .csv file.
print("Converting dictionary to a csv file")
write_csv(gold_price,"todayprice.csv")
print("Successfully created the csv file ")
print("Converting dictionary to a csv file")
write_csv(gold_price,"todayprice.csv")
print("Successfully created the csv file ")
You can now see the todayprice.csv file under File --> Open.
Let's write a single function which will parse the data and also write it back to a CSV file.
def get_today_price(path=None):
    #FUNCTION CALL `install_libraries()`
    install_libraries()
    from datetime import date
    if path is None:
        path = date.today().strftime("%B-%d-%Y") + '-gold-price.csv'
    print("Writing Today's Gold Price to the file", path)
    #FUNCTION CALL `write_csv`
    write_csv(list_gold_price_today(), path)
    print("Today's Gold Price written to file", path)
    #Reading the csv file back
    print("Printing your csv file", path)
    return pd.read_csv(path)
Let us test whether the function is working fine. To do this, we will call our function get_today_price().
df = get_today_price()
df
Summary
Finally, we have managed to parse the Good Returns website to get our hands on very interesting and insightful data when it comes to finance and money.
We have saved all the information we could extract from that website in a CSV file, using which we can answer a lot of questions we may want to ask, e.g. Which city has the highest gold price in India? What was the price of gold in Mumbai on 10 July 2022?
Let us look at the steps that we took from start to finish:
(1) We downloaded the webpage using requests
(2) We parsed the HTML source code using the BeautifulSoup library and extracted the desired information, i.e.
- The city name
- The price and other details of the gold in that particular city
(3) We created a DataFrame using Pandas for the Python lists that we derived from the previous step
(4) We extracted detailed information on gold rates for the top cities in India, such as:
- Today's price
- Yesterday's price
- Price based on weight (gram)
- Daily price change
(5) We then created a Python dictionary to save all these details
(6) We converted the Python dictionaries into Pandas DataFrames
(7) With our DataFrame in hand, we then converted it into a single CSV file, which was the goal of our project.
Get Complete Jupyter notebook link below
Connect with me
LinkedIn: http://linkedin.com/in/abhishek-mehta2k