Scraping Gold Price using Python

Web Scraping Project End to End

Abhishek Mehta
12 min read · Dec 12, 2022

What is Web Scraping?

Web scraping is the process through which we extract data from a website and save it in a form that is easy to read, understand, and work with.

By ‘easy to work with’, we mean that the extracted data can be used to derive useful insights and answer questions that would be hard to tackle if we did not have the data stored in a simple, organised form, i.e. generally an Excel or CSV file.

How does Web Scraping Work?

In order to understand web scraping, it’s important to first understand that web pages are built using text-based markup languages, the most common of which is HTML.

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.

Note : Not all websites allow web scraping, especially when it comes to users’ personal data, so we always have to make sure that we are not publishing anyone’s personal data online. Websites usually have protections in place, and if they detect that we are downloading a large amount of data, they may block us from accessing the website.
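
As a precaution, it is good practice to identify our script and pause between requests. Below is a minimal sketch of that habit, using the requests library that we introduce shortly; the User-Agent string and the one-second delay are illustrative choices, not requirements of any particular site.

import time
import requests

#Illustrative headers; the User-Agent value here is an arbitrary example
headers = {"User-Agent": "gold-price-scraper (learning project)"}

for city in ["mumbai", "delhi"]:
    url = "https://www.goodreturns.in/gold-rates/" + city + ".html"
    response = requests.get(url, headers=headers)
    print(city, response.status_code)
    time.sleep(1)  #Pause between requests so we don't hammer the server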

Project Idea

In this project, we will parse the Good Returns website to get details about the gold rates in the top cities of INDIA.

We will retrieve information from the page ’Gold Rate in Top Cities of India’ using web scraping: the process of programmatically retrieving information from the web. Web scraping is not magic; people gather information from the web by hand every day. For example, a recent graduate might copy and paste information about the companies they are applying to into a job application management spreadsheet. Web scraping simply automates that kind of work.

Project Goal

The goal of the project is to create a web scraper that takes all the required information and compiles it into a CSV. The output CSV file format is given below:
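
Based on the fields we will scrape, the header row of the output CSV looks like this (one row per city, carat, and weight):

City,Carat,Gram,Today's Price,Yesterday's Price,Daily Price Change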

Project steps

Here is an outline of the steps we’ll follow:

  1. Download the webpage using requests
  2. Parse the HTML source code using the BeautifulSoup library and extract the desired information
  3. Build the scraper components
  4. Compile the extracted information into Python lists and dictionaries
  5. Convert the Python dictionaries into Pandas DataFrames
  6. Write the information to the final CSV file
  7. Future work and references

Let’s begin:

What is requests ?

Requests is a Python HTTP library that allows us to send HTTP requests to website servers, instead of using a browser to communicate with the web.

We use pip, a package-management system, to install and manage software packages. Since the platform we selected is Binder, we type a line of code starting with !pip install to install requests. You will see !pip again whenever we install other packages.

When we want to use prewritten functions from a library, we use the import statement. For example, after typing import requests following the installation, we are able to use any function from the requests library.

!pip install requests --quiet --upgrade
import requests

requests.get()

In order to download a web page, we use requests.get() to send the HTTP request to the Good Returns server, and what the function returns is a response object, which is the HTTP response.

city_name = 'mumbai'
#The URL Address of the webpage we will scrape, i.e. Gold Rate for Mumbai City
gold_rate_url = 'https://www.goodreturns.in/gold-rates/'+city_name+'.html'
response = requests.get(gold_rate_url) #requests.get()

Status code

Now, we have to check whether we successfully sent the HTTP request and received an HTTP response back. Because we’re NOT using a browser, there is no direct visual feedback telling us whether the request went through.

In general, the way to check whether the server sent an HTTP response back is the status code. In the requests library, requests.get returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, response.status_code is set to a value between 200 and 299.

response.status_code    #Here we are checking the Status code, -> 200-299 will mean that the request was successful
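
Alternatively, we can let requests perform this check for us. Here is a small sketch using the library’s built-in raise_for_status() method, which raises an error for any 4xx/5xx response and does nothing on success:

#Raises requests.exceptions.HTTPError if the request failed, otherwise does nothing
response.raise_for_status()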

The HTTP response contains HTML that is ready to be displayed in a browser. Here we can use response.text to retrieve the HTML document.

page_contents = response.text
len(page_contents) #The `len` function tells us the number of characters in `page_contents`

HOORAY! We have ~4.17 lakh (~417,000) characters in the HTML that we downloaded in just a second.

page_contents[:500]   #This displays the first 500 characters of `page_contents`

- What we see above is the source code of the web page. It is written in a language called HTML.

- It defines the content and structure of the web page, which browsers like Chrome then render and display.

with open("gold_price_mumbai.html", 'w') as f:  #Writing the html page to a file locally, i.e. a replica of the real html page
    f.write(page_contents)

Here, we save the text we received into an HTML file using the with open statement.

Now, an HTML file has been created with the name gold_price_mumbai.html.

Parse the HTML source code using Beautiful Soup library

What is Beautiful Soup?

Beautiful Soup is a Python package for parsing HTML and XML documents. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It’s a handy tool when it comes to web scraping. You can read more on their documentation site. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let’s install the library and import the BeautifulSoup class from the bs4 module.

!pip install beautifulsoup4 --quiet --upgrade
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')
#Now 'doc' contains entire html in parsed format
type(doc)

Inspecting the HTML source code of a web page

With the Beautiful Soup library, we can specify html.parser to ask Python to read the components of the page, instead of reading it as one long string.

What is HTML?

Before we dive into how to inspect HTML, we should cover some basic knowledge about HTML.

As noted earlier, HTML (the HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser, typically assisted by Cascading Style Sheets and scripting languages such as JavaScript.

An HTML tag consists of three parts (see the short sketch after this list):

  1. Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
  2. Attributes: (href, target, class, id, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
  3. Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.
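
To make these three parts concrete, here is a minimal sketch that parses a one-line HTML fragment (invented for illustration) with Beautiful Soup and reads off the tag’s name, attributes, and children:

from bs4 import BeautifulSoup

#A tiny, made-up HTML fragment for illustration
snippet = '<a href="https://example.com" class="link">Visit <b>example</b></a>'
tag = BeautifulSoup(snippet, 'html.parser').find('a')

print(tag.name)            #'a' -> the tag's name
print(tag.attrs)           #{'href': 'https://example.com', 'class': ['link']} -> its attributes
print(list(tag.children))  #['Visit ', <b>example</b>] -> its children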

Common tags and attributes

1. Tags in HTML

There are around 100 types of HTML tags, but on a day-to-day basis only around 15 to 20 of them are in common use, such as the <div>, <p>, <section>, <img>, and <a> tags.

Of these, I want to highlight the <a> tag, which can contain attributes such as href (hyperlink reference), because an <a> tag allows users to click and be directed to another page. That’s why the <a> tag is called the anchor tag.

2. Attributes

Each tag supports several attributes. Following are some common attributes used to modify the behaviour of tags; a short illustration follows the list:

  • id
  • style
  • class
  • href (used with <a>)
  • src (used with <img>)
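
As a quick illustration of attributes in action, the href attribute is how we would collect the links on the page we already parsed into doc:

#Collect the targets of the first few <a> tags on the page parsed earlier
links = [a.get('href') for a in doc.find_all('a') if a.get('href')]
links[:5]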

What we can do with a BeautifulSoup object is get a specific type of tag in the HTML by calling the name of the tag, as shown in the code cell below.

Here, we use the find() function of BeautifulSoup to find the first <title> tag in the HTML document and display its content

title = doc.find('title')
title

Inspecting HTML in the Browser

To view the source code of any webpage right within your browser, you can right-click anywhere on a page and select the “Inspect” option. This opens “Developer Tools” mode, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Inspecting the page this way, I hovered over the “Today 22 Carat Gold Price Per Gram in Mumbai (INR)” table to see how the content is structured. I found that the table sits inside a <div> tag, and I noted the specific class for this tag: class="gold_silver_table right-align-content".

Having downloaded a single page and parsed it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to extract the pieces of information we want.

City Name

Now we will use BeautifulSoup to extract the city name, gram weight, price, and change in price from the gold price table in the HTML page.

city_name = city_name.upper()
city_name

Gold Weight

table_div = doc.find("div",{"class": "gold_silver_table right-align-content"})  #The div that wraps the 22-carat price table
table = table_div.find("table").find_all("tr")  #All rows of that table
gram = table[1].find_all("td")[0].text  #First data row, first column: the weight in grams
gram

Today’s Price

today_price = table[1].find_all("td")[1].text.replace(",", "")
today_price

Yesterday’s Price

yesterday_price = table[1].find_all("td")[2].text.replace(",", "")
yesterday_price

Daily Price Change

daily_price_change = table[1].find_all("td")[3].text.strip().replace(",", "")      
daily_price_change

Creating a DataFrame using Pandas for Lists derived till now

What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

What is DataFrame?

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns. A DataFrame makes it easier for us to work with tabular data and analyse it.

!pip install pandas --quiet --upgrade   #Installing Pandas Library
import pandas as pd

Now, we will create a Python dictionary with the city name, gold weight, today's price, yesterday's price, and daily price change that we extracted above.

gold_price_dict = {'City' : [city_name],
                   'Gram' : [gram],
                   'Today Price' : [today_price],
                   'Yesterday Price' : [yesterday_price],
                   'Daily Price Change' : [daily_price_change]}
gold_price_dict
gold_price_df = pd.DataFrame(gold_price_dict)  #Here we convert the dictionary into a Pandas DataFrame

Now, let us check the DataFrame we have created, which contains the city name, gram, today's price, yesterday's price, and daily price change.

gold_price_df

We can see that the DataFrame contains the city name and the other details, so we can be sure we have extracted the required data.

We have finally created a DataFrame which contains the required information.

Next Steps

Now, we will go into each individual city's page and extract the required information.

city_list = ["ahmedabad","bangalore","bhubaneswar","chandigarh","chennai",
             "coimbatore","delhi","hyderabad","jaipur","kerala","kolkata",
             "lucknow","madurai","mangalore","mumbai","mysore","nagpur",
             "nashik","patna","pune","surat","vadodara","vijayawada","visakhapatnam"]
gold_price_list = []
for city_name in city_list:
    url = "https://www.goodreturns.in/gold-rates/" + city_name + ".html"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Url", url, "not found")
    content = response.text
    doc = BeautifulSoup(content, 'html.parser')
    div_22car = doc.find("div",{"class": "gold_silver_table right-align-content"})
    table = div_22car.find("table").find_all("tr")
    for data in range(1, len(table)):
        city = city_name.upper()
        gram = table[data].find_all("td")[0].text
        today_price = table[data].find_all("td")[1].text.replace(",", "")
        yesterday_price = table[data].find_all("td")[2].text.replace(",", "")
        daily_price_change = table[data].find_all("td")[3].text.strip().replace(",", "")
        gold_price_list.append({'City' : city,
                                'Gram' : gram,
                                "Today's Price" : today_price,
                                "Yesterday's Price" : yesterday_price,
                                "Daily Price Change" : daily_price_change})

Now we have all the required information for the gold price. Let us see what we have got.

gold_price_list

Now, we will write different functions to combine these steps for any given day.

install_libraries() will install all the required libraries

def install_libraries():
    !pip install jovian --upgrade --quiet
    !pip install Beautifulsoup4 --upgrade --quiet
    #Note: imports inside a function are local to it; we also rely on the
    #top-level imports made earlier in the notebook
    import requests
    from bs4 import BeautifulSoup
    from datetime import date
    import pandas as pd
    print("Successfully installed all the required Libraries")

get_doc() will take city_name as an input and provide us with a parsed HTML document

def get_doc(city_name):
    url = "https://www.goodreturns.in/gold-rates/" + city_name + ".html"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Url", url, "not found")
    content = response.text
    doc = BeautifulSoup(content, 'html.parser')
    return doc

append_carat_detail() will add the details for a particular carat of gold to the data

def append_carat_detail(city_name, data, table, gold_price_list):
    #Read the carat label (e.g. 22 carat or 24 carat) from the table heading
    carat = table[0].find_all("td",{"class": "heading"})[1].text[:8]
    city = city_name.upper()
    gram = table[data].find_all("td")[0].text
    today_price = table[data].find_all("td")[1].text.replace(",", "")
    yesterday_price = table[data].find_all("td")[2].text.replace(",", "")
    daily_price_change = table[data].find_all("td")[3].text.strip().replace(",", "")
    gold_price_list.append({'City' : city,
                            'Carat' : carat,
                            'Gram' : gram,
                            "Today's Price" : today_price,
                            "Yesterday's Price" : yesterday_price,
                            "Daily Price Change" : daily_price_change})
    return gold_price_list

list_of_dict() will return the data for all the cities

def list_of_dict(city_list):
    gold_price_list = []
    for city_name in city_list:
        #FUNCTION CALL `get_doc()`
        doc = get_doc(city_name)

        div_car = doc.find_all("div",{"class": "gold_silver_table right-align-content"})

        table_22 = div_car[0].find("table").find_all("tr")  #22-carat table
        table_24 = div_car[1].find("table").find_all("tr")  #24-carat table

        for data in range(1, len(table_22)):
            gold_price_list = append_carat_detail(city_name, data, table_22, gold_price_list)
            gold_price_list = append_carat_detail(city_name, data, table_24, gold_price_list)
    return gold_price_list

list_gold_price_today() will give us a list of dictionaries containing all the required data

def list_gold_price_today():
    city_list = ["ahmedabad","bangalore","bhubaneswar","chandigarh","chennai","coimbatore","delhi","hyderabad","jaipur","kerala","kolkata","lucknow","madurai","mangalore","mumbai","mysore","nagpur","nashik","patna","pune","surat","vadodara","vijayawada","visakhapatnam"]
    gold_price_list = list_of_dict(city_list)
    return gold_price_list

Let us test whether the function is working fine. To do this, we will call our function list_gold_price_today().

gold_price = list_gold_price_today()  #To fetch the data for all the top cities in India
gold_price

Now that we have the information for all the cities, let us convert this list of dictionaries to a DataFrame, just like we did previously, to work easily with the tabular data using Pandas.

gold_price_today_df = pd.DataFrame(gold_price)
gold_price_today_df
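
As an aside, pandas can often parse <table> elements straight out of a page with pd.read_html(). The sketch below is an alternative worth knowing about, not the method used in this project: it assumes the lxml (or html5lib) parser is installed and that the tables appear in the raw HTML rather than being rendered by JavaScript, and the table index 0 is a guess you would need to verify by inspecting the returned list.

#Hypothetical alternative: read_html returns a list of DataFrames, one per <table> it can parse
tables = pd.read_html("https://www.goodreturns.in/gold-rates/mumbai.html")
tables[0].head()  #Index 0 is an assumption; inspect the list to be sure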

Let us now save this DataFrame as a CSV file. For that, we write a function write_csv().

def write_csv(items, path):
    with open(path, 'w', encoding="utf-8") as f:
        if len(items) == 0:
            return

        #write the headers
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')

        #write one comma-separated line per dictionary
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + '\n')

Using the write_csv() function, we will write our gold_price data into a .csv file.

print("Converting dictionary to a csv file")
write_csv(gold_price,"todayprice.csv")
print("Successfully created the csv file ")
print("Converting dictionary to a csv file")
write_csv(gold_price,"todayprice.csv")
print("Successfully created the csv file ")

Now you can see the todayprice.csv file via File --> Open.

Let’s write a single function which will parse the data and also write it back to a CSV file.

def get_today_price():
    #FUNCTION CALL `install_libraries()`
    install_libraries()
    from datetime import date

    #Name the file after today's date, e.g. December-12-2022-gold-price.csv
    path = date.today().strftime("%B-%d-%Y") + '-gold-price.csv'
    print("Writing Today's Gold Price to the file", path)

    #FUNCTION CALL `write_csv`
    write_csv(list_gold_price_today(), path)

    print("Today's Gold Price written to file", path)

    #Reading the csv file back into a DataFrame
    print("Printing your csv file", path)
    return pd.read_csv(path)

Let us test whether the function is working fine. To do this, we will call our function get_today_price().

df = get_today_price()
df

Summary

Finally, we have managed to parse the Good Returns website and get our hands on some very interesting and insightful financial data.
We have saved all the information we extracted from the website in a CSV file, which we can use to answer many questions, e.g. Which city has the highest gold price in India? And, if we run the scraper every day and keep the files, what was the price of gold in Mumbai on 10 July 2022?
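
As a taste of that, here is a hypothetical sketch answering the first question with the DataFrame returned by get_today_price(). The exact cell strings ("22 Carat", "1 gram") and the price formatting depend on the site’s markup, so inspect your scraped data and adjust accordingly.

df = get_today_price()
#Keep one carat and one weight so the prices are comparable across cities
subset = df[(df["Carat"] == "22 Carat") & (df["Gram"] == "1 gram")].copy()
#Strip any stray non-numeric characters before comparing prices
subset["Today's Price"] = pd.to_numeric(subset["Today's Price"].astype(str).str.replace(r"[^\d.]", "", regex=True))
print(subset.sort_values("Today's Price", ascending=False).head(1))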

Let us look at the steps that we took from start to finish :

(1) We downloaded the webpage using requests

(2) We parsed the HTML source code using the BeautifulSoup library and extracted the desired information, i.e.

  • The City Name
  • The Price and other details of Gold in that particular city

(3) We created a DataFrame using Pandas for the Python lists we derived in the previous step

(4) We extracted detailed information on gold rates for the top cities in India, such as:

  • Today’s Price
  • Yesterday’s Price
  • Price based on Weight (gram)
  • Daily Price Change

(5) We then created a Python dictionary to save all these details

(6) We converted the Python dictionaries into a Pandas DataFrame

(7) With our DataFrame in hand, we then converted it into a single CSV file, which was the goal of our project.
