Web Scraping with Python in Indonesian E-Commerce (Tokopedia)
In 2006, British mathematician Clive Humby, the marketing mind behind Tesco's Clubcard, declared that "Data is the new oil." Today we can see that data has become a powerful force shaping the direction of the world. It can guide the next action a business needs to take, boost sales by recommending products that match a customer's taste, power artificial intelligence that reduces manual work, and more.
In this article, we will study how to extract data from an existing website, a practice usually called web scraping. As a case study, we will use Tokopedia, an Indonesian e-commerce site.
Getting the Data
The first step in web scraping is to decide what data we want to get. In this case, I want to get shoe (sepatu) data, sorted by review (ulasan).
Then let's look at the website. A web page is built with a markup language named HTML, and we can get the data we need by searching the page's HTML thoroughly.
First, let’s open Tokopedia’s page at https://www.tokopedia.com.
Then let’s search for shoes on the search bar. Shoes in Indonesia are called sepatu, so I will insert the word “sepatu” into the search bar.
By default the results are sorted by relevance, so let's sort by review instead by changing the "Urutkan" (sort) dropdown to "Ulasan" (review).
If we inspect the page with the browser's developer tools, we can see that each product card uses the class css-y5gcsw. Inside the card, we can see some information about the product. I'm interested in the name, price, city, and image URL of each product, so let's look at the HTML elements that hold these data.
First, let's create a file named scraper.py as a place for the Scraper class to reside. The Scraper class will have the responsibility of getting the data from the website. In this class, let's create a property named driver that will hold a Selenium WebDriver. WebDriver is the class Selenium uses to create a session with the browser and communicate with it; here we create a Firefox() instance from the webdriver module.

from selenium import webdriver
class Scraper:
    def __init__(self):
        self.driver = webdriver.Firefox()
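As a side note, if you prefer not to have a browser window pop up, Selenium's Firefox driver can also run headless. This is an optional variation, not part of the original setup:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("-headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)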
Next, let's create a method named get_data() that gets the data from the website. For this purpose, we need the URL of the search results. If we look at the website again, we can see the URL is https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu. We can open that URL in the browser session with the driver.get("URL") function.

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
        counter_page = 0
        datas = []
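A quick note on that URL: the keyword sits in the q parameter, and ob=5 appears to be what selects sorting by review; this is inferred from the URL above, not from any official documentation. A small sketch that builds such a URL for any keyword:

from urllib.parse import urlencode

def search_url(keyword):
    # q holds the search keyword; ob=5 appears to select "sort by review"
    # (inferred from the URL above, not officially documented).
    params = {'st': 'product', 'q': keyword, 'ob': 5}
    return 'https://www.tokopedia.com/search?' + urlencode(params)

print(search_url('sepatu'))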
The product cards appear to load lazily as we scroll, so we need to scroll through the page before reading it. When I checked the page, I found it is around 6,500 pixels tall, so we will scroll 500 pixels at a time. On each iteration we wait 0.1 seconds so we do not put too much load on the server at once.

        while counter_page < 10:
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")
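The 6,500-pixel figure is specific to this page, so hard-coding it is fragile. A possible variation (my sketch, not from the original code) is to keep scrolling until the scroll position stops changing:

import time

def scroll_to_bottom(driver, step=500, pause=0.1):
    # Scroll step by step until window.pageYOffset stops changing,
    # which means we have reached the bottom of the page.
    last_pos = -1
    while True:
        driver.execute_script("window.scrollBy(0, arguments[0])", step)
        time.sleep(pause)
        pos = driver.execute_script("return window.pageYOffset")
        if pos == last_pos:
            break
        last_pos = pos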
After scrolling, we can find every product card by its class name, extract the information we want from each card, and append it to the datas variable.

            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text
                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })
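These generated CSS class names can differ between cards (ads, for instance, or cards missing a field), and find_element raises NoSuchElementException when no match is found, which would abort the whole scrape. A defensive helper, as a sketch under that assumption (the name safe_text is mine):

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def safe_text(element, class_name):
    # Return the child element's text, or '' if this card doesn't have it.
    try:
        return element.find_element(by=By.CLASS_NAME, value=class_name).text
    except NoSuchElementException:
        return ''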
After collecting the data on the current page, we move on to the next one. If we inspect a pagination button, we can see it has the class css-1ix4b60-unf-pagination-item, and we can specify which button to click by using the counter variable. We only click while there are more pages to scrape; otherwise the last iteration would look for a page button that may not exist.

            counter_page += 1
            if counter_page < 10:
                next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
                next_page.click()
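Clicking immediately after the page renders can be flaky. One possible variation of the snippet above (a sketch, not part of the original article) waits until the button is actually clickable using Selenium's explicit waits:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

            # Wait up to 10 seconds for the pagination button to become clickable.
            xpath = "//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']"
            next_page = WebDriverWait(self.driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, xpath))
            )
            next_page.click()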
And finally, return the data as the function’s return value.
        return datas
The complete scraper.py now looks like this.

from selenium.webdriver.common.by import By
from selenium import webdriver
import time


class Scraper:
    def __init__(self):
        # A Firefox session that the scraper drives.
        self.driver = webdriver.Firefox()

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
        counter_page = 0
        datas = []
        while counter_page < 10:
            # Scroll gradually so the lazily loaded product cards render.
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")
            # Each product card on the search page has the class css-y5gcsw.
            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text
                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })
            counter_page += 1
            # Only click the next pagination button while pages remain.
            if counter_page < 10:
                next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
                next_page.click()
        return datas
To try the Scraper, let's create a main.py file that runs it and prints the data to the terminal.

from scraper import Scraper

if __name__ == "__main__":
    scraper = Scraper()
    datas = scraper.get_data()
    index = 1
    for data in datas:
        print(index, data['name'], data['img'], data['price'], data['city'])
        index += 1
Present the Data as a CSV File
Next, let's create a file named csv_creator.py as the file where the class that generates the CSV resides. In it, we create a class named CSVCreator that has the responsibility of creating a CSV file from the data provided.

import csv

class CSVCreator:

Then let's create a static method named create_csv. This method will receive the data generated by the Scraper object.

    @staticmethod
    def create_csv(datas):

Each item generated by the Scraper is a dictionary with name, price, city, and img as its keys. So let's define those keys; they will be useful later.

        fields = ["name", "price", "city", "img"]
To write a file in Python, we need a file object (a TextIOWrapper). We can create one with this code, passing newline='' so the csv module does not insert blank lines between rows on some platforms.

        with open('shoes.csv', 'w', newline='') as f:

Then we can use the csv library to write our data into the shoes.csv file. First, we need to create a DictWriter object from the csv library. This is needed because each of our data items is a Python dictionary, and by passing our fields variable to the DictWriter it can recognize which keys of the data need to be written to the file. We create the writer and write the header and rows with this code.

            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(datas)
And that's it for the CSVCreator class. The full csv_creator.py looks like this.

import csv


class CSVCreator:
    @staticmethod
    def create_csv(datas):
        # The dictionary keys produced by the Scraper become the CSV columns.
        fields = ["name", "price", "city", "img"]
        # newline='' prevents blank lines between rows on some platforms.
        with open('shoes.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(datas)
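As a quick sanity check (my addition, not part of the original article), we can read the file back with csv.DictReader and print the first row:

import csv

with open('shoes.csv', newline='') as f:
    reader = csv.DictReader(f)
    print(next(reader, None))  # the first scraped product, or None if empty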
Back in the main.py file, we can delete the part that prints all the data to the terminal and instead import the CSVCreator and call its create_csv method. The code becomes this.

from scraper import Scraper
from csv_creator import CSVCreator

if __name__ == "__main__":
    scraper = Scraper()
    datas = scraper.get_data()
    CSVCreator.create_csv(datas)
If we run the main.py file, we can see that the shoes.csv file is created and filled with the scraped data.

Present the Data as a Web Page
To present the data as a web page, we will use Flask in this section; we can install Flask by following this link. For the table itself we will use DataTables, because it makes the data easy to browse and gives us many features for free, and it can be loaded from our HTML page directly. So let's create an HTML file named view.html and fill it with this code.
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    <link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
    <title>Product Data</title>
</head>
<body>
    <nav class="navbar navbar-expand-md navbar-light bg-light">
        <a class="navbar-brand" href="{{url_for('index')}}">Product Data</a>
    </nav>
    <div class="container">
        <table id="datas" class="display table" style="width:100%">
            <thead>
                <tr>
                    <th>Name</th>
                    <th>Price</th>
                    <th>City</th>
                    <th>Image</th>
                </tr>
            </thead>
            <tbody>
                {% for item in datas %}
                <tr>
                    <td>{{item["name"]}}</td>
                    <td>{{item["price"]}}</td>
                    <td>{{item["city"]}}</td>
                    <td>
                        <img src="{{item['img']}}" alt="" srcset="">
                    </td>
                </tr>
                {% endfor %}
            </tbody>
        </table>
    </div>
    <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
    <script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
    <script>
        $(document).ready(function () {
            $('#datas').DataTable({
                pagingType: 'full_numbers',
            });
        });
    </script>
</body>
</html>
In this file, we import the CSS for Bootstrap and DataTables in these two lines.
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
Then we import the JavaScript for jQuery, Bootstrap, and DataTables in these lines.
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
We build the table and fill it with our data using Jinja2's {% %} and {{ }} template syntax.
<table id="datas" class="display table" style="width:100%">
    <thead>
        <tr>
            <th>Name</th>
            <th>Price</th>
            <th>City</th>
            <th>Image</th>
        </tr>
    </thead>
    <tbody>
        {% for item in datas %}
        <tr>
            <td>{{item["name"]}}</td>
            <td>{{item["price"]}}</td>
            <td>{{item["city"]}}</td>
            <td>
                <img src="{{item['img']}}" alt="" srcset="">
            </td>
        </tr>
        {% endfor %}
    </tbody>
</table>
Finally, we initialize the DataTables table with this code.
<script>
    $(document).ready(function () {
        $('#datas').DataTable({
            pagingType: 'full_numbers',
        });
    });
</script>
Now let's open the main.py file again and modify it to look like this.
from scraper import Scraper
from csv_creator import CSVCreator
from flask import Flask, render_template

scraper = Scraper()
datas = scraper.get_data()

app = Flask(__name__, template_folder=".")

@app.route('/')
def index():
    return render_template('view.html', datas=datas)

if __name__ == "__main__":
    CSVCreator.create_csv(datas)
    app.run()
In this file, we serve the scraped data through Flask's routing. First, we create the Flask app.

app = Flask(__name__, template_folder=".")

Notice the template_folder="." parameter. It is needed because we did not create a templates folder, which is the default folder a Flask app searches for HTML files; with ".", Flask looks for view.html in the current directory instead.
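Alternatively (my suggestion, not from the original article), you can move view.html into a folder named templates next to main.py; then Flask's default lookup works and no extra parameter is needed:

# Assumes the layout: main.py, templates/view.html
app = Flask(__name__)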
Then we define the index route.

@app.route('/')
def index():
    return render_template('view.html', datas=datas)

Here we set view.html as the HTML file that will be rendered to the browser and pass the data through the datas=datas argument. Finally, we run the Flask app with this line.

app.run()
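By default, app.run() starts Flask's development server at http://127.0.0.1:5000/, so we can open that address in a browser to see the table.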
If we run the main.py file and the scraping process finishes, we will get a CSV file, and the Flask app can be seen running in the terminal.

And that's all from me. Of course, we can improve the app further by adding features, grabbing more useful data, presenting it in a better format, and so on. This is the link to the repository if you want to see the code directly.