Web Scraping with Python in Indonesian E-Commerce (Tokopedia)
In 2006, British mathematician and Tesco marketing strategist Clive Humby coined the phrase "Data is the new oil". Today we can see that data has become a powerful force shaping the direction of the world. It can decide the next action a business needs to take, increase sales by recommending products that match a customer's taste, power Artificial Intelligence that reduces human work, and more.
In this article, we will study how to get data from an existing website, an activity usually called web scraping. As a case study, we will use Tokopedia, an Indonesian e-commerce site.
Getting the Data
The first step in web scraping is to decide what data we want to get. In this case, I want to get shoe (sepatu) data, sorted by review (ulasan).
Then let's look at the website. A web page is built with a markup language named HTML, and we can get the data we need by searching through the HTML of the page.
First, let's open Tokopedia's page at
https://www.tokopedia.com.
Then let's search for shoes in the search bar. Shoes are called sepatu in Indonesian, so I will type "sepatu" into the search bar.
By default the results are sorted by relevance, so let's sort by review instead by changing the "Urutkan" (sort) dropdown to "Ulasan" (review).
If we inspect the page, we can see that each product card has the class css-y5gcsw. Inside the card, we can see some information about the product.
I'm interested in the name, price, city, and image URL of each product, so let's look at the HTML elements that hold this data.
The name is in an element with the css-1b6t4dn class, the price with the css-1ksb19c class, the city with the css-1kdc32b class, and the image with the css-1c345mg class.
Create a file named scraper.py as a place for the Scraper class to reside. The Scraper class will have the responsibility of getting the data from the website. In this class, let's create a property named driver that will be filled with a Selenium Webdriver. Webdriver is the class Selenium uses to create a session with the browser and communicate with it: if the webdriver commands the browser to open a certain page, that page will be opened in the browser. To create a Webdriver object connected to the Firefox browser, we can call webdriver.Firefox().

from selenium import webdriver

class Scraper:
    def __init__(self):
        self.driver = webdriver.Firefox()
Then let's create a method named get_data() to get the data from the website. For this, we need the URL of the search results page. If we look at the website again, we can see the URL is https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu. We can open it with the driver.get("URL") function.

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
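As an aside, that long URL is just Tokopedia's search endpoint with query parameters: q=sepatu is the search term, and ob=5 appears to be the code for the review sort (an observation from the page, not a documented API). Here is a hedged sketch of building a cleaner version of the URL with the standard library — search_url is my own helper name, not part of the article's code:

```python
from urllib.parse import urlencode

def search_url(query, sort_by_review=True):
    """Build a Tokopedia search URL; ob=5 seems to select the review sort."""
    params = {"st": "product", "q": query}
    if sort_by_review:
        params["ob"] = "5"
    return "https://www.tokopedia.com/search?" + urlencode(params)
```

This drops the empty tracking parameters (navsource, srp_page_id, and so on), which the search page does not seem to need.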
Then let's create a counter for the current results page and a list to hold all the data.

        counter_page = 0
        datas = []
We will get the data up to page 10. For each page, we make the driver scroll to the end of the page, because the page only loads products as we scroll past them. When I checked the page, I found it is around 6500 pixels tall, so we will scroll 500 pixels at a time. On each iteration we wait 0.1 seconds so we don't put too much load on the server at once.
        while counter_page < 10:
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")
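To make the numbers in that loop concrete, here is a small helper (scroll_plan is my own name, not part of the article's code) that computes the offsets the loop scrolls through and how long it pauses in total, assuming the 6500-pixel height and 500-pixel step described above:

```python
def scroll_plan(page_height=6500, step=500, pause=0.1):
    """Return the scroll offsets the loop visits and the total pause time."""
    offsets = list(range(0, page_height, step))
    return offsets, len(offsets) * pause
```

So each page gets 13 scrolls of 500 pixels, with about 1.3 seconds of waiting in total.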
After the scrolling loop, we get the card elements, iterate over all of them, extract the name, price, city, and image data, and finally append each record to the datas list.
            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text
                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })
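Note that the scraped price is a display string like "Rp150.000" (Indonesian formatting uses dots as thousands separators). If you later want to sort or aggregate by price, a hedged helper like this can convert it to an integer — parse_price is my own addition, assuming that display format:

```python
def parse_price(price_text):
    """Convert a display price such as 'Rp150.000' into an integer rupiah amount."""
    digits = "".join(ch for ch in price_text if ch.isdigit())
    return int(digits) if digits else None
```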
To move to the next page, we click the pagination button, which has css-1ix4b60-unf-pagination-item as its class. We can specify which button to click by using the counter variable.
            counter_page += 1
            next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
            next_page.click()
        return datas
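The XPath above is built by string concatenation, which is easy to get wrong. A small helper (pagination_xpath is my own name, not part of the article's code) makes the construction explicit and easy to test; it assumes the button class and label observed on the page:

```python
def pagination_xpath(page_number):
    """XPath for the pagination button whose label is the given page number."""
    return (
        "//button[@class='css-1ix4b60-unf-pagination-item'"
        f" and text()='{page_number}']"
    )
```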
Here is the overall code for scraper.py.
from selenium.webdriver.common.by import By
from selenium import webdriver
import time

class Scraper:
    def __init__(self):
        self.driver = webdriver.Firefox()

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
        counter_page = 0
        datas = []
        while counter_page < 10:
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")
            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text
                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })
            counter_page += 1
            next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
            next_page.click()
        return datas
Then let's create a main.py file that runs the Scraper and prints the data.

from scraper import Scraper

if __name__ == "__main__":
    scraper = Scraper()
    datas = scraper.get_data()
    index = 1
    for data in datas:
        print(
            index,
            data['name'],
            data['img'],
            data['price'],
            data['city']
        )
        index += 1
Present the Data as a CSV File
Create a file named csv_creator.py as the place where the class that generates the CSV resides. In it, let's create a class named CSVCreator that has the responsibility of creating a CSV file from the data provided.

import csv

class CSVCreator:
Then let's create a static method named create_csv. This method will receive the data generated by the Scraper object.

    @staticmethod
    def create_csv(datas):
Each item generated by the Scraper is a dictionary with name, price, city, and img as its keys. So let's define the keys; they will be useful later.

        fields = ["name", "price", "city", "img"]
Then we need a file object to write to (open() returns a TextIOWrapper). We can create it with this code; passing newline='' is recommended by the csv documentation to avoid blank lines on some platforms.

        with open('shoes.csv', 'w', newline='') as f:
Next, we use the csv library to write our data into the shoes.csv file. First, we create a DictWriter object from the csv library; this is needed because each of our data items is a Python dictionary. We pass our fields variable to the DictWriter object so it can recognize the keys of the data that need to be written to the file. We can create the object with this code.

            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(datas)
And that's all for our CSVCreator class. Here is the full code.
import csv

class CSVCreator:
    @staticmethod
    def create_csv(datas):
        fields = ["name", "price", "city", "img"]
        with open('shoes.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(datas)
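To check that this round trip works, here is a small self-contained sketch that writes a dictionary through csv.DictWriter into an in-memory buffer and reads it back with csv.DictReader (the sample row is made up for illustration):

```python
import csv
import io

fields = ["name", "price", "city", "img"]
rows = [{"name": "Sepatu A", "price": "Rp150.000",
         "city": "Jakarta", "img": "https://example.com/a.jpg"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()   # first line becomes: name,price,city,img
writer.writerows(rows)

buf.seek(0)
restored = list(csv.DictReader(buf))
```

csv.DictReader reads every field back as a string, so here restored compares equal to the original rows.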
Back in the main.py file, we can delete the part that prints all the data to the terminal and replace it with an import of CSVCreator and a call to the create_csv method. The code will look like this.
from scraper import Scraper
from csv_creator import CSVCreator

if __name__ == "__main__":
    scraper = Scraper()
    datas = scraper.get_data()
    CSVCreator.create_csv(datas)
If we run the main.py file, we can see that the shoes.csv file was created and contains data like this.
Present the Data as a Web Page
We will use Flask in this section; it can be installed by following its installation guide. We will also use the datatables library, because it lets us browse the data easily and gives us many features for free when we build a table with it. Let's create a file named view.html and fill it with this code.
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
<title>Product Data</title>
</head>
<body>
<nav class="navbar navbar-expand-md navbar-light bg-light">
<a class="navbar-brand" href="{{url_for('index')}}">Product Data</a>
</nav>
<div class="container">
<table id="datas" class="display table" style="width:100%">
<thead>
<tr>
<th>Name</th>
<th>Price</th>
<th>City</th>
<th>Image</th>
</tr>
</thead>
<tbody>
{% for item in datas %}
<tr>
<td>{{item["name"]}}</td>
<td>{{item["price"]}}</td>
<td>{{item["city"]}}</td>
<td>
<img src="{{item['img']}}" alt="" srcset="">
</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
<script>
$(document).ready(function () {
$('#datas').DataTable({
pagingType: 'full_numbers',
});
});
</script>
</body>
</html>
We load the CSS for bootstrap and datatables in these two lines.
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
And we load the scripts for jQuery, bootstrap, and datatables in these lines.
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
We fill the table with data using Jinja template syntax: statements like the for loop go inside {% %}, and expressions to be rendered go inside {{ }}.
<table id="datas" class="display table" style="width:100%">
<thead>
<tr>
<th>Name</th>
<th>Price</th>
<th>City</th>
<th>Image</th>
</tr>
</thead>
<tbody>
{% for item in datas %}
<tr>
<td>{{item["name"]}}</td>
<td>{{item["price"]}}</td>
<td>{{item["city"]}}</td>
<td>
<img src="{{item['img']}}" alt="" srcset="">
</td>
</tr>
{% endfor %}
</tbody>
</table>
Finally, we initialize datatables on our table with this code.
<script>
$(document).ready(function () {
$('#datas').DataTable({
pagingType:'full_numbers',
});
});
</script>
Then let's open the main.py file again and modify it to be like this.
from scraper import Scraper
from csv_creator import CSVCreator
from flask import Flask, render_template
scraper = Scraper()
datas = scraper.get_data()
app = Flask(__name__, template_folder=".")
@app.route('/')
def index():
return render_template('view.html', datas=datas)
if __name__ == "__main__":
CSVCreator.create_csv(datas)
app.run()
Let's go through the code. First, we create the Flask app.
app = Flask(__name__, template_folder=".")
Note the template_folder="." parameter. It is needed because we did not create a templates folder, which is the default folder where a Flask app searches for HTML files.
@app.route('/')
def index():
return render_template('view.html', datas=datas)
This route renders view.html as the HTML file sent to the browser, passing the data through the datas=datas argument. Finally, we run the Flask app with this line.

app.run()
When we run the main.py file and the scraping process is done, we will get a CSV file, and the Flask app can be seen running in the terminal.