Web Scraping with Python in Indonesian E-Commerce (Tokopedia)

In 2006, British mathematician and Tesco marketing mastermind Clive Humby declared, “Data is the new oil”. Today we can see that data has become a powerful force shaping the direction of the world. It can decide the next action a business needs to take, increase sales by recommending products that match a customer’s taste, power Artificial Intelligence that reduces human work, and so on.

In this article, we will study how to get data from an existing website, an activity usually called web scraping. As a case study, we will use Tokopedia, an Indonesian e-commerce platform.

Getting the Data

The first step in web scraping is to decide what data we want to get. In this case, I want to get shoe (sepatu) data, sorted by review (ulasan).

Then let’s look at the website. The page is built with a markup language called HTML, and we can find the data we need by inspecting the page’s HTML thoroughly.

First, let’s open Tokopedia’s page at https://www.tokopedia.com.

Then let’s search for shoes using the search bar. “Shoes” in Indonesian is sepatu, so I will type the word “sepatu” into the search bar.

By default the results are sorted by relevance, so let’s sort by review instead by changing the “Urutkan” (sort) dropdown to “Ulasan” (review).

Then let’s look at the HTML by using the browser’s inspect element tool and pointing at a product card.
We can see that the card has a class named css-y5gcsw. Inside the card, we can see some information about the product.

I’m interested in the name, price, city, and image URL of each product, so let’s look at the HTML elements for these fields.

We can see that we can get the name from the css-1b6t4dn class, the price from the css-1ksb19c class, the city from the css-1kdc32b class, and the image from the css-1c345mg class.
After recognizing the HTML of the page, let’s create a script to get the data from it.
Because Tokopedia uses a JavaScript framework to build its website, we will use a browser automation library named Selenium to read the rendered HTML. Of course, you need to install the library first, and we need a browser too. You can follow the installation of Selenium at this link, and don’t forget to use a Python virtual environment for this project. For the browser itself, we will use Firefox for the automation process.
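Before writing the scraper, it is worth a quick sanity check that Selenium and the Firefox driver work. Here is a minimal sketch, assuming geckodriver is already installed and available on your PATH:

from selenium import webdriver

# Minimal check: open Firefox, load Tokopedia, print the page title, then quit.
driver = webdriver.Firefox()
driver.get('https://www.tokopedia.com')
print(driver.title)
driver.quit()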
Then, let’s create a file named scraper.py as the place for the Scraper class to reside.
Let’s create a class named Scraper that will have the responsibility of getting the data from the website. In this class, let’s create a property named driver that will hold a Selenium WebDriver. The WebDriver is the object Selenium uses to create a session with the browser and communicate with it.
So if the WebDriver commands the browser to open a certain page, the page will be opened in the browser. To create a WebDriver object connected to the Firefox browser, we can instantiate webdriver.Firefox().
from selenium import webdriver

class Scraper:

	def __init__(self):
		self.driver = webdriver.Firefox()  
Then let’s create a function named get_data() to get the data from the website. For this purpose, we need the URL of the search page. If we look at the website again, we can see the URL is https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu.
Let’s make the driver command the browser to open that URL by calling the driver.get(URL) function.
def get_data(self):
	self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
Then let’s create a counter for the page currently showing the products and a list to hold all the data.
counter_page = 0
datas = []
We will get the data up to page 10. For each page, we will make the driver command the browser to scroll to the end of the page, because the page will not load the products if we do not scroll through it. When I checked the page, I found it is around 6500 pixels tall, and we will scroll 500 pixels at a time. For each scroll step, we will wait 0.1 seconds so we do not put too much load on the server at once.
while counter_page < 10:
	for _ in range(0, 6500, 500):
		time.sleep(0.1)
		self.driver.execute_script("window.scrollBy(0,500)")
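As a side note, the 6500-pixel figure is just what the page happened to measure when I checked it. A variant that keeps scrolling until it actually reaches the bottom of the page, sketched below as an optional drop-in replacement for the loop inside get_data() (my own variation, not what this article uses), would look like this:

# Optional alternative: scroll 500 pixels at a time until the current position
# passes the reported page height, instead of assuming the page is ~6500px tall.
position = 0
height = self.driver.execute_script("return document.body.scrollHeight")
while position < height:
	self.driver.execute_script("window.scrollBy(0,500)")
	position += 500
	time.sleep(0.1)
	height = self.driver.execute_script("return document.body.scrollHeight")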
After the scrolling loop, we will get the card elements, iterate over all of them, get the name, price, city, and image data, and finally put the data into the datas variable.
elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
for element in elements:
	img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
	name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
	price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
	city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

	datas.append({
		'img': img,
		'name': name,
		'price': price,
		'city': city
	})
After we get all the data on a page, we can go to the next page by making the driver click the next page button. If we check the HTML of the page, we can find that the page button has css-1ix4b60-unf-pagination-item as its class, and we can specify which button we want to click by using the counter variable. We only click while there are still pages left to visit.
counter_page += 1
if counter_page < 10:
	# only click the next page button while there are more pages to scrape
	next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
	next_page.click()

And finally, return the data as the function’s return value.

return datas
For the complete code, you can check below.

from selenium.webdriver.common.by import By
from selenium import webdriver
import time

class Scraper:
  def __init__(self):
    self.driver = webdriver.Firefox()

  def get_data(self):
    self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
    
    counter_page = 0
    datas = []

    while counter_page < 10:
      for _ in range(0, 6500, 500):
        time.sleep(0.1)
        self.driver.execute_script("window.scrollBy(0,500)")

      elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
      for element in elements:
        img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
        name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
        price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
        city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

        datas.append({
          'img': img,
          'name': name,
          'price': price,
          'city': city
        })

      counter_page += 1
      # Only click the next page button while there are more pages to scrape
      if counter_page < 10:
        next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
        next_page.click()
    
    return datas
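One caveat about the code above: it assumes every product card contains all four elements. In practice a card may be missing one of them (promoted or ad cards, for example), and a single failed find_element call would abort the whole run. A defensive variant of the inner loop, sketched below as an optional hardening step rather than part of the original code, simply skips incomplete cards:

from selenium.common.exceptions import NoSuchElementException

# Optional variant of the inner loop: skip cards missing an expected element.
for element in elements:
  try:
    img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
    name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
    price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
    city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text
  except NoSuchElementException:
    continue  # ignore incomplete cards instead of letting the whole scrape fail
  datas.append({'img': img, 'name': name, 'price': price, 'city': city})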
 
And then let’s create a file named “main.py” to check the class functionality. Fill the file with this code.

from scraper import Scraper

if __name__ == "__main__":
  scraper = Scraper()

  datas = scraper.get_data()
  index = 1

  for data in datas:
    print(
      index,
      data['name'],
      data['img'],
      data['price'],
      data['city']
    )

    index += 1
If we run the file, Firefox will open and the browser will automatically navigate as the driver instructs in our code. Then we can see the result in the terminal.

We can see that we get 700 product entries from the shoe search results.

Next, we will try to present the data in a format other than printing it directly to the terminal.

Present the Data as a CSV File

Comma-separated values, usually called CSV, is a file format that uses a comma (,) to separate values. It is quite different from Excel files, which separate values into cells when opened with spreadsheet applications like Microsoft Excel or Google Sheets.
The CSV format is widely used in the tech industry nowadays. I like it because it is easy to tell the columns apart (unless your data values contain commas too) and you can open it the same way as an ordinary text file format like txt.
So now we will write the data we scraped earlier to a CSV file.
Let’s create a file named csv_creator.py as the file where the class that generates the CSV resides.
Then let’s create a class named CSVCreator that has the responsibility of creating a CSV file from the data provided.
import csv

class CSVCreator:
After that, we will create a static method in that class so we don’t need to create an object from the class every time we want to generate the CSV file. Let’s call the method create_csv. This method will receive the data generated by the Scraper object.
@staticmethod
def create_csv(datas):
 
As we declared before, each entry generated by the Scraper is a dictionary with name, price, city, and img as its keys. So let’s define the keys; they will be useful later.

fields = ["name", "price", "city", "img"]
To write the data to a file on disk in Python 3, we of course need a file object (a TextIOWrapper). We can create one with this code.

with open('shoes.csv', 'w') as f:
Then, let’s use the csv library to write our data into the shoes.csv file. First, we need to create a DictWriter object from the csv library. This is needed because each of our data entries is a Python 3 dictionary. We will pass our fields variable to the DictWriter object so it can recognize the keys of the data that need to be written to the file. We can create the object with this code.
writer = csv.DictWriter(f, fieldnames=fields)
A CSV file usually has two parts: the header and the values. This is how we recognize what kind of value is in each column. With this code, we write the header and then all the data rows to the file.
writer.writeheader()
writer.writerows(datas)
And that’s all for our CSVCreator class. For the complete code, see below.
import csv

class CSVCreator:

  @staticmethod
  def create_csv(datas):
    fields = ["name", "price", "city", "img"]

    with open('shoes.csv', 'w') as f:
      writer = csv.DictWriter(f, fieldnames=fields)

      writer.writeheader()
      writer.writerows(datas)
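As a quick standalone usage sketch (the row below contains placeholder values purely for illustration, not real scraped data), the class can be exercised like this:

from csv_creator import CSVCreator

# Hypothetical input with the same shape as the scraper's output.
sample = [
  {'name': 'Example Shoe', 'price': 'Rp100.000', 'city': 'Jakarta Barat', 'img': 'https://example.com/shoe.jpg'},
]
CSVCreator.create_csv(sample)  # writes shoes.csv with a header row and one data row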
In the main.py file, we can delete the part that prints all the data to the terminal and replace it with a call to the CSVCreator class’s create_csv method. So the code will look like this.
from scraper import Scraper
from csv_creator import CSVCreator

if __name__ == "__main__":
  scraper = Scraper()

  datas = scraper.get_data()

  CSVCreator.create_csv(datas)
And if we run the main.py file, we can see that the shoes.csv file was created and contains data like this.
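If you want to double-check the file from Python rather than a spreadsheet, a small sketch like the following (just an illustrative check, not part of the app) reads shoes.csv back with csv.DictReader and prints the first few rows:

import csv

# Read the generated shoes.csv back and print the first three rows.
with open('shoes.csv', newline='') as f:
  reader = csv.DictReader(f)
  for i, row in enumerate(reader):
    if i >= 3:
      break
    print(row['name'], row['price'], row['city'])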

Present the Data as a Web Page

Well, we can already read text data like the name, city, and price directly, but it’s hard to really see the data (especially the product images) just from a CSV, right? There is an easy way to view all the data clearly: create a web page. Of course, in this article we will keep the page quite simple.
So in this section, we will try to create a web page to view the data that we already scraped from Tokopedia. We will use Flask for this, and you can install Flask by following this link.
Then let’s create the web page. For the page itself, we will use DataTables, because it lets us browse the data easily and gives us many features for free when we build the table with that library. It will be loaded from our HTML page directly. So let’s create an HTML file named view.html and fill it with this code.

<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

		<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    	<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
    
    	<title>Product Data</title>
    </head>
    <body>
		<nav class="navbar navbar-expand-md navbar-light bg-light">
        	<a class="navbar-brand" href="{{url_for('index')}}">Product Data</a>
        </nav>
        <div class="container">
			<table id="datas" class="display table" style="width:100%">
				<thead>
					<tr>
						<th>Name</th>
						<th>Price</th>
						<th>City</th>
						<th>Image</th>
					</tr>
				</thead>
				<tbody>
					{% for item in datas %}
						<tr>
							<td>{{item["name"]}}</td>
							<td>{{item["price"]}}</td>
							<td>{{item["city"]}}</td>
							<td>
								<img src="{{item['img']}}" alt="" srcset="">
							</td>
						</tr>
					{% endfor %}
				</tbody>
			</table>
		</div>

		<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
		<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
		<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
		<script>
			$(document).ready(function () {
				$('#datas').DataTable({
					pagingType: 'full_numbers',
				});
			});
		</script>
	</body>
</html>
Well, there are quite a lot of things happening here.
We import the CSS libraries we need, Bootstrap and DataTables, in these two lines.

<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
And we import the JavaScript libraries we need, jQuery, Bootstrap, and DataTables, in these lines.

<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
Then we create a table tag in the body in these lines. Aside from creating the table, we draw the table rows using the templating syntax of Jinja2, the template engine Flask uses, which relies on brackets like {% %} and {{ }}.

<table id="datas" class="display table" style="width:100%">
	<thead>
		<tr>
			<th>Name</th>
			<th>Price</th>
			<th>City</th>
			<th>Image</th>
		</tr>
	</thead>
	<tbody>
		{% for item in datas %}
			<tr>
				<td>{{item["name"]}}</td>
				<td>{{item["price"]}}</td>
				<td>{{item["city"]}}</td>
				<td>
					<img src="{{item['img']}}" alt="" srcset="">
				</td>
			</tr>
		{% endfor %}
	</tbody>
</table>
Then finally we turn the table into a DataTables table with this code.

<script>
$(document).ready(function () {
	$('#datas').DataTable({
		pagingType:'full_numbers',
	});
});
</script>
After creating the HTML page, we need to make sure the web page can be accessed once the scraping process is finished.
So let’s open the main.py file again and modify it to look like this.

from scraper import Scraper
from csv_creator import CSVCreator
from flask import Flask, render_template

scraper = Scraper()
datas = scraper.get_data()

app = Flask(__name__, template_folder=".")

@app.route('/')
def index():
  return render_template('view.html', datas=datas)

if __name__ == "__main__":
  CSVCreator.create_csv(datas)

  app.run()
There are a few changes compared to the code we wrote before.
I moved the scraper code outside the main section because I want the data to be accessible from Flask’s routing.
Then I add this line so we have a Flask app.

app = Flask(__name__, template_folder=".")
It will use the current directory as the directory to search for templates because of the template_folder="." parameter. This is needed because we did not create a templates folder, which is the default folder a Flask app searches for HTML files.
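As a small alternative, not used in this article, you could create a templates folder next to main.py, move view.html into it, and construct the app without that parameter:

# Alternative layout: view.html lives in ./templates, Flask's default template folder.
app = Flask(__name__)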
Then I define the routing and the function that will handle the routing with these lines.

@app.route('/')
def index():
  return render_template('view.html', datas=datas)
We can see that it will use view.html as the HTML file rendered to the browser, and it passes the data through the datas=datas argument.
The last step is to run the Flask app with this line.

app.run()
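By default this serves the app on http://127.0.0.1:5000. If you want the server to reload automatically while you tweak the template, an optional variant (my own addition, not part of the original code) is:

# Optional: enable Flask's debugger/auto-reloader and pin the port explicitly.
app.run(debug=True, port=5000)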
If we run the main.py file, once the scraping process is done we will get a CSV file, and we can see the Flask app running in the terminal.
And if we open the address shown in the terminal, we will see this page.

And that’s all from me. Of course, we can improve the app further by adding more features, grabbing more useful data, presenting it in a better format, and so on. This is the link to the repository if you want to see the code directly.

If you want to discuss something just contact me on my LinkedIn.
Thank you very much and goodbye.