
Web Scraping with Python in Indonesian E-Commerce (Tokopedia)

     In 2006, British mathematician and Tesco marketing mastermind Clive Humby declared, "Data is the new oil". And today we can see that data has become a powerful force shaping the direction of the world. It can decide the next action a business needs to take, increase sales by recommending products that match a customer's taste, power Artificial Intelligence that reduces human work, and more.

    And in this article, we will study how to get data from an existing website, an activity usually called web scraping. As a case study, we will use Tokopedia, an Indonesian e-commerce platform.

Getting the Data

    The first step of web scraping is to decide what data we want to get. In this case, I want to get shoe (sepatu) data, sorted by review (ulasan).

    Then let's look at the website. A web page is built with a markup language named HTML, and we can get the data we need by searching through the page's HTML thoroughly.
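    To illustrate how data hides inside HTML, here is a minimal sketch using Python's built-in html.parser to pull the text out of an element carrying a given class. The markup and the product-name class are invented for illustration; Tokopedia's real class names come later in this article.

```python
from html.parser import HTMLParser

# A tiny parser that collects the text of every tag carrying a given class.
class ClassTextParser(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capture = False   # True while inside a matching tag
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get('class', '').split()
        if self.target_class in classes:
            self.capture = True

    def handle_data(self, data):
        if self.capture:
            self.results.append(data.strip())
            self.capture = False

parser = ClassTextParser('product-name')  # hypothetical class name
parser.feed('<div class="card"><span class="product-name">Sepatu Lari</span></div>')
print(parser.results)  # → ['Sepatu Lari']
```

    Libraries like Selenium (used below) do this kind of class-based lookup for us, but the principle is the same.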

    First, let's open Tokopedia's page at https://www.tokopedia.com.



    Then let's search for shoes on the search bar. Shoes in Indonesia are called sepatu, so I will insert the word "sepatu" into the search bar. 

    By default the results are sorted by relevance, so let's sort by review instead by changing the "Urutkan" dropdown to "Ulasan".


    Then let's see the HTML by using inspect elements and pointing to the product's card.
    
    We can see that the card has a class named css-y5gcsw. Inside the card, we can see some information about the product.


    I'm interested in the name, price, city, and image URL of each product, so let's look at the HTML elements that hold these data.


    And we can see that we can get the name from the css-1b6t4dn class, the price from the css-1ksb19c class, the city from the css-1kdc32b class, and the image from the css-1c345mg class.

    After studying the HTML of the page, let's create a script to get the data from it.
    
    Because Tokopedia uses a JavaScript framework to build its website, the product data is rendered in the browser, so we will use a browser automation library named Selenium to get the data from the HTML. But of course, you need to install the library first, and we need a browser too. You can follow the installation of Selenium at this link, and don't forget to use a Python virtual environment for this project. For the browser itself, we will use Firefox for the automation process.

    Then, let's create a file named scraper.py as a place for the Scraper class to reside.

    Let's create a class named Scraper that will have the responsibility of getting the data from the website. In this class, let's create a property named driver that will hold a Selenium WebDriver. The WebDriver is the object Selenium uses to create a session with the browser and communicate with it. So if the driver commands the browser to open a certain page, the page will be opened in the browser. To create a WebDriver connected to Firefox, we instantiate the Firefox class from the webdriver module.

from selenium import webdriver

class Scraper:

	def __init__(self):
		self.driver = webdriver.Firefox()  

    Then let's create a function named get_data() to get the data from the website. For this purpose, we need the URL of the search page. If we look at the website again, we can see the URL is https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu.
    
    Let's make the driver command the browser to open the URL by calling the driver.get("URL") function.

def get_data(self):
	self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
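    That query string looks intimidating, but we can decompose it with the standard library to see which parameters actually matter. My reading (not documented by Tokopedia) is that q carries the search term and ob=5 selects the review sort; the rest look like tracking fields.

```python
from urllib.parse import urlsplit, parse_qs

# Split the search URL and parse its query string into a dict of lists.
url = ('https://www.tokopedia.com/search?navsource=&ob=5'
       '&srp_component_id=02.01.00.00&st=product&q=sepatu')
params = parse_qs(urlsplit(url).query)  # empty values like navsource= are dropped

print(params['q'], params['ob'])  # → ['sepatu'] ['5']
```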

    Then let's create a counter for the product pages and a list to hold all the data.

counter_page = 0
datas = []

    We will scrape the data up to page 10. For each page, we will make the driver command the browser to scroll to the end of the page, because the page only loads products as we scroll through it. When I checked, the page was around 6500 pixels tall, so we will scroll 500 pixels at a time. Between scrolls, we wait 0.1 seconds so we don't put too much load on the server at once.

while counter_page < 10:
	for _ in range(0, 6500, 500):
		time.sleep(0.1)
		self.driver.execute_script("window.scrollBy(0,500)")
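    The loop above issues one scroll command per 500-pixel step; a quick check of what that step sequence implies:

```python
# The scroll offsets generated by range(0, 6500, 500): 0, 500, ..., 6000.
offsets = list(range(0, 6500, 500))

print(len(offsets))                 # → 13 scroll steps per page
print(round(len(offsets) * 0.1, 1))  # → 1.3 seconds of waiting per page
```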

    After the scrolling loop, we will get the card elements, iterate over all of them, extract the name, price, city, and image data, and finally append each record to the datas variable.

elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
for element in elements:
	img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
	name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
	price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
	city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

	datas.append({
		'img': img,
		'name': name,
		'price': price,
		'city': city
	})

    After we collect all the data on a page, we can go to the next page by making the driver click the next page button. If we check the HTML of the page, we can find that each page button has css-1ix4b60-unf-pagination-item as its class, and we can specify which button to click by using the counter variable.

counter_page += 1
next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
next_page.click()
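    To make the XPath above easier to read, here is the same string built with an f-string. With the counter at 1 (the first page is done), the next button's label is "2":

```python
counter_page = 1  # we just finished page 1, so the next button reads "2"

# Same XPath as in the scraper, matching the pagination button by class
# and by its visible page number.
xpath = (f"//button[@class='css-1ix4b60-unf-pagination-item' "
         f"and text()='{counter_page + 1}']")

print(xpath)
# → //button[@class='css-1ix4b60-unf-pagination-item' and text()='2']
```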

    And finally, return the data as the function's return value.

return datas

    You can check the complete code here.

from selenium.webdriver.common.by import By
from selenium import webdriver
import time

class Scraper:
  def __init__(self):
    self.driver = webdriver.Firefox()

  def get_data(self):
    self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')
    
    counter_page = 0
    datas = []

    while counter_page < 10:
      for _ in range(0, 6500, 500):
        time.sleep(0.1)
        self.driver.execute_script("window.scrollBy(0,500)")

      elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
      for element in elements:
        img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
        name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
        price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
        city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

        datas.append({
          'img': img,
          'name': name,
          'price': price,
          'city': city
        })

      counter_page += 1
      next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
      next_page.click()
    
    return datas

And then let's create a file named main.py to check the class's functionality. Fill the file with this code.

from scraper import Scraper

if __name__ == "__main__":
  scraper = Scraper()

  datas = scraper.get_data()
  index = 1

  for data in datas:
    print(
      index,
      data['name'],
      data['img'],
      data['price'],
      data['city']
    )

    index += 1

    If we run the file, Firefox will open and the browser will automatically navigate as the driver instructs in our code. Then we can see the result in the terminal.


    We can see that we get 700 product records from the shoe search pages.

    Next step, we will try to present the data in another format than printing to the terminal directly.

Present the Data as a CSV File

   Comma-separated values, usually called CSV, is a file format that uses a comma (,) to separate values. It is quite different from Excel files, which separate values into cells when opened with spreadsheet applications like Microsoft Excel or Google Sheets.
    
   The CSV file format is widely used in the tech industry nowadays. And I like it too because it is easy to tell the columns apart (unless your data values contain commas too) and you can open it the same way as an ordinary text format like txt.
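   That comma caveat is exactly what Python's csv module handles for us: values containing the delimiter are quoted automatically, so the columns stay unambiguous. A tiny sketch with an invented row:

```python
import csv
import io

# Write one row where the first value itself contains a comma.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['Sepatu Lari, Merah', 'Rp250.000'])

# The comma-bearing value comes out quoted, the plain one does not.
print(buf.getvalue().strip())  # → "Sepatu Lari, Merah",Rp250.000
```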

    So now we will convert the data we scraped earlier into the CSV file format.

    Let's create a file named csv_creator.py as a file where the class that generates CSV resides.

    Then let's create a class named CSVCreator that has a responsibility to create a CSV file from the data provided.

import csv

class CSVCreator:

    After that, we will create a static method in the class, so we don't need to create an object from the class every time we want to generate a CSV file. Let's call the method create_csv. This method will receive the data generated by the Scraper object.

@staticmethod
def create_csv(datas):

    As we declared before, the data generated by the Scraper will have dictionary form and take name, price, city, and img as their key. So let's define the key, it will be useful later.

fields = ["name", "price", "city", "img"]

    To write the data into a file on disk in Python 3, we need to open a file object (a TextIOWrapper) first. We can create it with this code.

with open('shoes.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows

    Then, let's use the csv library to write our data into the shoes.csv file. First, we need to create a DictWriter object from the csv library. This is needed because each of our records is a Python dictionary. We pass our fields variable to the DictWriter so it can recognize the keys of the data that need to be written to the file. We can create the object with this code.

writer = csv.DictWriter(f, fieldnames=fields)
    
    A CSV file usually has two parts, the header and the values; this is how we can tell what kind of value is in each column. With this code, we write the header and then all the data rows to the file.

writer.writeheader()
writer.writerows(datas)

    And that's all for our CSVCreator class. You can check the complete code here.

import csv

class CSVCreator:

  @staticmethod
  def create_csv(datas):
    fields = ["name", "price", "city", "img"]

    with open('shoes.csv', 'w', newline='') as f:
      writer = csv.DictWriter(f, fieldnames=fields)

      writer.writeheader()
      writer.writerows(datas)
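    As a quick sanity check of the DictWriter flow, here is the same logic writing a made-up row to an in-memory buffer instead of shoes.csv (the row is an invented example, not real scraped data):

```python
import csv
import io

fields = ["name", "price", "city", "img"]
rows = [{"name": "Sepatu A", "price": "Rp100.000",
         "city": "Jakarta", "img": "http://example.com/a.jpg"}]

# StringIO stands in for the file, so we can inspect the output directly.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().strip())
# → name,price,city,img
#   Sepatu A,Rp100.000,Jakarta,http://example.com/a.jpg
```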

    In the main.py file, we can delete the part that prints all the data to the terminal and replace it with a call to CSVCreator's create_csv method. The code will look like this.

from scraper import Scraper
from csv_creator import CSVCreator

if __name__ == "__main__":
  scraper = Scraper()

  datas = scraper.get_data()

  CSVCreator.create_csv(datas)

    And if we run the main.py file, we can see that the shoes.csv file was created and contains data like this.


Present the Data as a Web Page

   Well, we can read text data like the name, city, and price directly from the CSV. But we can't view the images from it, right? There's an easy way to see all the data clearly: creating a web page. Of course, in this article we will keep the page quite simple.
    
  

   So in this section, we will try to create a web page to view the data we scraped from Tokopedia. We will use Flask, which we can install by following this link.

    Then let's create the web page. For the table itself, we will use DataTables, because it lets us browse the data easily and gives us many features for free. It will be loaded from our HTML page directly. So let's create an HTML file named view.html and fill it with this code.


<!doctype html>
<html lang="en">
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

		<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    	<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">
    
    	<title>Product Data</title>
    </head>
    <body>
		<nav class="navbar navbar-expand-md navbar-light bg-light">
        	<a class="navbar-brand" href="{{url_for('index')}}">Product Data</a>
        </nav>
        <div class="container">
			<table id="datas" class="display table"style="width:100%">
				<thead>
					<tr>
						<th>Name</th>
						<th>Price</th>
						<th>City</th>
						<th>Image</th>
					</tr>
				</thead>
				<tbody>
					{% for item in datas %}
						<tr>
							<td>{{item["name"]}}</td>
							<td>{{item["price"]}}</td>
							<td>{{item["city"]}}</td>
							<td>
								<img src="{{item['img']}}" alt="" srcset="">
							</td>
						</tr>
					{% endfor %}
				</tbody>
			</table>
		</div>

		<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
		<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
		<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>
		<script>
			$(document).ready(function () {
				$('#datas').DataTable({
					pagingType: 'full_numbers',
				});
			});
		</script>
	</body>
</html>


    Well, there are quite a few things happening here, so let's break them down.

    We import the CSS libraries, Bootstrap and DataTables, in these two lines.


<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.datatables.net/1.12.1/css/jquery.dataTables.min.css">

    
    And we import the JavaScript libraries, jQuery, Bootstrap, and DataTables, in these lines.


<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="https://cdn.datatables.net/1.12.1/js/jquery.dataTables.min.js"></script>

    
    Then we create a table tag in the body in these lines. Aside from creating the table, we draw its rows using the Jinja templating syntax that Flask uses, with brackets like {% %} and {{ }}.


<table id="datas" class="display table" style="width:100%">
	<thead>
		<tr>
			<th>Name</th>
			<th>Price</th>
			<th>City</th>
			<th>Image</th>
			</tr>
		</thead>
		<tbody>
			{% for item in datas %}
				<tr>
					<td>{{item["name"]}}</td>
					<td>{{item["price"]}}</td>
					<td>{{item["city"]}}</td>
					<td>
						<img src="{{item['img']}}" alt="" srcset="">
					</td>
				</tr>
			{% endfor %}
		</tbody>
	</table>


    Then finally we turn the table into a DataTables table with this code.


<script>
$(document).ready(function () {
	$('#datas').DataTable({
		pagingType:'full_numbers',
	});
});
</script>


    After creating the HTML page, we need to make the web page accessible once the scraping process is finished.

    So we will open the main.py file again and modify it to be like this.


from scraper import Scraper
from csv_creator import CSVCreator
from flask import Flask, render_template

scraper = Scraper()
datas = scraper.get_data()

app = Flask(__name__, template_folder=".")

@app.route('/')
def index():
  return render_template('view.html', datas=datas)

if __name__ == "__main__":
  CSVCreator.create_csv(datas)

  app.run()


    There are some changes I made compared to the previous main.py.

    I moved the scraper code outside the main section because I want the data to be accessible from Flask's routing.

    Then I add this line so we have a Flask app.


app = Flask(__name__, template_folder=".")


 It will use the current directory as the directory to search for templates because of the template_folder="." parameter. This is needed because we did not create a templates folder, which is the default folder a Flask app searches for HTML files.

   Then I define the route and the function that handles it with these lines.


@app.route('/')
def index():
  return render_template('view.html', datas=datas)


   We can see that it renders view.html to the browser and passes the data with the datas=datas argument.

    And the last step is to run the Flask app with this line.


app.run()


  If we run the main.py file, once the scraping process is done we will get a CSV file, and we can see the Flask app running in the terminal.


  And if we open the address shown in the terminal, we will see this page.




  And that's all from me. Of course, we can improve the app further by adding features, grabbing more useful data, presenting it in a better format, etc. And this is the link to the repository if you want to see the code directly.

  If you want to discuss something just contact me on my LinkedIn.

  Thank you very much and goodbye.
