Featured

Web Scarping with JavaScript in Steam Website

     Before this, I already wrote an article about web scraping before. But for this article, I will write about web scraping with JavaScript for Steam Website.

Getting the Data

    As I said before, the first step of web scraping is to decide what data we want to get from the target.
    So we will see the website first. Let's open the https://store.steampowered.com website first. Then let's check the game with the Adventure RPG category.




        And we will get a page like this.

    
    And we can choose the top-rated category to see the top-rated games first.
   
    

    And we will see this request in the Network tab when we scroll down.


    And for the response, we can see the response contains the HTML that is used for the list of games.


    So we already know where we can get the data. The next step is creating code to scrape the data from the website. There are three main libraries that we will use, that is axios for creating requests to the server, fast-csv for creating the CSV file, and node-html-parser to parse the HTML texts.

    So let's create a project first with npm init:


    Then install the required library with this command:


npm install axios fast-csv node-html-parser

    Create a file named main.js for the project. Then insert these codes into the file:


const axios = require('axios');
const fastcsv = require("fast-csv");
const fs = require("fs");
var HTMLParser = require('node-html-parser');
    Aside from the other library that I mentioned before, I also add library fs for saving the data to a file.

    Then let's create a function named getData for getting the data from the website.


async function getData() {
}


    After that, we can add some variables to help us form the algorithm to the getData function:


async function getData() {
	let start = 0;
	const count = 15;
	const datas = [];
}

    
     In this code, we define the start that for saving the counter for how much the data we already received, count for knowing how much data we will get for each request, and datas variable for saving the data that we received.

    Then let's add an iteration for requesting the data continuously to the getData function:


async function getData() {
	.
        .
        .
	while (start < 1000) {
		start += count;
	}
}

    
    We want to get around 1000 data so we doing the iteration 1000 times. Aside from that we also update the start variable so we do not get to an infinite loop.

    Copy the URL from the request before and use axios to get the data from the website with these codes:


async function getData() {
    .
    .
    .
    while (start < 1000) {
        await axios
	.get(`https://store.steampowered.com/contenthub/querypaginated/category/TopRated/render/?query=&start=${start}&count=${count}&cc=ID&l=english&v=4&tag=&category=adventure_rpg`)
	.then((response) => {
	})
	.catch((error) => {
	    console.error(error)
	});
        .
        .
        .
    }
}

    In this code, we use the start and count variables before the URL, aside from that we also print the error message when something goes wrong.

    Because the request is shaped in an HTML string, then we can see the class for the HTML tag directly on the page.


    After that, let's add this code to get the DOM for the results_html before:


async function getData() {
.
.
.
while (start < 1000) {
await axios
.get(`https://store.steampowered.com/contenthub/querypaginated/category/TopRated/render/?query=&start=${start}&count=${count}&cc=ID&l=english&v=4&tag=&category=adventure_rpg`)
.then((response) => {
const html = "<div>" + response.data.results_html + "</div>";
const dom = HTMLParser.parse(html);
const games = dom.querySelectorAll(".tab_item");
games.forEach(game => {
});
})
.catch((error) => {
console.error(error)
});
.
.
.
}
}

    From the response to the request, we get the results_html field and wrap it into a div tag. Then we parse it with the HTMLParser object from the node-html-parser library before. And after that, we get all the elements with tab_item the class that contains the item for the games. And finally, we iterate through all the games for getting each data of the games.

    Finally, let's get all the data from each of the games with these codes:


async function getData() {
	.
        .
        .
	while (start < 1000) {
		await axios
		.get(`https://store.steampowered.com/contenthub/querypaginated/category/TopRated/render/?query=&start=${start}&count=${count}&cc=ID&l=english&v=4&tag=&category=adventure_rpg`)
		.then((response) => {
			const html = "<div>" + response.data.results_html + "</div>";
			const dom = HTMLParser.parse(html);
			const games = dom.querySelectorAll(".tab_item");

			games.forEach(game => {
				const gameName = game.querySelector(".tab_item_name").textContent.trim();
				const gameLink = game.attributes.href.trim();
				const gameImg = game.querySelector(".tab_item_cap_img").attributes.src.trim();
				let gamePrice = "";
				
				if (game.querySelector(".discount_final_price")) {
					gamePrice = game.querySelector(".discount_final_price").text.trim();
				}

				let tags = game.querySelector(".tab_item_top_tags").childNodes;
				let gameTags = "";

				tags.forEach(tag => {
					gameTags += tag.text;
				});

				datas.push({
					Name: gameName,
					Link: gameLink,
					"Image Link": gameImg,
					Price: gamePrice,
					Tags: gameTags,
				});
			});
		})
		.catch((error) => {
			console.error(error)
		});
        .
        .
        .
	}
    
    return datas;
}


    We will get the game's name, link, image, price, and tag data. So like the HTML document that we see before, we can get the game's name from the element with tab_item_name class. Then get the game's link from the game object itself because it already represents an a tag. We can get the game image's link from the tag with tab_item_cap_img class and get the src attributes. And for the price itself, it can be existed or not so I check the discount_final_price class first if it exists, if not then we just ignore it. And for the game's tags, we can iterate through tags with tab_item_top_tags class and append each of them to one variable. Finally, we can wrap it up and put it to the datas variable and return the datas variable.

    And we can check it by creating and using a function named main like this:


async function main() {
	const datas = await getData();

	console.log(datas);
}

main();


    If we run the main.js file with node main.js command, we can see the result like this:


Putting the Data to CSV File

    After getting the data from the website, let's put the data into a CSV file so we can see the data anytime we want.
    
    Let's create a function named generateCSV to generate the CSV that we needed.


function generateCSV(datas) {
}

    And we can add a code for open a stream to write the data into a file like this:


function generateCSV(datas) {
	const writeStream = fs.createWriteStream("data.csv");
}


    And then, let's add the codes for writing data to the stream like this:


function generateCSV(datas) {
	.
    .
    .
    fastcsv
	.write(datas, { headers: true })
	.on("finish", function() {
		console.log("Write to CSV successfully!");
	})
	.pipe(writeStream);
}


    And if it is successfully written, then we can see the Write to CSV successfully! message on the screen.

    Finally, let's wrap it up and add the function call to the main function.


async function main() {
	const datas = await getData();

	generateCSV(datas);
}


    If we run the main.js file, then the terminal will be like this:

    
    And we can see the content of the file already contains the data that we want.



    And that's all from me. Of course, we can improve the app more by adding some features, grabbing more useful data, presenting it in a better format, etc. And this is the link to the repository if you want to see the code directly.

  If you want to discuss something just contact me on my LinkedIn.

  Thank you very much and goodbye.


Comments

Popular Posts