Quick Summary:
This blog demonstrates how Creole Studios utilizes Node.js and the Playwright library to create a web scraper for extracting data from websites. The tutorial employs the NestJS framework and provides a step-by-step guide, including project setup, resource creation, and browser automation for scraping dynamic content. Key features include URL crawling, text extraction, data cleaning, and recursive functionality to gather data from all relevant pages. The blog concludes with instructions to make a POST request to trigger the scraper and retrieve data in JSON format, showcasing how to build a robust web scraper effectively.
Introduction:
At Creole Studios, we’re always excited to share insights and practical demonstrations that can help developers enhance their skills. As a leading Node.js development company, we take pride in building efficient and scalable solutions. In this blog, we’ll walk you through creating a web scraper using Node.js, a task that allows you to extract data from websites efficiently.
A web scraper is essentially an algorithm designed to extract data from websites. To follow along with this demo, a basic understanding of Node.js and JavaScript is recommended. For this demonstration, we’ll be utilizing Playwright, a powerful browser automation library. Experienced developers can skip the foundational steps and jump directly to step 8 for the core logic. Let’s dive in!
Step-by-Step Guide:
1. Install the NestJS CLI Tool
First, install the NestJS CLI tool globally with the following command:
npm i -g @nestjs/cli
2. Create a New Project
Create a new project with your desired name using the command below:
nest new nodejs-web-scraper
The CLI will scaffold the project files and install the dependencies, and you’ll see a confirmation once the project has been created successfully.
3. Navigate to the Project Folder
Move to the project directory and run the project using these commands:
cd nodejs-web-scraper
npm run start:dev
You’ll see the project running in the terminal window.
4. Open the Project in a Code Editor
Open the project in your preferred code editor (we recommend VS Code). The folder structure will look like this:
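(A sketch of the default layout; the exact files may vary slightly across NestJS CLI versions.)
nodejs-web-scraper/
├── src/
│   ├── app.controller.spec.ts
│   ├── app.controller.ts
│   ├── app.module.ts
│   ├── app.service.ts
│   └── main.ts
├── test/
├── nest-cli.json
├── package.json
└── tsconfig.json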
5. Add a Resource Named Scraper
Use the following command to generate a new resource named scraper:
nest g resource scraper
You should see the generated files listed in the terminal after you run the command.
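Roughly, the command scaffolds a new scraper module under src/ and registers it in app.module.ts. The sketch below assumes the REST API transport with CRUD entry points; spec files and exact names may vary by CLI version:
src/scraper/
├── dto/
│   ├── create-scraper.dto.ts
│   └── update-scraper.dto.ts
├── entities/
│   └── scraper.entity.ts
├── scraper.controller.ts
├── scraper.module.ts
└── scraper.service.ts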
6. Install the Playwright Library
The core logic of scraping a site is to fetch the HTML of the page you want to scrape and then extract the required data from it. Sometimes parts of the HTML are rendered dynamically (for example, pagination or tabs), so we need to trigger those interactions by clicking the relevant elements. In other words, we want to open a browser and automate it. Popular browser automation libraries include Selenium, Puppeteer, and Playwright.
To automate browser interactions, install the Playwright library:
npm i playwright
7. Install Browsers for Playwright
Next, install the browsers that Playwright will use with the command:
npx playwright install
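With the browsers installed, you can optionally verify the setup with a small standalone script before wiring anything into NestJS. This is just a minimal sketch; the file name, URL, and selector are placeholders, and you can run it with npx ts-node check-playwright.ts if ts-node is available in your project.
//check-playwright.ts (quick standalone sanity check, not part of the NestJS app)
import { chromium } from 'playwright';

async function main() {
  // Launch a visible browser so you can watch it work
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

  // Visit a page and wait for network activity to settle
  await page.goto('https://example.com', { waitUntil: 'networkidle' });

  // Read the visible text of the page body
  const text = await page.innerText('body');
  console.log(text.slice(0, 200));

  // If content is hidden behind tabs, pagination, or "load more" buttons,
  // trigger the interaction before reading the text, e.g.:
  // await page.click('text=Load more');

  await browser.close();
}

main().catch(console.error);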
8. Implement Core Scraping Logic
Now, let’s create a function in scraper.service.ts to launch a browser, visit the provided URL, and recursively extract text content from all pages starting with the given base URL. Here’s the code snippet:
//scraper.service.ts
import { Injectable } from '@nestjs/common';
import { Page, chromium } from 'playwright';

@Injectable()
export class ScraperService {
  // Main scraping function, called from the controller.
  async scrape(baseUrl: string) {
    // Launch a browser; headless is set to false so you can watch it in action
    const browser = await chromium.launch({ headless: false });
    // Create a new browser context and a new page
    const context = await browser.newContext();
    const page = await context.newPage();

    // Maintain a set of visited URLs and a queue of pending URLs
    const visitedUrls = new Set<string>();
    const pendingUrls = [baseUrl];
    // Maintain an array of scraped data
    const scrapedData: { url: string; cleanedText: string }[] = [];

    // Loop through the pending URLs and scrape the data
    while (pendingUrls.length > 0) {
      const url = pendingUrls.shift();
      if (!url || visitedUrls.has(url)) continue;
      console.log('crawling', url);
      visitedUrls.add(url);
      try {
        await page.goto(url, {
          waitUntil: 'networkidle',
          timeout: 360000,
        });
        const textContent = await page.innerText('body', { timeout: 360000 });
        const cleanedText = this.cleanUpTextContent(textContent);
        scrapedData.push({ url, cleanedText });
        console.log('scrapedData length', scrapedData.length);
        // Collect new URLs from the current page and queue the unvisited ones
        const newUrls = await this.extractUrls(page, baseUrl);
        newUrls.forEach((newUrl) => {
          if (!visitedUrls.has(newUrl)) {
            pendingUrls.push(newUrl);
          }
        });
      } catch (error) {
        console.error(`Error loading ${url}:`, error);
      }
    }

    // Close the page, context, and browser
    await page.close();
    await context.close();
    await browser.close();
    console.log('scrapedData length', scrapedData.length);
    return scrapedData;
  }

  cleanUpTextContent(text: string): string {
    // Replace runs of whitespace with a single space and trim the result
    const cleanedText = text.replace(/\s+/g, ' ').trim();
    return cleanedText;
  }

  async extractUrls(page: Page, baseUrl: string): Promise<string[]> {
    const hrefs = await page.$$eval(
      'a',
      (links, baseUrl) => {
        // Add or remove the 'www' subdomain so extracted URLs match baseUrl
        const adjustWwwSubdomain = (url: string, baseUrl: string) => {
          const urlObj = new URL(url);
          const baseObj = new URL(baseUrl);
          if (baseObj.hostname.startsWith('www.')) {
            // If baseUrl has a 'www' subdomain, ensure 'www' in extracted URLs
            if (!urlObj.hostname.startsWith('www.')) {
              urlObj.hostname = 'www.' + urlObj.hostname;
            }
          } else {
            // If baseUrl doesn't have a 'www' subdomain, remove it from extracted URLs
            urlObj.hostname = urlObj.hostname.replace(/^www\./, '');
          }
          return urlObj.href;
        };
        return links.map((link) => {
          try {
            let href = link.href;
            // Ignore empty hrefs, hash-only hrefs, and javascript: links
            if (!href || href === '#' || href.startsWith('javascript:')) {
              return null;
            }
            // Handle protocol-relative URLs first, since they also start with '/'
            if (href.startsWith('//')) {
              const protocol = baseUrl.startsWith('https://')
                ? 'https:'
                : 'http:';
              href = protocol + href;
            } else if (href.startsWith('/')) {
              // Convert relative URLs to absolute URLs
              const protocol = baseUrl.startsWith('https://')
                ? 'https://'
                : 'http://';
              href = protocol + new URL(href, baseUrl).hostname + href;
            }
            // Drop the fragment when the last path segment starts with '#'
            const fragment = href.split('/').pop().startsWith('#');
            if (fragment) {
              const arr = href.split('#');
              href = arr[0];
            }
            // Skip links whose last segment contains an in-page anchor
            const includesHash =
              !href.split('/').pop().startsWith('#') &&
              href.split('/').pop().includes('#');
            if (includesHash) {
              return null;
            }
            // Ensure 'www' subdomain consistency
            href = adjustWwwSubdomain(href, baseUrl);
            return href;
          } catch (error) {
            console.log('Error extracting URL:', error);
            return null; // Ignore invalid URLs
          }
        });
      },
      baseUrl,
    );
    // Keep only links that belong to the site being crawled
    const filteredUrls = hrefs.filter((href) => {
      return href !== null && href.startsWith(baseUrl);
    });
    return filteredUrls;
  }
}
9. Create a Controller
Define a controller to handle API calls for scraping. Here’s the scraper.controller.ts file:
//scraper.controller.ts
import { Body, Controller, Post } from '@nestjs/common';
import { ScraperService } from './scraper.service';
import { ScraperDto } from './dto/scrape.dto';

@Controller()
export class ScraperController {
  constructor(private readonly scraperService: ScraperService) {}

  // POST /scraper: accepts a URL in the body and returns the scraped data
  @Post('scraper')
  async scrape(@Body() scraperDto: ScraperDto) {
    return this.scraperService.scrape(scraperDto.url);
  }
}
10. Create a DTO for Input Validation
Here’s the scrape.dto.ts file (placed in the dto folder, matching the import in the controller) defining the input structure:
//scrape.dto.ts
export class ScraperDto {
  url: string;
}
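The DTO above only defines the shape of the request body. If you also want runtime validation, one option (an optional addition, assuming the class-validator and class-transformer packages are installed and a global ValidationPipe is enabled in main.ts via app.useGlobalPipes(new ValidationPipe())) is to decorate the field:
//scrape.dto.ts (optional validation sketch)
import { IsUrl } from 'class-validator';

export class ScraperDto {
  // Rejects requests whose body does not contain a valid absolute URL
  @IsUrl({ require_protocol: true })
  url: string;
}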
11. Make a POST Request
Now, make a POST request to the /scraper endpoint with the required URL.
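For example, with curl (assuming the app is running locally on NestJS’s default port 3000; the URL in the body is just an example):
curl -X POST http://localhost:3000/scraper \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.creolestudios.com"}'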
12. Watch the Browser in Action
The browser will visit and extract text from all pages starting with the provided URL, e.g., https://www.creolestudios.com (include the protocol, since the scraper navigates to the URL and compares full URLs against the base URL).
13. Receive Data in JSON Format
The scraped data will be returned in JSON format, containing the URLs and the extracted content.
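For example, the response body will look something like this (illustrative values only):
[
  {
    "url": "https://www.creolestudios.com/",
    "cleanedText": "Home page text after cleanup ..."
  },
  {
    "url": "https://www.creolestudios.com/contact-us/",
    "cleanedText": "Contact page text after cleanup ..."
  }
]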
Conclusion:
Web scraping is a powerful technique to extract valuable data for analysis, business insights, and automation. With tools like Node.js, NestJS, and Playwright, it’s easier than ever to build scalable and efficient scrapers. At Creole Studios, we leverage modern technologies and frameworks to create tailored solutions for complex business challenges. If you’re looking to hire expert Node.js developers for web scraping or automation services, feel free to reach out to us for expert guidance and development solutions!