In some situations, you may need to crawl websites with complex structures or interactions. To make sure you capture all necessary data, you can use the Puppeteer crawler.
Puppeteer works by programmatically interacting with websites during the crawling process. It can simulate user interactions such as clicks and scrolling and wait for content to fully load, which helps ensure that all required content is captured, even from dynamic or interactive pages.
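To make this concrete, here is a minimal standalone Puppeteer sketch, separate from the crawler API described below, showing the kind of interaction involved; the URL and selectors are placeholders, not part of any documented site.
// A minimal sketch of the interactions Puppeteer automates
// (plain Puppeteer, outside the crawler API; URL and selectors are placeholders).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity settles so dynamically loaded content is present
  await page.goto('https://example.com/articles', {waitUntil: 'networkidle2'});

  // Simulate a user interaction, such as clicking a "load more" button
  await page.click('button.load-more');
  await page.waitForSelector('div.article-body');

  // Extract the content that appeared after the interaction
  const text = await page.$eval('div.article-body', el => el.innerText);

  await browser.close();
})();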
Puppeteer can be used to crawl various types of content, including:
- Specific page sections: extract the main content while excluding unrelated elements.
- Dynamically loaded content: capture data from websites that load content through JavaScript after the initial page load.
- Paginated or filtered results: crawl knowledge base articles, forums or portals that use dynamic filtering or pagination.
- Versioned pages: access content from websites with versioned data.
To use the Puppeteer crawler, define it in the crawler parameter of the corpus() function and specify a function that crawls the required content.
You can:
- Use the default crawler API to exclude certain selectors
- Create custom crawling functions for more specialized use cases
Default crawler
The default crawler API allows you to wait for the page content to fully load and extract specific sections, such as the main article content, while excluding unrelated elements like the navigation menu, ads or footers.
To use the default crawler API, pass api.defaultCrawler({}) in the crawler parameter of the corpus() function and define the crawling parameters:
corpus({
  title: `HTTP corpus`,
  urls: [
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`
  ],
  crawler: {
    puppeteer: api.defaultCrawler({
      // Wait 1000 ms after page load so dynamic content can render
      waitAfterLoad: 1000,
      // Elements matching these selectors are excluded from the extracted content
      excludeSelectors: [
        `div.sticky-header-container`,
        `aside.sidebar`,
        `aside.toc`,
        `aside.article-footer`,
        `div.bottom-banner-container`,
        `footer.page-footer`
      ]
    }),
  },
  depth: 3,
  maxPages: 3,
  priority: 1
});
Custom crawling functions
For more specialized cases, such as handling unique page structures or extracting specific types of content, you can implement custom crawling functions with Puppeteer. This approach allows you to achieve precise control over how data is gathered and processed.
corpus({
  title: `Knowledge Base`,
  urls: [`urls to crawl`],
  crawler: {
    // Reference to the custom crawler function defined below
    puppeteer: crawlPages,
    browserLog: 'on',
    args: {arg1: 'value1', arg2: 'value2'},
  },
  depth: 10,
  maxPages: 10,
  priority: 1
});
async function* crawlPages({url, page, document, args}) {
  // crawlPages function code ...
}
Crawler Parameters:
- puppeteer: function used to crawl data
- args: arguments to be passed to the crawler function
- browserLog: set to 'on' to print logs from the browser interaction process, or 'off' to disable logging
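As an illustration, the sketch below shows one way such a crawler function could be written for paginated results, one of the use cases listed above. It is only a sketch: the article and a.pagination-next selectors, the maxPaginationPages argument, and the shape of the yielded objects are assumptions for the example, not part of the documented API; consult your platform's crawler contract for the fields it expects.
// A hypothetical crawlPages implementation that walks paginated results.
// Selectors, the maxPaginationPages argument, and the yielded object shape
// are illustrative assumptions.
async function* crawlPages({url, page, document, args}) {
  await page.goto(url, {waitUntil: 'networkidle2'});

  let pageNumber = 1;
  while (pageNumber <= (args.maxPaginationPages ?? 5)) {
    // Extract the main article text from the current page
    const content = await page.$eval('article', el => el.innerText);
    yield {url: page.url(), content}; // assumed yield shape

    // Follow the "next page" link if one exists; stop otherwise
    const next = await page.$('a.pagination-next');
    if (!next) break;
    await Promise.all([
      page.waitForNavigation({waitUntil: 'networkidle2'}),
      next.click()
    ]);
    pageNumber++;
  }
}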