To instruct the corpus() function to selectively include or exclude pages from the data corpus, use the include and exclude parameters.
These parameters accept arrays of URLs or regular expressions (RegEx) for matching documents.
Example
Assume you want to index a page that links to a PDF document, and you want to make sure the crawler always indexes that PDF. At the same time, you want to exclude auxiliary pages such as Jobs and Web Accessibility Statement. To do this, use the following script:
corpus({
    urls: ["https://catalog.manhattan.edu/undergraduate/"],
    include: [/.*\.pdf/],
    exclude: [
        "https://manhattan.edu/web-accessibility-statement",
        "https://inside.manhattan.edu/offices/human-resources/jobs"
    ],
    maxPages: 10,
    depth: 1
});
Include Parameter
Forces the crawler to index specific resources that match the pattern.
/.*\.pdf/          Include all PDF files
/.*\/docs\/.*/     Include all pages in /docs/
Exclude Parameter
Prevents the crawler from indexing specific resources that match the pattern.
https://example.com/jobs     Exclude specific URL
/.*\.zip/                    Exclude all ZIP files
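Before adding a pattern to the script, it can help to check what it matches in plain JavaScript. The sketch below assumes the crawler applies a regular expression the way RegExp.prototype.test does, which this page does not confirm. The sample URLs and the matchingPatterns helper are hypothetical.

```javascript
// Patterns taken from the tables above.
const patterns = {
  pdf: /.*\.pdf/,
  docs: /.*\/docs\/.*/,
  zip: /.*\.zip/,
};

// Hypothetical helper: returns the names of the patterns that match `url`.
// Assumption: matching behaves like RegExp.prototype.test.
function matchingPatterns(url) {
  return Object.keys(patterns).filter((name) => patterns[name].test(url));
}

// Made-up sample URLs, used only to illustrate the matching.
for (const url of [
  "https://example.com/docs/guide.pdf",
  "https://example.com/archive.zip",
  "https://example.com/jobs",
]) {
  console.log(url, matchingPatterns(url));
}
```

Note that these patterns are not anchored, so under this assumption /.*\.pdf/ also matches a URL such as https://example.com/file.pdf?download=1. Append $ to the pattern if only URLs that end in .pdf should match.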
Tip:
Use regular expressions for pattern matching and exact URLs for specific pages. The include parameter takes precedence over exclude when both match the same resource.
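The precedence rule can be pictured with a small sketch in plain JavaScript. This is not the crawler's actual implementation: matchesPattern and resolveRule are hypothetical helpers, the URLs are made up, and the sketch assumes that a string pattern must equal the URL exactly while a regular expression is applied with RegExp.prototype.test.

```javascript
// Hypothetical helper: a pattern is either an exact URL string or a RegExp.
function matchesPattern(url, pattern) {
  if (pattern instanceof RegExp) {
    pattern.lastIndex = 0; // keep global or sticky regexes stateless between calls
    return pattern.test(url);
  }
  return url === pattern;
}

// Hypothetical helper: decides how a URL is treated.
// include is checked first, so it wins when both lists match.
function resolveRule(url, { include = [], exclude = [] } = {}) {
  if (include.some((p) => matchesPattern(url, p))) return "include";
  if (exclude.some((p) => matchesPattern(url, p))) return "exclude";
  return "default"; // no rule applies; normal crawl settings decide
}

const rules = {
  include: [/.*\.pdf/],
  exclude: [/.*\/files\/.*/, "https://example.com/jobs"],
};

// Matches both lists, so include takes precedence.
console.log(resolveRule("https://example.com/files/guide.pdf", rules));
// Matches only the exclude list.
console.log(resolveRule("https://example.com/files/archive.zip", rules));
```

In this sketch, a URL that matches neither list is left to the crawler's other settings, such as maxPages and depth.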