Zenera | Powering AI-Native Transformation

To instruct the corpus() function to selectively include or exclude pages from the data corpus, use the include and exclude parameters.

These parameters accept arrays of URLs or regular expressions (RegEx) for matching documents.

Example

Assume you want to index a page that links to a PDF document, and you want to make sure the PDF is always crawled. At the same time, you want to exclude auxiliary pages such as Jobs and Web Accessibility Statement. To do this, use the following script:

corpus({
    urls: ["https://catalog.manhattan.edu/undergraduate/"],
    // Always index resources whose URL matches the PDF pattern
    include: [/.*\.pdf/],
    // Never index these auxiliary pages
    exclude: [
        "https://manhattan.edu/web-accessibility-statement",
        "https://inside.manhattan.edu/offices/human-resources/jobs"
    ],
    maxPages: 10,
    depth: 1
});

Include Parameter

Forces the crawler to index specific resources that match the pattern.

/.*\.pdf/           Include all PDF files
/.*\/docs\/.*/      Include all pages in /docs/

Exclude Parameter

Prevents the crawler from indexing specific resources that match the pattern.

https://example.com/jobs    Exclude specific URL
/.*\.zip/                   Exclude all ZIP files
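
Before passing patterns to corpus(), it can help to check them against a few sample URLs. The sketch below is a standalone check in plain JavaScript. The matchingPatterns helper and the sample URLs are hypothetical and are not part of the corpus() API. It assumes patterns are applied as unanchored partial matches, as RegExp.prototype.test does; the crawler's exact matching mode is not stated here.

```javascript
// Patterns taken from the examples in this section.
const patterns = {
    pdf: /.*\.pdf/,
    docs: /.*\/docs\/.*/,
    zip: /.*\.zip/,
};

// Hypothetical helper: returns the names of all patterns that match a URL.
// Assumes partial (unanchored) matching via RegExp.prototype.test.
function matchingPatterns(url) {
    return Object.entries(patterns)
        .filter(([, re]) => re.test(url))
        .map(([name]) => name);
}

// Illustrative sample URLs, not real pages.
const samples = [
    "https://example.com/files/guide.pdf",
    "https://example.com/docs/intro",
    "https://example.com/archive.zip",
    "https://example.com/guide.pdf.html",
];

for (const url of samples) {
    console.log(url, "->", matchingPatterns(url).join(", ") || "no match");
}
```

Under this assumption, the unanchored /.*\.pdf/ also matches https://example.com/guide.pdf.html. If only URLs that end in .pdf should match, an anchored pattern such as /\.pdf$/ is a stricter alternative.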

Tip:

Use regular expressions for pattern matching and exact URLs for specific pages. The include parameter takes precedence over exclude when both match the same resource.
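
The precedence rule in the tip can be sketched as a small decision function. This is a minimal illustration, not the crawler's actual implementation. The matchesPattern and decide helpers are hypothetical names, and the sketch assumes that string entries must equal the URL exactly while regular expressions are tested against the full URL string.

```javascript
// Hypothetical helper: a string pattern must equal the URL exactly,
// a RegExp pattern is tested against the URL.
function matchesPattern(url, pattern) {
    return pattern instanceof RegExp ? pattern.test(url) : pattern === url;
}

// Hypothetical helper: returns "include", "exclude", or "default".
// include is checked first, so it wins when both lists match.
function decide(url, { include = [], exclude = [] } = {}) {
    if (include.some((p) => matchesPattern(url, p))) return "include";
    if (exclude.some((p) => matchesPattern(url, p))) return "exclude";
    return "default"; // no rule matched; normal crawl settings apply
}

// Illustrative rules and URLs, not real pages.
const rules = {
    include: [/.*\.pdf/],
    exclude: ["https://example.com/jobs", /.*\/archive\/.*/],
};

console.log(decide("https://example.com/archive/report.pdf", rules)); // matched by both lists
console.log(decide("https://example.com/jobs", rules));
console.log(decide("https://example.com/about", rules));
```

In this sketch, a PDF under /archive/ matches both lists and is still reported as included, which mirrors the stated rule that include takes precedence over exclude.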