When developing an AI assistant that works with multiple data corpuses, it's often necessary to filter the data based on criteria like user roles, product versions, or individual preferences.
By segmenting the content this way, you ensure that users only access relevant information, enhancing the AI assistant's functionality and efficiency.
To filter corpuses, you can use the visual state. You can create filters for each data segment, such as user roles or product versions, and define the corpus() function within these filters.
Example: Filtering static corpuses
Assume you want to allow users to access GitHub documentation based on their role: user or admin. To do this, you can implement the following script:
const user = visual(state => state.role === "user");
const admin = visual(state => state.role === "admin");
user(() => {
corpus({
title: `User docs`,
urls: [`https://docs.github.com/en`],
maxPages: 10,
maxDepth: 3,
});
});
admin(() => {
corpus({
title: `Admin docs`,
urls: [`https://docs.github.com/en/enterprise-cloud@latest/admin`],
maxPages: 10,
maxDepth: 3,
});
});To test it, in Alan AI Studio, set the visual state to "user":"admin" and ask a question: How do I create an account?. The AI assistant will provide an answer from the Admin docs corpus.
Example: Filtering corpuses with Puppeteer
Assume you want to crawl Python documentation for two versions: 3.11 and 3.12. You can implement the following script using the Puppeteer crawler and visual state:
async function* crawlDocs({url, page, document, args}) {
let {version} = args;
if (url.includes(`docs.python.org`)) {
await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
const html = await page.evaluate(() => {
const selector = 'div.body';
return document.querySelector(selector).innerHTML;
});
let urls = await page.evaluate(() => {
const anchorElements = document.querySelectorAll('a');
return Array.from(anchorElements).map(anchor => anchor.href);
});
let content = await api.html2md_v2({html});
yield {content, mimeType: 'text/markdown'};
urls = urls.filter(f=> f.includes(version));
yield {urls};
}
}
const v311 = visual(state => state.version === "3.11");
const v312 = visual(state => state.version === "3.12");
v311(() => {
corpus({
title: `Python docs 3.11`,
urls: [`https://docs.python.org/3/`],
crawler: {
args: {version: '3.11'},
puppeteer: crawlDocs,
browserLog: 'on',
},
maxPages: 10,
maxDepth: 100,
});
});
v312(() => {
corpus({
title: `Python docs 3.12`,
urls: [`https://docs.python.org/3/`],
crawler: {
args: {version: '3.12'},
puppeteer: crawlDocs,
browserLog: 'on',
},
maxPages: 10,
maxDepth: 100,
});
});Use Cases:
- Role-based access control (user, admin, manager)
- Product version filtering (v1.0, v2.0, latest)
- Language-specific documentation
- Region-specific content
- Feature flag-based content access