Sources
With LoyJoy, you can add your own data as external sources, which the GPT large language model then uses to generate answers to user questions. Adding a source is straightforward: enter the URL of the source into the input field. The system automatically recognizes the file type associated with the URL.
Index Your Website
In addition to specific file types, you can also index entire websites by inputting their URLs. When you enter a URL, LoyJoy will first look for a sitemap associated with it. A sitemap is a file that indicates which pages belong to a website and should be crawled. If we find a sitemap, we will tell you how many pages will be crawled.
If no sitemap is found, you can still crawl the website. In that case, the crawler collects the links on the homepage, then the links on those linked pages, and so on. This approach takes more time and is limited to at most 800 crawled pages.
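The link-following approach described above can be sketched as a breadth-first traversal with a page limit. This is only an illustration under stated assumptions, not LoyJoy's actual crawler: the function crawl, the callback get_links, and the toy SITE link graph are hypothetical names, and only the limit of 800 pages comes from the text.

```python
from collections import deque
from typing import Callable, Iterable

MAX_PAGES = 800  # page limit stated in the text


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = MAX_PAGES) -> list[str]:
    """Visit pages breadth-first from start_url, stopping at max_pages."""
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # pages waiting to be crawled
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in get_links(url):  # hypothetical: returns the links found on a page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


# Toy link graph standing in for real HTTP requests (assumption for the example).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

pages = crawl("https://example.com/", lambda url: SITE.get(url, []))
first_two = crawl("https://example.com/", lambda url: SITE.get(url, []), max_pages=2)
```

Because every page has to be fetched before its links are known, this kind of traversal is slower than reading a sitemap, which lists all pages up front.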
If you know the URL to a sitemap, you can enter it directly. This can also be helpful if there are multiple sitemaps present on your website and you want to select a specific one.
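To check in advance how many pages a sitemap lists, a minimal sketch along the following lines may help. It assumes a plain urlset sitemap in the standard sitemaps.org format and does not follow sitemap index files. The function name sitemap_page_urls and the example document are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_page_urls(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in the <url><loc> entries of a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(f"{NS}url/{NS}loc")
        if loc.text
    ]


# Hypothetical sitemap content used only for this example.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/2024/some_article.html</loc></url>
  <url><loc>https://example.com/blog/2025/some_article.html</loc></url>
</urlset>"""

urls = sitemap_page_urls(EXAMPLE_SITEMAP)
print(f"{len(urls)} pages listed in the sitemap")
```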
Expert Settings
To further specify the behaviour of your crawling, take a look at the expert settings. To reduce the load on your web server, you can set a delay time in seconds that causes the crawler to pause after each request. You can also restrict indexing to certain parts of your web pages by providing a set of CSS selectors.
These CSS selectors are then used as a filter on the crawled pages. Only the content contained in elements that match the CSS selectors will be indexed. This is useful if you want to exclude certain parts of your website from being indexed. For a recipe site, for example, we might only want to include the text contained in the element with the ID recipe_detail_container as well as the elements with the class utils. To achieve this, we can use the following CSS selectors: #recipe_detail_container and .utils
In most browsers, you can right-click and choose "Inspect" to find the CSS selectors of a specific element. Check out this guide for more information on how to use CSS selectors.
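To illustrate what filtering by selectors means, here is a rough sketch that keeps only the text inside elements matching an ID or class selector. It is a simplification under stated assumptions, not how LoyJoy processes pages: it only understands selectors of the form #id and .class, it expects well-formed HTML with closing tags, and SelectorTextExtractor, extract_text, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SelectorTextExtractor(HTMLParser):
    """Collect text inside elements matching simple '#id' or '.class' selectors."""

    def __init__(self, selectors):
        super().__init__()
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.classes = {s[1:] for s in selectors if s.startswith(".")}
        self.stack = []          # one flag per open element: did it match a selector?
        self.matched_depth = 0   # number of currently open matching elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        class_names = (attr.get("class") or "").split()
        matched = attr.get("id") in self.ids or any(c in self.classes for c in class_names)
        self.stack.append(matched)
        self.matched_depth += matched

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.matched_depth -= self.stack.pop()

    def handle_data(self, data):
        # Keep text only while at least one matching element is open.
        if self.matched_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, selectors):
    parser = SelectorTextExtractor(selectors)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical recipe page for the example.
PAGE = """
<html><body>
  <nav>Home | Recipes</nav>
  <div id="recipe_detail_container"><h1>Pancakes</h1><p>Mix and fry.</p></div>
  <div class="utils tools">Serves 4</div>
  <footer>Imprint</footer>
</body></html>
"""

kept = extract_text(PAGE, ["#recipe_detail_container", ".utils"])
```

In this sketch the navigation and footer text are dropped, while the recipe container and the utils element are kept.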
Exclusions
When you add your website and its subpages for indexing, you may want to exclude certain pages from being included in the index. LoyJoy provides four distinct ways to achieve this.
After you exclude URLs, make sure to click the re-index icon of the website. The exclusions only take effect after re-indexing.
Exclude if URL path contains the following: You can specify a string that, if found in the URL path, will result in the page being excluded from indexing.
For example, specify 2025 to exclude all pages that contain the string 2025 in their URL path. The first two URLs are included, while the third one is excluded.
✅ https://example.com/blog/2024/some_article.html
✅ https://example.com/blog/2026/some_article.html
❌ https://example.com/blog/2025/some_article.html
Exclude if URL path ends with the following: You can specify a string that, if found at the end of the URL path, will result in the page being excluded from indexing.
For example, specify .xml to exclude all pages whose URL path ends with .xml. The first URL is included, while the second and third ones are excluded.
✅ https://example.com/blog/2024/some_article.html
❌ https://example.com/blog/2025/index.xml
❌ https://example.com/blog/2026/index.xml
Exclude if URL path is exactly the following: You can specify a string that, if found as the exact URL path, will result in the page being excluded from indexing.
For example, specify /blog/2025/some_other_article.html to exclude the URL with this specific path. The first two URLs are included, while the third one is excluded.
✅ https://example.com/blog/2024/some_article.html
✅ https://example.com/blog/2025/some_article.html
❌ https://example.com/blog/2025/some_other_article.html
Exclude if URL path starts with the following: You can specify a string that, if found at the beginning of the URL path, will result in the page being excluded from indexing.
For example, specify /blog/2025/ to exclude all pages whose URL path starts with this string. The first URL is included, while the second and third ones are excluded.
✅ https://example.com/blog/2024/some_article.html
❌ https://example.com/blog/2025/some_article.html
❌ https://example.com/blog/2025/another_article.html
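The four exclusion rules can be summarized in a short sketch that tests a URL's path against each kind of rule. This only mirrors the behaviour described above and is not LoyJoy's implementation. The function is_excluded and its parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def is_excluded(url, *, contains=(), ends_with=(), exact=(), starts_with=()):
    """Return True if the URL's path matches any of the given exclusion rules."""
    path = urlsplit(url).path  # e.g. "/blog/2025/some_article.html"
    return (
        any(s in path for s in contains)
        or any(path.endswith(s) for s in ends_with)
        or any(path == s for s in exact)
        or any(path.startswith(s) for s in starts_with)
    )


# The examples from the text, one rule at a time.
print(is_excluded("https://example.com/blog/2025/some_article.html", contains=["2025"]))
print(is_excluded("https://example.com/blog/2024/some_article.html", ends_with=[".xml"]))
print(is_excluded("https://example.com/blog/2025/some_other_article.html",
                  exact=["/blog/2025/some_other_article.html"]))
print(is_excluded("https://example.com/blog/2024/some_article.html",
                  starts_with=["/blog/2025/"]))
```

Note that the rules are matched against the URL path only, so the domain name itself never triggers an exclusion in this sketch.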
Upload Files
You can upload files such as pdf, pptx (PowerPoint), docx (Word), xlsx (Excel), and txt. Select one or several files, or drag and drop them into the upload field. You can add a link to the source in the dedicated field. Without a link, the chat user cannot download the file, although it is still mentioned as a source.