What is the search crawler? [EAP is over - the crawler is now generally available]

Forum|Forum|4 years ago
January 11, 2022
33 replies
0 views

Gorka

The Help Center search crawler lets you easily surface great self-service content that is not hosted in your Help Center when your end users and agents search for answers to their issues.

Important: You are responsible for using the Help Center search crawler in compliance with all applicable laws and the terms and conditions of the relevant websites. You should only add Sitemaps where you own the domain associated with such Sitemaps. By using the Help Center search crawler, you confirm that you own the domains for all Sitemaps added to the crawler and that you have the right to crawl such websites.

When we first launched Federated search, we made it possible to take all the great content that can be helpful to your customers but is not native help center content (articles, posts and comments) and weave it into help center search results when it is relevant to the user's query.

Until now this has only been possible by ingesting records of that content via the External content API, which required you to build and maintain middleware to integrate Help Center with the service, website, LMS, blog, etc. that hosts the external content.

Now you can just set up a crawler in a few clicks, with no coding necessary.

How does it work?

To use the crawler to get federated search results to show up on the search results page (SERP) of your help center, you need to do the following 3 things:

Set up your theme for federated search
Set up your crawlers (see below)
Configure federated search in your help centers

You can set up multiple crawlers to crawl and index different content in the same or different web properties.

Once created, each crawler runs once every 24 hours. On each run it visits the pages in the sitemap you have specified for the crawler (see Configuring your sitemap below), scrapes their content and metadata, and ingests it into your Help Center’s search index.
Note that this means the crawler, despite its name, does not follow the links on the pages it visits to scrape and index the linked pages. It only visits the pages in the sitemap it is configured to use (see Configuring your sitemap).

What does it index?

The crawler indexes content that is in the page source on the initial page load, even if it's hidden by something like an accordion. The crawler does not crawl content that is dynamically rendered after the initial page load or rendered by JavaScript since the crawler does not execute JavaScript.
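To get a feel for what "in the page source on the initial page load" means, here is a minimal Python sketch that extracts text from raw HTML without executing any JavaScript. It is only an approximation for checking your own pages, not the crawler's actual extraction logic; the sample page and the `static_text` helper are made up for illustration.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text present in the raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible-or-hidden markup text, never script source.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a non-JavaScript client could read from the page source."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one hidden accordion panel, one element filled in by JavaScript.
SAMPLE_PAGE = """
<html><body>
  <h1>Reset your password</h1>
  <div class="accordion" style="display:none">Hidden accordion answer</div>
  <div id="late"></div>
  <script>document.getElementById("late").textContent = "Injected by JavaScript";</script>
</body></html>
"""

print(static_text(SAMPLE_PAGE))
```

The hidden accordion text is found because it is in the source, while the text that JavaScript would inject never shows up.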

Setting up your crawlers

To add a crawler, go to Guide > Settings > Search Settings > Federated search > Crawler and click “Add crawler”.

Here you will be guided through 4 steps to set up your crawler:

1) Add sitemap

In the Sitemap URL field you can input either a domain address or a specific URL path to your desired sitemap XML file.

If you enter a domain address the crawler will try to identify the sitemap location by trying common sitemap paths.
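As a rough illustration of the difference between entering a domain and entering a sitemap URL, the sketch below builds the list of URLs that would be tried in each case. The list of "common" paths is an assumption made for this example; the paths the crawler really tries are not listed in this post.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit

# Assumed for illustration only: the crawler's real list of common paths is not documented here.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml")


def candidate_sitemap_urls(user_input: str) -> List[str]:
    """Return the sitemap URL(s) to try for a domain or an explicit sitemap URL."""
    value = user_input.strip()
    if "://" not in value:
        value = "https://" + value
    parts = urlsplit(value)
    if parts.path.lower().endswith(".xml"):
        # A specific sitemap file was given: use exactly that location.
        return [urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))]
    # Only a domain was given: guess from the common paths.
    return [urlunsplit((parts.scheme, parts.netloc, path, "", "")) for path in COMMON_SITEMAP_PATHS]


print(candidate_sitemap_urls("docs.example.com"))
print(candidate_sitemap_urls("https://docs.example.com/product/sitemap.xml"))
```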

If you enter a specific URL for your sitemap file, the crawler will try to fetch the sitemap from that location. This means that if you do not want the crawler to crawl all pages on your site, you can create a dedicated sitemap and point the crawler to it.

If you set up multiple crawlers on the same site, they can each use different sitemaps to scope what they index.

The sitemap has to follow the Sitemaps XML protocol, and the crawler only visits the pages in the sitemap that it is configured to use.
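If you want to preview which pages a dedicated sitemap would put in scope, a small script like the following can list the `<loc>` entries of a `<urlset>` sitemap. This is a sketch using Python's standard library with a made-up sample sitemap; it does not reproduce the crawler's own parser or validation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps XML protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical dedicated sitemap that scopes a crawler to two pages.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/guides/getting-started</loc>
    <lastmod>2022-01-11</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/guides/billing</loc>
  </url>
</urlset>
"""


def pages_in_sitemap(xml_text):
    """Return the page URLs listed in a <urlset> sitemap, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for page in pages_in_sitemap(SAMPLE_SITEMAP):
    print(page)
```

Only the pages printed here would be visited; anything merely linked from those pages is out of scope.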

2) Verifying your domain ownership

For security and abuse reasons the crawler has to be able to verify ownership of your domain. To enable that you need to copy and paste the HTML tag into the <head> section in the HTML of your site's non-logged-in home page.

The crawler will try to verify domain ownership every time it tries to crawl by looking for the HTML tag, so to stay verified don't remove it.
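Before waiting for the next crawl, you can check for yourself that the tag really ended up inside `<head>`. The sketch below assumes the tag is a `<meta>` element with `name` and `content` attributes and uses the name `zd-site-verification` mentioned in the comments on this post; the value `abc123` and the sample home page are placeholders, so copy the exact tag from the crawler setup instead.

```python
from html.parser import HTMLParser


class HeadMetaFinder(HTMLParser):
    """Records <meta name=... content=...> tags that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self._in_head = False
        self.head_meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attr = dict(attrs)
            if attr.get("name"):
                self.head_meta[attr["name"].lower()] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


def has_verification_tag(html, expected_value, tag_name="zd-site-verification"):
    """True if the home page source has the tag, with the expected value, in <head>."""
    finder = HeadMetaFinder()
    finder.feed(html)
    return finder.head_meta.get(tag_name) == expected_value


# Placeholder home page source; in practice, fetch your non-logged-in home page.
SAMPLE_HOME = """<html>
<head>
  <title>Docs</title>
  <meta name="zd-site-verification" content="abc123">
</head>
<body><p>Welcome</p></body>
</html>"""

print(has_verification_tag(SAMPLE_HOME, "abc123"))
```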

3) Add search properties

The content crawled by each crawler needs to have an associated Source and Type. These are used for users to filter search results.

Sources will appear as values in the Source filter, alongside other external sources and, if you use multi-help center search, other Help Centers in your instance.

Types will appear as values in the Types filter together with the native types “Articles” and “Community”.

You can either choose a source or type you have created previously or you can create a new one by typing the source or type name and clicking “Add as new source/type”, as long as you haven’t hit the sources and types product limit.

When you click a source or a type the search results list will only include results of that source or type. Source and type filtering can be combined.

4) Get email notifications

As the last step you can set an owner for the crawler. By default the owner is the user creating the crawler, but you can set any other Guide Admin as the crawler owner. The crawler owner receives the email notifications that tell you whether the crawler has succeeded or whether there are issues with domain verification, processing your sitemap or crawling pages.

To set up more than one crawler you just repeat the steps described.

What happens when the crawler is added?

Once you hit “Finish” your crawler is created and pending. Within 24 hours the crawler will first try to verify ownership of your domain. If it fails, the crawler owner will receive a notification by email with troubleshooting tips and the crawler will try again in 24 hours, and so on.

When the crawler succeeds in verifying your domain, it immediately proceeds to fetch and parse (process) the sitemap you have specified. If this fails, the crawler owner will again receive a notification by email with troubleshooting tips and the crawler will try again in 24 hours, and so on. Note that the crawler fetches the current sitemap each time it tries to crawl, so if the sitemap changes, the scope of which pages are crawled will also change, and previously indexed pages that no longer appear in the sitemap will no longer appear in search results.

When sitemap processing succeeds, the crawler immediately proceeds to crawl the pages and index their content. Once that’s done, the crawler owner will receive an email with a CSV report of all the pages the crawler attempted to crawl and index and whether they succeeded or failed. Each failed page will have an error message to help troubleshoot.

Remember that for federated search results to appear you also need to set up your theme for federated search and configure federated search in your help centers (see How does it work? above).
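If the report covers many pages, a few lines of Python can pull out just the failures. Note that the column names (`url`, `status`, `error`) and the error texts in the sample below are hypothetical, since the exact layout of the CSV is not described in this post; adjust them to match the file you actually receive.

```python
import csv
import io

# Hypothetical report layout and messages: the real CSV columns and errors may differ.
SAMPLE_REPORT = """url,status,error
https://docs.example.com/guides/getting-started,success,
https://docs.example.com/guides/long-page,failed,Body exceeds maximum length
https://docs.example.com/guides/fr-only,failed,Locale not enabled
"""


def failed_pages(report_text, status_column="status", failed_value="failed"):
    """Return the report rows whose status column marks them as failed."""
    reader = csv.DictReader(io.StringIO(report_text))
    return [
        row for row in reader
        if row[status_column].strip().lower() == failed_value
    ]


for row in failed_pages(SAMPLE_REPORT):
    print(row["url"], "->", row["error"])
```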

Can I use the external content API and the crawler simultaneously?

You can absolutely use the external content API and the crawler simultaneously. You should just be aware that if you delete a source or a type via the API, then any crawler that is creating or updating records for the deleted source or type will stop working.

Current limitations

The content to crawl has to be publicly available. This limitation we are not planning to solve in v1.
You can't delete or edit sources and types through the GUI.
Dedicated custom sitemaps are necessary if you want to have several crawlers crawl the same domain or scope crawling.
Email notifications are text only.
The content to crawl has to be UTF-8 encoded. The crawler does not support other encodings.

Product limits

General Federated search limit of 50000 external records.
Record max title length - 255 characters
Record max body length - 10000 characters
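To spot pages that might run into the title and body limits before the crawler does, you could run a quick check like this sketch. It simply counts characters of the title and body text you pass in; how the crawler itself measures length (for example, whether markup counts) is not specified here, so treat the result as an estimate.

```python
# Limits as listed above.
MAX_TITLE_CHARS = 255
MAX_BODY_CHARS = 10_000


def limit_problems(title, body):
    """Return a list of human-readable problems; empty if both lengths are within limits."""
    problems = []
    if len(title) > MAX_TITLE_CHARS:
        problems.append(f"title is {len(title)} characters (max {MAX_TITLE_CHARS})")
    if len(body) > MAX_BODY_CHARS:
        problems.append(f"body is {len(body)} characters (max {MAX_BODY_CHARS})")
    return problems


print(limit_problems("Getting started", "Short body text"))
print(limit_problems("Transcript", "word " * 3000))
```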


Jacob20
January 13, 2022

Sounds very interesting @gorka! 🙌

Will searching external sources be limited to queries made from within the Guide product? Or will it also be something that could be available for agents from within the Support product?

Patrick25
January 13, 2022

Very cool!

Curious how the 10,000 character body length limit works. Does content that exceeds that limit simply not appear at all in searches? Or does the crawler crawl only up to 10,000 characters of a piece of content and then stop—but the content still appears in search based on what the crawler indexed in that 10,000 characters?

Julien14
January 13, 2022

Hey, that sounds very interesting. Is that content available for Answer Bot? (This was in your roadmap for Federated Search...)

조규승
January 17, 2022

Will this also be available for other areas where Zendesk recommends articles, like the web form (subject) and mobile SDK?

Gorka
Author
January 18, 2022

@jacob20, @julien14 and @lee26 Thank you for your interest in the EAP! I'll try to answer all of you at once.

Right now users can search external content through Help center native search and the Unified search API. Soon it will also be possible within the Knowledge search in the context panel in Zendesk Support.

As for other search interfaces, such as Article suggestions, Answer Bot (which powers messaging and email suggestions), Web Widget, Mobile SDK and Instant search, we are at different stages of development with all of them, but it is too early to give an ETA. We intend to eventually deliver all of them, and you will see releases in this area in 2022 if nothing unforeseen prevents us.

Gorka
Author
January 18, 2022

@patrick25 Thank you for your question and interest in the EAP!

As it is now, the crawler will not index the page. You will get an email with a CSV of all pages the crawler has attempted to index, where you can see the error for each page that failed, so you can see which page is longer than 10,000 characters and split it or shorten it.

I'm curious which of the two behaviours you would prefer, and why?

If anyone else has an opinion on the matter, I would also love to hear it here in the comments.

Patrick25
January 18, 2022

@gorka, I suppose I'd say I'd prefer there were no length limits :-), but having some content crawled and indexed is better than none, as long as that piece of content appears in search. (And if a piece of content is 10k characters long, it's likely the important keywords and context for your search engine appear in that first 10k characters.)

Having some but not all content findable through search defeats the purpose of federated search. If users search in our Zendesk-based Guide for a training that lives in our LMS and they don't find it because it's too long, they'll assume it just doesn't exist. I want federated search to make it so our users can know for certain what content is available to them, regardless of the platform it's hosted on.

Here's some background on our use cases:

Product Training Courses from a 3rd-party LMS

In one of our use cases, we have a number of product training courses in our LMS that I'd love to be indexed for search in our Zendesk-based Guide. These courses include mostly videos, but we transcribe them for accessibility reasons. I'm sure most of our trainings exceed the 10,000 character limit.

(Note: I suppose this depends on what records are being indexed from our LMS. If it's the Course-level descriptions, 10k characters is probably fine. If it's lessons within those courses, 10k characters is probably too few)

Industry Content from Our Corporate Site

Our other use case is to include some of the industry-related educational content from our corporate site in our Zendesk-based Guide. Much of this content is quite long-form, and is created and managed by another department. They won't be splitting or shortening this content just to have it be found through federated search in our Guide. It's already been optimized to be useful for our audience and findable on public search engines.

I'm happy to answer any other questions about our use cases~ Thanks!

Gorka
Author
January 24, 2022

@patrick25 Thank you for the very thorough explanation and description of your use cases, I understand why it is important.

For now I think I have the information I need, but I'll reach out if I have followup questions.

Gina16
January 26, 2022

Hi,

You note that there is a limitation based on whether or not the information is public.

Can you expand on what you mean by public? Does this mean anything that requires a password to access cannot be used? Is it everything that's set to public within a shared environment?

For example, can we crawl information on our company's Confluence, which all of our agents have access to, but the general public does not?

Nick12
January 27, 2022

Will the crawler search metafields?

Gorka
Author
February 4, 2022

@gina16 good question. It means that it needs to be on a website with no restrictions, like user logins, password restriction, IP restriction or similar, because the crawler does not have the capability to get past these barriers. With IP restrictions you could whitelist our IP address on your side and it should work, but that's a workaround you would have to implement in your system.

I'm actually not sure which IP addresses the crawlers use, but I can figure it out if it is needed, but here would be a place to start.

Gorka
Author
February 4, 2022

@Jordan Brown Could you expand on which meta fields you mean?

We do, for example, try to determine the language and locale from, among other things, the lang attribute in the <html> tag and the <meta> tag.

Nick12
February 4, 2022

@gorka Metafields are extra, hidden data in each object or in your shopfront that tell you more about the object itself without revealing it. These look like drop-downs or accordion-type content that is hidden unless it's expanded.

Gorka
Author
February 16, 2022

We have heard that several EAP participants have had problems adding sources and types when setting up the crawler. There is a UX issue that we are aware of, but until we roll out a better experience I have added a post on how to work around it.

Gorka
Author
February 17, 2022

@Jordan Brown

Content that is in the page source once initially loaded is crawled even if it's hidden by something like an accordion, but the crawler does not crawl content that is dynamically rendered after the initial page load or rendered by JavaScript since the crawler does not execute JavaScript (Added to post).

I'm not super familiar with metafields, though I can see that at least sometimes they are rendered on the initial page load and would thus be indexed, but whether that is always the case I can't say.

Gorka
Author
March 7, 2022

Hi @Korak Purkayastha

This may be because the lang tag that is there does not match a locale in your help center(s). I will publish a post with more details ASAP, but in short the crawler tries to detect the locale of the pages it crawls and match that with the locales enabled in the Guide account that the crawler is from. If there is no match the page will not be indexed.
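To check whether a page would have a chance of matching, you can compare its `lang` attribute with the locales enabled in your account. The sketch below assumes a simple case-insensitive exact match and uses a made-up list of enabled locales; the crawler's real detection and matching rules may well be more lenient or use additional signals.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Captures the lang attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def page_lang(html):
    """Return the page's declared language, or None if there is no lang attribute."""
    finder = LangFinder()
    finder.feed(html)
    return finder.lang


def matches_enabled_locale(html, enabled_locales):
    """Assumed rule: the lang attribute must equal an enabled locale, ignoring case."""
    lang = page_lang(html)
    if not lang:
        return False
    return lang.strip().lower() in {locale.lower() for locale in enabled_locales}


# Hypothetical set of locales enabled in the Guide account.
ENABLED = ["en-us", "da"]

print(matches_enabled_locale('<html lang="en-US"><body>Hi</body></html>', ENABLED))
print(matches_enabled_locale('<html lang="fr"><body>Salut</body></html>', ENABLED))
```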

Gorka
Author
March 7, 2022

Hey all,

I wanted to let you know that we have added the ability to edit a crawler that is already created.

You can now edit:

Sitemap URL
Source
Type
Owner

You access it by clicking the overflow menu icon on the crawler and then "Details" in the "Crawler" tab.

Full navigation:

Settings --> Search settings --> Federated search --> Crawler --> Overflow menu icon --> Details

Nick12
March 12, 2022

Are there plans to remove the 3,000 record limit at some point? When will this flow through to messaging as well?

Gorka
Author
March 18, 2022

Hi Ron,

We are looking at the 3000 record limit. What is the number of pages you need to index?
I can't give you an ETA at this point for when external content can be served via Answer Bot that powers the messaging answer suggestions, but it is part of our roadmap. We expect to be able to serve external content for Web Widget within 6 months, but as always this is not a promise just a target. Did I understand your question right?

Malcolm14
March 29, 2022

We have been struggling with the search crawler, which is not able to completely scan our shared documentation site. Each product in our product line has its own subsite on the documentation site, e.g. https://docs.ourbrandname.com/productname/

There is a top level sitemap.xml, on https://docs.ourbrandname.com/sitemap.xml which contains url / loc tags referencing subsite sitemaps, like https://docs.ourbrandname.com/productname/sitemap.xml

When I configure it as I would like, with a single crawler pointing at the root level sitemap.xml, the crawl fails with simply

Sitemap setup: Failed

I think this is happening because there are no HTML pages in this sitemap, only pointers to other sitemaps, but it is impossible to tell with this error message. Is there a way to know for sure this is the problem?

When I configure the crawler to use the product-specific sitemap, I am able to successfully crawl most of the content. Some pages return a 302 redirect but this does not seem to be a major issue (those pages are for embedding the documentation in another site, and don't need to be indexed)

Since I can configure a crawler for each product I considered working around the first issue by adding crawlers for each product. There are 17 products in this site.

However, if I configure multiple crawlers to the same site, each crawler generates a unique zd-site-verification value for each crawler, rather than re-using the one for the site that was already defined.

Is there a way to configure this? Will Zendesk add support for referenced sitemap files? I would love to give our users the ability to find content in our product documentation from our help center!

Eli14
March 29, 2022

I also think the crawler would benefit from more detailed troubleshooting information. The sitemap I tried to use initially failed. When I viewed the crawler at Guide Admin>Settings>Search settings>Federated Search>Crawlers, I saw the error message:

Sitemap fetch failed:
The crawler has not been able to access or correctly parse the sitemap on <link to sitemap>. Take a look at the content crawler troubleshooting article to solve the issue. <link>

It took us some time to figure out why the sitemap failed. It turns out the dates in the <lastmod> tags were not in proper W3C datetime format.

My comments:

As noted in the post above, the email notification for a failed sitemap configuration simply states Sitemap setup: Failed. It would be useful if the more detailed error message displayed in Guide Admin were included in this email.
The error message in Guide Admin is followed by a link that I expected to take me to a content crawler troubleshooting article. However, the link takes me to the same crawler details page I am already on.
It would be very helpful if the error messages were more specific and clear (line numbers for parsing failures, the response status code for access failures, etc.).
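For anyone hitting the same <lastmod> problem, a quick pre-check of the date strings can save some guessing. This sketch tests only the shape of a value against the W3C Datetime profile (year, year-month, full date, or date and time with a mandatory timezone designator); it does not validate ranges such as month 13, and whether the crawler accepts every variant of the profile is an assumption.

```python
import re

# Shape of the W3C Datetime profile: YYYY, YYYY-MM, YYYY-MM-DD,
# or YYYY-MM-DDThh:mm[:ss[.s]] followed by Z or a +hh:mm / -hh:mm offset.
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)


def is_w3c_datetime(value):
    """True if the string has the shape of a W3C Datetime value."""
    return W3C_DATETIME.fullmatch(value.strip()) is not None


for candidate in ["2022-03-29", "2022-03-29T10:15:00+01:00", "03/29/2022", "2022-03-29 10:15:00"]:
    print(candidate, "->", is_w3c_datetime(candidate))
```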

Cedric21
April 27, 2022

Hi. I'm also getting an error "The sitemap for (my source) couldn't be processed. Review your sitemap setup to make sure the latest content appears in your help center search results."

The sitemap I'm trying to add is being generated by a third party tool (Document360). It's formatted as such, starting with: <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

And then within that <urlset> tag, has individual URLs contained within <url> tags, with the URL itself in a <loc> tag, then <lastmod> and <changefreq> tags.
 
I'm not really an expert on sitemap formatting or these standards so I might just be missing something obvious, is this just not going to work? The error message and the documentation provided hasn't really gotten me anywhere. Do I need to convert this sitemap somehow or are there plans to expand the crawler to cover this format of sitemap?

Dan28
May 3, 2022

It appears that the number of records available to us has increased from 3000 to 50000 on a few screens, but some places in the product and documentation still seem to indicate that 3000 is the norm. Can we get confirmation that 50000 records are supported in practice? We are looking to coordinate with some internal teams that own the external content and might need to take a different path in working with them if we need to account for running into the 3000 limit.

Understanding that this is EAP, these are the areas where we are still seeing the 3000 record number:

On the Configure Federated Search screen we see 3000 in the tooltip (The tag next to records remaining reads as 50000 of 50000)
On the Guide Product limits for your help center article, we are still seeing 3000 listed

Gorka
Author
May 10, 2022

@malcolm14 first apologies for the very long wait!

When I configure it as I would like, with a single crawler pointing at the root level sitemap.xml, the crawl fails with simply

Sitemap setup: Failed

I think this is happening because there are no HTML pages in this sitemap, only pointers to other sitemaps, but it is impossible to tell with this error message. Is there a way to know for sure this is the problem?

Will Zendesk add support for referenced sitemap files? I would love to give our users the ability to find content in our product documentation from our help center!

You are correct in your assumption, the crawler does not currently support sitemap indexes so the work around is right now as you describe.

We are aware of this limitation and will prioritise the issue in relation to other missing functionality and bugs. I cannot promise this particular issue will be prioritised before we make the crawler generally available, but if it is not, that does not mean we will not add support for sitemap indexes; we will just do it after making the feature generally available.
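Until sitemap indexes are supported, a small script can tell you whether a given file is an index and list the child sitemaps you would need to configure one crawler for each. This is a sketch based on the Sitemaps XML protocol's `<sitemapindex>` and `<urlset>` root elements, with a made-up sample file; it is not how the crawler itself processes sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps XML protocol, in ElementTree's {uri}tag form.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical top-level file that only points to per-product sitemaps.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://docs.example.com/product-a/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://docs.example.com/product-b/sitemap.xml</loc></sitemap>
</sitemapindex>
"""


def classify_sitemap(xml_text):
    """Return ("index", child_sitemaps), ("urlset", pages) or ("unknown", [])."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == NS + "sitemapindex":
        locs = root.findall(f"{NS}sitemap/{NS}loc")
        return "index", [loc.text.strip() for loc in locs if loc.text]
    if root.tag == NS + "urlset":
        locs = root.findall(f"{NS}url/{NS}loc")
        return "urlset", [loc.text.strip() for loc in locs if loc.text]
    return "unknown", []


kind, entries = classify_sitemap(SAMPLE_INDEX)
print(kind)
for entry in entries:
    print(entry)
```

Each child sitemap printed for an "index" file is a candidate Sitemap URL for its own crawler.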

However, if I configure multiple crawlers to the same site, each crawler generates a unique zd-site-verification value for each crawler, rather than re-using the one for the site that was already defined.

Is there a way to configure this?

Not right now. Currently each crawler has its own verification code. I completely understand that it is more cumbersome to have to add a verification code for each crawler instead of for each site. This decision was made based on a tradeoff between the benefit and the cost to implement. We decided to initially launch with one tag per crawler, but we have this issue in our backlog for further improvement of the crawler.

Gorka
Author
May 10, 2022

@eli14 apologies for the very long wait!

Thank you for your very detailed feedback. I will lift that straight to a backlog item.

We are working on better troubleshooting documentation and apologies for it taking so long. The link in the product links to this article as a second best option since we don't have the troubleshooting documentation in place yet, but it will be pointed to a better troubleshooting article once it is ready.