Sounds great - we can't wait to get the EAP started and provide feedback!
We are a software vendor with around 100 different products.
For each product we host a separate user manual as WebHelp in one to n languages, created with Help & Manual as well as MadCap Flare.
Each manual has between 50 and 3,500 pages/articles across different chapters.
We are willing to provide public links for testing purposes etc.
Our internal development team implemented its own crawler, and we ran into several API limitations:
- many records have a body that exceeds 10,000 characters
- as noted above, we have more than 5 sources
- we clearly need more than 3,000 records
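For context, the body-size limit is what hurts most. A minimal sketch of how our crawler might split an over-long article body into several records before pushing (all names here are our own illustration, not the real API):

```python
def split_body(body: str, limit: int = 10_000) -> list[str]:
    """Split an article body into chunks that fit a per-record character limit.

    Splits on paragraph boundaries where possible so chunks stay readable;
    a single paragraph longer than the limit is hard-split.
    """
    paragraphs = body.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split paragraphs that exceed the limit on their own.
        while len(para) > limit:
            chunks.append(para[:limit])
            para = para[limit:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be pushed as its own record, e.g. with a suffix in the title ("Installation (2/3)").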
Will it be possible:
- to crawl websites that are restricted by an authentication mechanism such as user/password? (As described here, this is not in phase 1.)
- to exclude certain pages? Several articles in our product WebHelps duplicate information from Zendesk Help Center articles, and that content should primarily live inside the Help Center.
It would be great to have an exclude list.
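By "exclude list" we mean something like glob-style URL patterns matched before indexing. A sketch of the behaviour we have in mind (the patterns and helper are hypothetical, not an existing Zendesk feature):

```python
from fnmatch import fnmatch

# Hypothetical exclude list: glob-style URL patterns.
EXCLUDES = [
    "https://help.example.com/*/faq/*",
    "*/release-notes/*",
]

def is_excluded(url: str) -> bool:
    """Return True if the URL matches any pattern on the exclude list."""
    return any(fnmatch(url, pattern) for pattern in EXCLUDES)
```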
How will you handle updates such as:
- indexed content gets updated
- an indexed/crawled site gets deleted
- a new site needs to be indexed
Will this always be a full re-crawl or just a diff?
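For reference, the diff behaviour we would hope for: compare the previous index against a fresh crawl by content hash and only touch what changed. A sketch of that idea (our assumption, not necessarily how Zendesk implements it):

```python
import hashlib

def page_hash(body: str) -> str:
    """Stable fingerprint of a page body, used to detect content changes."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def diff_crawl(old: dict[str, str], new: dict[str, str]):
    """Compare two crawls, each mapping URL -> content hash.

    Returns (added, updated, deleted) URL sets, so only those
    records need to be re-indexed or removed.
    """
    added = set(new) - set(old)
    deleted = set(old) - set(new)
    updated = {url for url in set(old) & set(new) if old[url] != new[url]}
    return added, updated, deleted
```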
Is it possible to automate the crawler within the Zendesk Guide admin interface or via API?
This has been explained in the following article.
Best Regards!