Sitecore Solr Search in Chinese

Recently we worked on a website in English and Chinese for one of our clients, and we implemented its search function with the Sitecore Solr search provider.

The Issue

When we searched for English terms, results were returned normally. But when we searched for Chinese terms, the search did not work for keywords containing more than one character. For example, a search for “你好” returned nothing, while a search for either “你” or “好” returned the correct results.

We monitored the query string sent to the Solr search engine by the Sitecore search service. It looks something like content : (*你好*) AND _language : (zh-CN) for Chinese, and content : (*Hello*) AND _language : (en) for English.

In our case, the query content : (*你好*) AND _language : (zh-CN) doesn’t work, but a query like content : (*你*) AND _language : (zh-CN) does. On the website, the visible symptom is that searching for anything in Chinese longer than one character returns an empty result.
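One possible explanation can be sketched in a few lines of Python. This is only a simulation and not Solr itself: it assumes the index stores each Chinese character as a separate term (which is consistent with the behavior described above, but was not verified here) and that a wildcard pattern has to match a single indexed term. The term list and the function name are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumption: the analyzer indexed each Chinese character as its own term.
indexed_terms = ["你", "好", "世", "界"]

def wildcard_hits(pattern, terms):
    """Return the indexed terms that match a wildcard pattern such as *你好*."""
    return [term for term in terms if fnmatchcase(term, pattern)]

# A one-character pattern can match a one-character term.
print(wildcard_hits("*你*", indexed_terms))

# A two-character pattern cannot match any one-character term.
print(wildcard_hits("*你好*", indexed_terms))
```

Under these assumptions the first call finds the term 你 and the second finds nothing, which mirrors what we saw on the site.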

The Investigation

We searched the internet and asked the Sitecore Support team for help. The official solution is to use a dedicated analyzer and tokenizer for Simplified Chinese, such as “HMMChineseTokenizerFactory”, together with a stopword list to recognize stop words. We tried that, and it helped, but only a little: only very common Chinese words like “你好” (Hello) were recognized, and in most cases the search still returned no results.

We were stuck at that point, until I read an article by Jacques Sham that explains how differently Chinese and English text has to be tokenized:

The major difference between Chinese and English is that structure of writing. The problem of NLP in Chinese is: If you tokenize Chinese characters from the articles, there is no whitespace in between phrases in Chinese so simply split based on whitespace like in English may not work as well as in English.

Jacques Sham, NLP : Tokenizing Chinese Phrases

That settled the question and made us aware that solving the problem completely would take much more effort. We therefore decided to go with a workaround.

The Workaround

We had noticed that the search works when the term is a single Chinese character. So by manipulating the search terms passed to the search service and adding a space between the Chinese characters, we get Solr query strings like content : (*你*) AND content : (*好*) AND _language : (zh-CN), which return the desired results.

The search results are somewhat fuzzy, since results like “你不好” will also show up, but for now the workaround is the best balance of budget and outcome for this specific project.
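The idea can be sketched as follows. This is an illustration in Python and not the code from our Sitecore solution; the function names are hypothetical, the character range is an assumption that covers only the basic CJK Unified Ideographs block, and the query text only mirrors the shape of the queries quoted above.

```python
import re

# Assumption: only characters in the basic CJK Unified Ideographs block
# need to be split. Extend the range if other blocks are required.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def split_cjk(term):
    """Split a search term so each Chinese character becomes its own keyword."""
    spaced = CJK_CHAR.sub(lambda m: f" {m.group(0)} ", term)
    return spaced.split()

def build_query(term, language):
    """Build one wildcard clause per keyword, joined with AND."""
    clauses = [f"content:(*{keyword}*)" for keyword in split_cjk(term)]
    clauses.append(f"_language:({language})")
    return " AND ".join(clauses)

print(build_query("你好", "zh-CN"))
print(build_query("Hello", "en"))
```

English terms pass through unchanged, so the same code path can serve both languages.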

Final Thoughts

From this case, we can conclude that there’s still a long way for both Sitecore and the partners like us to go in order to localize the CMS behemoth, especially in China. Solr search is just the tip of the iceberg, and we have much more to consider, like how to integrate with the gigantic WeChat ecosystem.

Thanks to Chee How Yap and Vladyslav Molibozhenko from the Sitecore Support team; you did a great job helping us investigate the issue.

And thank you for reading!

SXA Creative Exchange Export Generating Incorrect Links – A Language Related Bug

Recently I have been working on a multi-language Sitecore project with SXA enabled. One day I noticed a weird phenomenon: I have a page called “Entertainment”, but none of my exported pages could link to it, because every link to this page came out as “mysite.local/tertainment”. The “En” was missing from the link’s URL.

Since all links worked except those pointing to this page and its sub pages, the problem was clearly not a misspelled link field. So why was “En” missing from “Entertainment”? I had exported the site in English, so could it be related to the language? From there I started to look into the SXA Creative Exchange export pipeline. The pipeline is very long, but I managed to locate the processor responsible for handling the links.

It is Sitecore.XA.Feature.CreativeExchange.Pipelines.Export.PageProcessing.FixInternalLinks in the Sitecore.XA.Feature.CreativeExchange library. It has a “TrimLanguage” method that removes the embedded language code from the exported links.

So if the link starts with the current language code “en”, it gets trimmed, which means something like “entertainment/joy” is treated the same way as “en/about”. That’s why my links didn’t work. Fortunately we can fix it easily by checking for “en/” instead of “en” at the start of the link.
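The difference between the two checks can be sketched like this. The real processor is .NET code inside SXA. The Python below is only an illustration of the logic described above, and both function names are hypothetical.

```python
def trim_language_buggy(link, language):
    """Strip the language code whenever the link merely starts with it."""
    if link.startswith(language):
        return link[len(language):].lstrip("/")
    return link

def trim_language_fixed(link, language):
    """Strip the language code only when it is a whole path segment."""
    if link == language:
        return ""
    prefix = language + "/"
    if link.startswith(prefix):
        return link[len(prefix):]
    return link

print(trim_language_buggy("entertainment/joy", "en"))  # loses its first two letters
print(trim_language_fixed("entertainment/joy", "en"))  # left untouched
print(trim_language_fixed("en/about", "en"))           # language segment removed
```

Requiring the trailing slash means a page name that merely begins with the language code is no longer mistaken for a language segment.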

Then we can patch it in, overriding the existing pipeline processor, and export again. And yes, the exported site now has the correct links.

Thanks for reading, hope this article helps.

Troubleshooting Server Error Pages of Sitecore Experience Accelerator (SXA)

The Sitecore Experience Accelerator, well known as SXA, comes with a handy feature for setting up 404 and 500 pages.

I decided to take advantage of this feature in a recent project. The 404 page worked perfectly, yet the 500 page didn’t. The static page was generated successfully, but the site never redirected to it.

As a first troubleshooting step, I needed to locate the pipeline processor that handles the HTTP request when the error occurs, which I did with the Sitecore admin show config tool (/sitecore/admin/showconfig.aspx).

At this point I could dig into the backend code of SXA for more clues, starting with the Process method of the processor. Since SXA is a multisite solution, there can be many static 500 pages generated. Once a server error occurs, the processor needs to determine which site the static page belongs to, then redirect the request to the correct static page. It does this by using SiteInfoResolver in the Sitecore.XA.Foundation.Multisite library.

As soon as I navigated to the SiteInfoResolver, I found the smoking gun: it compares the current host name with the host name field on SXA’s site definition item, and in that field I had used a wildcard host name to include all sub-domains. Because the comparison is a plain string compare, it can’t determine which site the static page belongs to.
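The mismatch is easy to reproduce in isolation. The sketch below is in Python and is not the SXA code. Both function names are hypothetical, and *.example.local stands in for a wildcard host name as an assumed example value.

```python
from fnmatch import fnmatchcase

def matches_exact(request_host, configured_host):
    """Plain string comparison, as described for the resolver above."""
    return request_host.lower() == configured_host.lower()

def matches_wildcard(request_host, configured_host):
    """Treat * in the configured host name as a wildcard."""
    return fnmatchcase(request_host.lower(), configured_host.lower())

request_host = "dev.example.local"

print(matches_exact(request_host, "*.example.local"))     # wildcard setting never matches
print(matches_wildcard(request_host, "*.example.local"))  # wildcard-aware check matches
print(matches_exact(request_host, "dev.example.local"))   # plain host name matches
```

With a plain comparison, only a configured host name that is identical to the request host can ever resolve to a site.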

This looks like a design flaw in SXA: the host name field accepts both wildcard and plain host names, but the site info resolver handles only plain ones. I could patch the SiteInfoResolver to fix this, but it belongs to a foundation project and many SXA features rely on it. So this time I simply changed my host name setting to use plain host names like dev.example.local. After that, the 500 page worked perfectly. I hope Sitecore will fix this in a future release of SXA.

Thanks for reading.