
Recently we have been working on a website in English and Chinese for one of our clients, and it includes a search function that we implemented with the Sitecore Solr search provider.
The Issue
When we search for English terms, the results are returned normally. But when we search for Chinese terms, the search doesn’t work for keywords that contain more than one character. For example, a search for “你好” returns nothing, while a search for either “你” or “好” returns the correct results.
We monitored the query string sent to the Solr search engine by the Sitecore search service. It looks something like content : (*你好*) AND _language : (zh-CN) for Chinese, and content : (*Hello*) AND _language : (en) for English.
In our case, the query content : (*你好*) AND _language : (zh-CN) doesn’t work, but a query like content : (*你*) AND _language : (zh-CN) does. On the website, the visible symptom is that searching for any Chinese term longer than one character returns an empty result.
The Investigation
We searched the internet and asked the Sitecore Support team for help. The official solution is to use a dedicated analyzer and tokenizer for Simplified Chinese, such as “HMMChineseTokenizerFactory”, together with a stopword list to recognize stop words. We tried that, and it helped, but only a little. Only very common Chinese words like “你好 (Hello)” were recognized; in most cases the search still returned no results.
We were stuck at that point, until I read an article by Jacques Sham, which explains how differently Chinese and English behave when it comes to tokenizing:
The major difference between Chinese and English is that structure of writing. The problem of NLP in Chinese is: If you tokenize Chinese characters from the articles, there is no whitespace in between phrases in Chinese so simply split based on whitespace like in English may not work as well as in English.
Jacques Sham, NLP : Tokenizing Chinese Phrases
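To make the quoted point concrete, here is a small Python sketch. It is not part of our Sitecore solution, and the sample strings are our own; it only shows that splitting on whitespace yields separate tokens for English but a single unsplit token for Chinese.

```python
english = "Hello world"
chinese = "你好世界"  # "Hello world", written without any spaces

# str.split() with no arguments splits on runs of whitespace.
english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # ['Hello', 'world']
print(chinese_tokens)  # ['你好世界']
```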
That settled the question and made us aware that solving the problem completely would take much more effort. We therefore decided to go with a workaround.
The Workaround
We had noticed that the search works when the term is a single Chinese character. So by manipulating the search terms passed to the search service and adding a space between the Chinese characters, we get Solr query strings like content : (*你*) AND content : (*好*) AND _language : (zh-CN), which return the desired results.
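The idea can be sketched in a few lines of Python. This is an illustration only, not the code running in our Sitecore project: the function names split_search_term and build_query are hypothetical, we assume that only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) count as Chinese, and escaping of Solr special characters is left out.

```python
import re
from typing import List

# Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def split_search_term(term: str) -> List[str]:
    """Split a term on whitespace, making every Chinese character its own keyword."""
    # Surround each Chinese character with spaces, then split on whitespace.
    spaced = CJK_CHAR.sub(lambda match: f" {match.group(0)} ", term)
    return spaced.split()


def build_query(term: str, language: str) -> str:
    """Build a query string in the same shape as the ones we saw sent to Solr."""
    clauses = [f"content : (*{keyword}*)" for keyword in split_search_term(term)]
    clauses.append(f"_language : ({language})")
    return " AND ".join(clauses)


print(build_query("你好", "zh-CN"))
# content : (*你*) AND content : (*好*) AND _language : (zh-CN)
print(build_query("Hello", "en"))
# content : (*Hello*) AND _language : (en)
```

English terms pass through unchanged, so the same code path can serve both languages.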
The search results are somewhat fuzzy, as results like “你不好” will also show up, but for now the workaround is the best balance of budget and outcome for this specific project.
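The fuzziness is easy to see in a simplified model. The Python sketch below treats each wildcard clause as a plain substring check on the whole text, which is an assumption on our part and not how Solr matches indexed tokens internally. The sample documents and the function name matches_all are made up for illustration.

```python
from typing import List


def matches_all(document: str, keywords: List[str]) -> bool:
    """Mimic content : (*k1*) AND content : (*k2*): every keyword must appear somewhere."""
    return all(keyword in document for keyword in keywords)


documents = ["你好", "你不好", "好的", "你们"]

# Searching for "你好" after the workaround means searching for "你" AND "好".
hits = [doc for doc in documents if matches_all(doc, ["你", "好"])]

print(hits)  # ['你好', '你不好']
```

Because the two characters no longer have to be adjacent, “你不好” matches as well as “你好”.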
Final Thoughts
From this case, we can conclude that both Sitecore and partners like us still have a long way to go to localize the CMS behemoth, especially in China. Solr search is just the tip of the iceberg, and there is much more to consider, such as how to integrate with the gigantic WeChat ecosystem.
Thanks to Chee How Yap and Vladyslav Molibozhenko from the Sitecore Support team, who did a great job helping us investigate the issue.
And thank you for reading!