Sitecore Solr Search in Chinese

Recently we have been working on a website in English and Chinese for one of our clients, with a search function implemented on the Sitecore Solr search provider.

The Issue

When we search for English terms, the results are returned normally. But when we search for Chinese terms, the search doesn’t work for keywords that contain more than one character. For example, a search for “你好” returns nothing, while a search for either “你” or “好” returns the correct results.

We monitored the query string sent to the Solr search engine by the Sitecore search service. It looks something like content : (*你好*) AND _language : (zh-CN) for Chinese, and content : (*Hello*) AND _language : (en) for English.

In our case, the query content : (*你好*) AND _language : (zh-CN) doesn’t work, but a query like content : (*你*) AND _language : (zh-CN) does. On the website, the visible symptom is that any Chinese search containing more than one character returns an empty result.

The Investigation

We searched the internet and asked the Sitecore Support team for help. The official solution is to use a dedicated analyzer and tokenizer for Simplified Chinese, such as “HMMChineseTokenizerFactory”, together with a stopword list to recognize stop words. We tried that, and it helped, but only a little. Only very common Chinese words like “你好” (Hello) were recognized; in most cases the search still returned no results.

We were stuck at that point, until I read an article by Jacques Sham that explains how differently Chinese and English are tokenized:

The major difference between Chinese and English is that structure of writing. The problem of NLP in Chinese is: If you tokenize Chinese characters from the articles, there is no whitespace in between phrases in Chinese so simply split based on whitespace like in English may not work as well as in English.

Jacques Sham, NLP : Tokenizing Chinese Phrases

That settled the question, and made us aware that solving the problem completely would take much more effort. Therefore we decided to use a workaround.

The Workaround

We did notice that the search works when there is only one Chinese character. So by manipulating the search terms passed to the search service and adding a space between the Chinese characters, we get Solr query strings like content : (*你*) AND content : (*好*) AND _language : (zh-CN), which return the desired results.
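The term manipulation described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names `split_cjk` and `build_query` are hypothetical, the character range covers only the basic CJK Unified Ideographs block, and in the real project the query string is produced by the Sitecore search service rather than assembled by hand.

```python
import re

# One character from the basic CJK Unified Ideographs block.
# Assumption for this sketch: extension blocks are not needed.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def split_cjk(term: str) -> list[str]:
    """Split a search term so that every Chinese character is its own token.

    Runs of other characters (for example English words) stay together
    and are separated only by whitespace.
    """
    tokens = []
    buffer = ""
    for ch in term:
        if CJK_CHAR.match(ch):
            # Flush any pending non-Chinese run, then emit the character alone.
            if buffer:
                tokens.append(buffer)
                buffer = ""
            tokens.append(ch)
        elif ch.isspace():
            if buffer:
                tokens.append(buffer)
                buffer = ""
        else:
            buffer += ch
    if buffer:
        tokens.append(buffer)
    return tokens


def build_query(term: str, language: str) -> str:
    """Illustrate the query shape: one wildcard clause per token."""
    clauses = [f"content : (*{token}*)" for token in split_cjk(term)]
    clauses.append(f"_language : ({language})")
    return " AND ".join(clauses)
```

With this sketch, `build_query("你好", "zh-CN")` yields the three-clause query quoted above, while an English term such as "Hello" still produces a single content clause. Passing `" ".join(split_cjk(term))` to the search service is the equivalent of "adding a space between the Chinese characters".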

The search results are somewhat fuzzy, as results like “你不好” will also show up, but for now the workaround offers the best balance of budget and outcome for this specific project.

Final Thoughts

From this case, we can conclude that both Sitecore and partners like us still have a long way to go in localizing the CMS behemoth, especially for China. Solr search is just the tip of the iceberg, and there is much more to consider, such as how to integrate with the gigantic WeChat ecosystem.

Thanks to Chee How Yap and Vladyslav Molibozhenko from the Sitecore Support team; you did a great job helping us investigate the issue.

And thank you for reading!

Fit Sitecore 9 databases in Aliyun RDS

Background

Aliyun is a major cloud infrastructure provider in China. Like Amazon’s AWS and Google’s GCP, it offers RDS (Relational Database Service) as an alternative to traditional database virtual machines. This time it was our client’s database of choice, and my challenge was to fit the Sitecore 9 databases into it.

As you have probably experienced, installing Sitecore with RDS always makes life a little harder, as your database user will not have the full power of the sysadmin role. That’s not a big deal for legacy versions of Sitecore like 7 or 8. For 9.0 and later, however, SIF (Sitecore Installation Framework) uses contained database authentication by default. That’s good for fine-grained access control, but the side effect is that contained database authentication must be enabled on your database server, which can only be done by a SQL Server administrator.

The challenge

For AWS RDS, you can overcome the issue by changing database parameters, by having an AWS support engineer do it for you, or even by having them temporarily grant you administrator privileges (for this part, read this great article: https://jeroen-de-groot.com/2018/07/19/deploying-sitecore-9-in-aws-rds/). But do these options work for Aliyun RDS too? The simple answer is: NO. There is no place to set database parameters in the Aliyun RDS console, and their cloud engineers simply refuse to run scripts for you or to grant you admin privileges, even for a single second.

The workaround

Since I didn’t want to heavily customize my SIF code, I did some research and finally found a workaround. Here it is:

  1. First, install SQL Server Express on the server where you are going to install Sitecore, then install the Sitecore instance with its database settings pointing to this Express database. That allows you to complete the Sitecore installation; we will handle the connection strings and other details later.
  2. Then, install a Sitecore XP0 standalone instance, with exactly the same instance name and database name settings as your real instance, on a Windows server with SQL Server installed. This step creates the set of databases to be migrated to Aliyun RDS. Alternatively, you can directly use the databases created in step 1.
  3. You now have two options: use Aliyun’s DTS (Data Transmission Service), or use Aliyun’s OSS (Object Storage Service) database migration to move the databases to Aliyun RDS. DTS is an online service that transfers databases from a source to a target database instance, so to use it your source database needs to either have a public IP or be located in your Aliyun VPC. OSS is much like Amazon S3: after you upload your database backup files, you can choose them as the restore source for your RDS instances. I chose the OSS route because I installed the databases for migration on my local PC.
  4. Change the connection strings. Update all SQL Server connection strings to use your SQL account and RDS data sources, replacing the contained authentication accounts and local data sources.
  5. Clean up the Xdb.Collection.ShardMapManager database. This database defines the shard settings for each shard map in the __ShardManagement.ShardsGlobal table. You need to create a script that updates these fields to reflect the RDS data sources (and the database names, if you changed them during the migration).

After these steps, your Sitecore instance should be running well with your Aliyun RDS databases.
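Step 4 is mechanical enough to script. The following is a minimal Python sketch, under several assumptions that are not from the installation itself: the connection strings live in a `<connectionStrings>` XML file with `<add name="..." connectionString="..."/>` entries, SQL Server entries can be recognized by an `Initial Catalog` or `Database` key, and the server name, user and password shown in the usage note are placeholders. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def retarget(value: str, server: str, user: str, password: str) -> str:
    """Point one SQL Server connection string at a new server and login.

    Non-SQL entries (for example a Solr URL) are returned unchanged.
    Only keys that already exist are rewritten; key order is preserved.
    Limitation of this sketch: values containing ';' are not supported.
    """
    pairs = []
    for piece in value.split(";"):
        if piece.strip():
            key, _, val = piece.partition("=")
            pairs.append([key.strip(), val.strip()])

    keys = {key.lower() for key, _ in pairs}
    if "initial catalog" not in keys and "database" not in keys:
        return value  # assumption: not a SQL Server connection string

    replacements = {"data source": server, "user id": user, "password": password}
    for pair in pairs:
        new_value = replacements.get(pair[0].lower())
        if new_value is not None:
            pair[1] = new_value
    return ";".join(f"{key}={val}" for key, val in pairs)


def retarget_config(xml_text: str, server: str, user: str, password: str) -> str:
    """Apply retarget() to every <add connectionString="..."> element."""
    root = ET.fromstring(xml_text)
    for element in root.iter("add"):
        current = element.get("connectionString")
        if current is not None:
            element.set("connectionString", retarget(current, server, user, password))
    return ET.tostring(root, encoding="unicode")
```

For example, `retarget("Data Source=(local);Initial Catalog=sc9_Core;User ID=coreuser;Password=old", "rds-host.example.com", "sc_app", "new")` keeps the database name and swaps the other three values. Note that `ElementTree` drops XML comments and the XML declaration when it rewrites a file, so review the output before replacing a real config. The shard tables from step 5 still need their own update script.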

Other Thoughts

You may be wondering whether there are any consequences of changing contained authentication users to SQL users in the connection strings. In fact, contained database authentication is only used during installation and has no impact on the application afterwards. Whether you use SQL users or contained users, the application will work fine.

Thanks for reading.

SXA Creative Exchange Export Generating Incorrect Links – A Language Related Bug

Recently I have been working on a multi-language Sitecore project with SXA enabled. One day I noticed a weird phenomenon: I have a page called “Entertainment”, but none of my exported pages could link to it, because the links to this page were “mysite.local/tertainment” everywhere. The “En” was missing from the link’s URL.

Since all links worked except those pointing to this page and its sub-pages, the problem was definitely not a misspelled link field. So why was “En” missing from “Entertainment”? I did export the site in English; could it be related to language? From there I started to look into the SXA Creative Exchange export pipeline. The pipeline is very long, but I managed to locate the processor responsible for handling the links.

It is Sitecore.XA.Feature.CreativeExchange.Pipelines.Export.PageProcessing.FixInternalLinks in the Sitecore.XA.Feature.CreativeExchange library. It has a “TrimLanguage” method that removes the embedded language code from the exported links.

So if the link starts with the current language code “en”, the prefix gets trimmed, which means something like “entertainment/joy” is treated the same way as “en/about”. That’s why my links didn’t work. Fortunately we can fix it easily by checking for “en/” instead of “en” at the start of the link.
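The difference between the two checks can be shown with a small Python sketch. It only mirrors the behavior described here and is not the SXA source code; the function names are hypothetical.

```python
def trim_language_buggy(link: str, language: str) -> str:
    """Trim whenever the link merely starts with the language code."""
    if link.startswith(language):
        return link[len(language):]
    return link


def trim_language_fixed(link: str, language: str) -> str:
    """Trim only when the language code is a complete leading path segment."""
    prefix = language + "/"
    if link.startswith(prefix):
        return link[len(prefix):]
    return link
```

With the language set to "en", the first version turns "entertainment/joy" into "tertainment/joy", while the second leaves it untouched and still reduces "en/about" to "about".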

Then we can patch it in, overriding the existing pipeline processor, and export again. And yes, the exported site now has the correct links.

Thanks for reading, hope this article helps.

Sitecore Administrative Tool – Database Browser

We know that in the Sitecore Content Editor, if we want to remove a version of an item, we just go to the “Versions” tab and click the “Remove” button. Sounds easy enough, but recently I ran into a related issue that I had to use the Database Browser administrative tool to resolve.

I was working on some templates and accidentally added another language version of a template. Templates shouldn’t have versions, and you cannot add a version to a template the way you do with other items, by clicking the “Add” button on the “Versions” tab. But guess what, you can add one by switching to another language and clicking the “Add a new version” link in the no-version warning.

Naturally, I wanted to remove that accidentally added version. However, there are no version commands for template items in the Sitecore ribbon. Presumably the assumption is that templates won’t have versions, so no remove button is needed, yet the possibility of adding one was left open. This can happen not only this way but also in other situations, such as using a script to create versions.

So how should I delete this version? Would I end up using SQL commands to operate on the database directly? Yes to operating on the database, but through the Sitecore Database Browser. This administrative tool is available at sitecore/admin/dbbrowser.aspx. It looks much like the Content Editor with raw values mode on, but with some very useful bonus features, such as deleting an item’s children and, most important in my situation, the “delete version” command for all items, including template items.

So with the Database Browser I could delete the unwanted version. It is also very useful when you want to delete a bunch of an item’s children without deleting the item itself.

Thanks for reading.

Troubleshooting Server Error Pages of Sitecore Experience Accelerator (SXA)

The Sitecore Experience Accelerator, well known as SXA, comes with a handy feature for setting up 404 and 500 pages.

I decided to take advantage of this feature in a recent project. The 404 page worked perfectly, yet the 500 page didn’t. The static page was generated successfully, but the site never redirected to it.

As a first troubleshooting step, I needed to locate the pipeline processor that handles the HTTP request when the error occurs, by checking the Sitecore admin Show Config tool (/sitecore/admin/showconfig.aspx).

At this point, I could dig into the backend code of SXA for more clues. Here we are looking at the Process method of the processor. Since SXA is a multisite solution, many static 500 pages can be generated. When a server error occurs, the processor needs to determine which site the static page belongs to, then redirect the request to the correct static page. It does this using SiteInfoResolver in the Sitecore.XA.Foundation.Multisite library.

As soon as I navigated to SiteInfoResolver, I found the smoking gun: it compares the current host name with the host name field on SXA’s site definition item, where I was using a wildcard host name to cover all sub-domains! Because it’s a plain string comparison, it can’t determine which site the static page belongs to.

This looks like a design flaw in SXA: the host name field accepts both wildcard and normal host names, but the site info resolver handles only the latter. I could patch SiteInfoResolver to fix this, but it’s a foundation project and many SXA features rely on it. So this time, I just changed my host name setting to use normal host names like dev.example.local. After that, the 500 page worked perfectly. Hopefully Sitecore will fix this in a future release of SXA.
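To make the mismatch concrete, here is a small Python sketch that contrasts a plain string comparison with a wildcard-aware one. It is not the SXA implementation and not a proposed patch; the function names are hypothetical, and the shell-style matching from Python's `fnmatch` module stands in for whatever wildcard rules a real fix would need.

```python
from fnmatch import fnmatchcase


def matches_exact(request_host: str, site_host: str) -> bool:
    """Plain comparison: a wildcard in site_host is treated as a literal."""
    return request_host.lower() == site_host.lower()


def matches_wildcard(request_host: str, site_host: str) -> bool:
    """Treat '*' in site_host as 'any run of characters'."""
    return fnmatchcase(request_host.lower(), site_host.lower())
```

For a request to dev.example.local and a site host name of *.example.local, `matches_exact` returns False, which corresponds to the failure described above, while `matches_wildcard` returns True. Both return True once the site uses the normal host name dev.example.local.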

Thanks for reading.