Sitemaps and Multilingual Websites

January 6th, 2009

When I began architecting the site I’ve been working on, I knew I would need to address two issues (among many others, but for now we’ll cover just these two): 1) how to physically structure a multilingual website, and 2) how to structure the sitemap.xml for the site as a whole.

First I had to decide how the site should be physically structured. Would a subdomain per language be good, e.g., en.mysite.com for English, es.mysite.com for Spanish, and ru.mysite.com for Russian? Or would it be better to use directories for the distinction, e.g., www.mysite.com/en/ for English, and so on? If I chose the subdomain route, it would be easy to build a sitemap.xml file for each subdomain. But how would I structure the sitemap.xml if I used directories?

I chose to use directories for a couple of reasons. One, my understanding was that Google treats subdomains as entirely separate websites. I didn’t want that, because semantically these were three translations of the same website, and I felt that should be reflected in their structure. Two, I didn’t want to have multiple datasets when dealing with analytics, whether that meant multiple log files to analyze or multiple profiles in one of the myriad JavaScript-based analytics packages. Yes, I’m fully aware that there are ways to glom datasets together, or otherwise make analytics packages aware of your structure… this was just pure personal preference.

OK, so now that I have my structure in place, how do I build the sitemap.xml? I don’t want one huge monolithic file for the entire site. Even though at current count there are only around 100 HTML files per translation (not huge by any means, but also not insignificant), I would personally prefer to keep each translation in its own separate sitemap file. Those of you familiar with sitemaps will have been shouting at your monitors by now, “Use a sitemap index, dork!”, and you’d be right. I just wasn’t sure that Google would support this. Google didn’t seem to mention it anywhere in their Webmaster Tools documentation (though I could have just missed it).

I’m happy to report that Google does in fact support sitemap indexes, and I’m fairly certain that MSN and Yahoo! do as well. So, simply build yourself a sitemap_index.xml file (the filename is arbitrary) that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <sitemap>
          <loc>http://www.mysite.com/sitemap_en.xml</loc>
     </sitemap>
     <sitemap>
          <loc>http://www.mysite.com/sitemap_es.xml</loc>
     </sitemap>
     <sitemap>
          <loc>http://www.mysite.com/sitemap_ru.xml</loc>
     </sitemap>
</sitemapindex>
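
If you’d rather not type the index by hand, here is a minimal sketch in Python that produces the same structure using only the standard library. The function name build_sitemap_index and the list of language codes are my own placeholders for illustration, not part of any tool mentioned in this post.

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a string) listing each sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical language list; one sitemap file per translation.
    languages = ["en", "es", "ru"]
    urls = ["http://www.mysite.com/sitemap_%s.xml" % lang for lang in languages]
    print(build_sitemap_index(urls))
```

The output is not pretty-printed, but crawlers do not care about whitespace. Adding a translation later is just a matter of adding its code to the list.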

Then build your individual sitemap files as you normally would. You can find the full specifications for sitemaps at sitemaps.org, and a nifty utility to help you automatically build sitemap files at the google-sitemap_gen project. Don’t forget to reference your new sitemap index file in your robots.txt file! Enjoy.
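
For reference, here is a rough sketch of what those last two steps could look like in Python: one function that builds a per-language sitemap (a urlset of url/loc entries, per the sitemaps.org protocol), and one that produces the Sitemap: line for robots.txt. The function names and the example page URLs are assumptions of mine for illustration; optional elements such as lastmod are left out to keep it short.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_language_sitemap(page_urls):
    """Return a sitemap document (as a string) for one translation's pages."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in page_urls:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = page
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_sitemap_line(index_url):
    """Return the robots.txt directive that points crawlers at the index."""
    return "Sitemap: " + index_url


if __name__ == "__main__":
    # Hypothetical pages for the English translation.
    pages = [
        "http://www.mysite.com/en/",
        "http://www.mysite.com/en/about.html",
    ]
    print(build_language_sitemap(pages))
    print(robots_sitemap_line("http://www.mysite.com/sitemap_index.xml"))
```

The Sitemap: line takes the full URL of the index and can go anywhere in robots.txt.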

UPDATED: December 8, 2009 – Corrected my syntax on the XML. D’oh!

3 Responses to “Sitemaps and Multilingual Websites”

  • Lembit says:

    Not easy to find indeed, but Google has briefly covered sitemap indexes in Webmaster Tools Help: http://www.google.com/support/webmasters/bin/answer.py?answer=71453

    P.S. What brought me here was the query “xml sitemap for multilingual site” at Google that returned this page as No 1 result.

  • Thomas says:

    If the sitemap protocol itself doesn’t actually say anything about language, then it would seem that your sitemap structure (using the index with different sitemaps for different languages) is purely for organizational purposes?

    Thanks!

  • @Thomas: For the most part, yes, it’s about organization. I use a Python script I found some time ago to recurse a given URL and generate a sitemap, which I then tweak by hand to make sure things like blog category links, tag links, or other redundant entries are not included. If my sitemaps are more granular, it’s easier for me to sift through the stuff I don’t want. Also, the individual translations may not have identical content, and may not be updated simultaneously. For instance, the site’s blog and e-commerce sections are only done in English, so when those parts are updated, I only regenerate the sitemap for the English portion of the site. Make sense?
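
The hand-tweaking described above could be partly automated. The following is only a sketch of the idea, not the script mentioned in the reply: it assumes category and tag pages can be recognized by a "category" or "tag" path segment, and that each translation lives under a /en/, /es/, or /ru/ directory. All names here are hypothetical.

```python
from urllib.parse import urlparse

# Assumed path segments that mark redundant listing pages.
EXCLUDED_SEGMENTS = {"category", "tag"}


def keep_url(url):
    """Return True unless the URL path contains an excluded segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return not any(s in EXCLUDED_SEGMENTS for s in segments)


def urls_for_language(urls, lang):
    """Select the URLs under /<lang>/ that survive the exclusion filter."""
    prefix = "/%s/" % lang
    return [
        u for u in urls
        if urlparse(u).path.startswith(prefix) and keep_url(u)
    ]


if __name__ == "__main__":
    crawled = [
        "http://www.mysite.com/en/blog/post-1.html",
        "http://www.mysite.com/en/blog/category/news/",
        "http://www.mysite.com/en/blog/tag/xml/",
        "http://www.mysite.com/es/index.html",
    ]
    # Only the English sitemap needs regenerating when the blog changes.
    print(urls_for_language(crawled, "en"))
```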
