Our First W3C TPAC & Our Mission to Standardize the Multilingual Web

Inspiring conversations with Tim Berners-Lee, a nice chat with Mike Smith: W3C member Cocomore was at the 2012 W3C TPAC in Lyon, France. The W3C Combined Technical Plenary / Advisory Committee Meetings Week, or TPAC, brings together W3C Groups, the Advisory Board, the Technical Architecture Group (TAG) and the Advisory Committee for coordinated work and exciting sessions about current W3C topics.

Besides the opportunity to get in touch with developers and managers of global players like Google, Microsoft, Facebook, Mozilla, Sony, Adobe and Panasonic, it was great to see that medium-sized companies like us are also considered vital to the future of the Web. For us it has been great to experience first-hand that companies like Cocomore can influence, through the W3C, what the Web will look like in the future. The HTML working group was intensely discussing current issues with the HTML5 standard. After the Last Call period, 53 attendees and 71 observers started to gather further feedback from the community. It really was fascinating to see developers from all over the world passionately exchanging their positions to reach a common goal: defining HTML5, the future of the Web.

The MultilingualWeb-LT Working Group moved forward at a similar pace. W3C members interested in driving the multilingual Web forward met for two days to finalize the standard for the Last Call phase. As the Last Call deadline of November 30th, 2012 was close, the group discussed final changes to stabilize the standard.

Developing the Multilingual Web with Drupal

But what is the actual work we’re doing within W3C as a Drupal agency? First, we are developing a new standard called ITS 2.0, the Internationalization Tag Set. “It is designed to foster the creation of multilingual Web content, focusing on HTML, XML based formats in general, and to leverage localization workflows based on the XML Localization Interchange File Format (XLIFF).” Second, building on ITS 2.0, the MultilingualWeb-LT initiative is able to provide an open source solution for a standardized workflow covering both human and machine translation.

Our mission within the W3C is to develop a new standard for the Multilingual Web. Our work in the W3C’s MLW-LT working group matters because there are currently three gaps in the chain of content processing and consumption on the Web. As anyone who has worked on a multilingual Drupal project knows, dealing with multilingual content and translations is a complex endeavor that can give developers, translators and content creators nightmares. What if some words, such as names, countries, products or brands, should not be translated? The first gap is that the usual machine translation services don’t “know” about metadata in the source content, such as which parts of the text should be translated and which should not. The second gap is that the same issues are hidden in the databases from which the translated text has been generated: up to now, there is little description available of the data that served as the basis for generating the front-end web pages. Gap number three is the lack of a clear description of the metadata itself.

Thus the mission of the MultilingualWeb-LT project is: Develop methods, guidelines and standards to enhance the quality, (re)usability and interoperability of language datasets and processing tools; promote and support open repositories of research results and development/training resources of general interest. (http://www.w3.org/International/wiki/images/6/6d/MultilingualWeb_TA_v5_Feb2010.pdf)

The solution: For the last year we have been developing a Drupal module called mlw_lt, which content editors can use to mark up text passages so that, for example, the non-changing parts of a text are immediately visible. Editors can use the module to place specific tags within the text without touching the code, which eases the human and machine translation workflow. The view resembles a common WYSIWYG editor. Special editing buttons highlight text parts that matter for the machine translation process on the one hand and for the human translator on the other. Markup for words that should not be translated, locales and further text descriptions can be added.
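To make the idea of "do not translate" markup more concrete, here is a minimal sketch in Python. It assumes the marked-up content is HTML5 and uses the standard `translate` attribute, which ITS 2.0 relies on for its Translate data category in HTML. The helper name `split_segments` and the sample sentence are made up for illustration; the post does not show what mlw_lt actually emits, and a real translation service would process the markup differently.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the inheritance stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TranslateFlagParser(HTMLParser):
    """Collect (text, translatable) pairs, honoring inherited translate flags."""

    def __init__(self):
        super().__init__()
        self.stack = [True]   # content is translatable by default
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        value = dict(attrs).get("translate")
        if value == "no":
            flag = False
        elif value == "yes":
            flag = True
        else:
            # Simplification: any other value inherits from the parent element.
            flag = self.stack[-1]
        self.stack.append(flag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((text, self.stack[-1]))


def split_segments(html):
    """Hypothetical helper: return text segments with their translate flag."""
    parser = TranslateFlagParser()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    sample = '<p>Meet <span translate="no">Cocomore</span> in Lyon.</p>'
    for text, translatable in split_segments(sample):
        print("translate" if translatable else "keep as-is", "->", text)
```

A translation step could then pass only the translatable segments to the machine translation engine and copy the protected ones through unchanged, which is exactly the kind of metadata that plain machine translation services lack.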

The enriched text is then sent from the Drupal CMS to a Language Service Provider (LSP) at a language agency, where machine translation produces the first translation, taking the standardized metadata into account. Once the first draft is finished, the text is sent to a human translator who reviews it for quality assurance. The completed translation then returns to the Drupal CMS to be used. To ease this return process, we have developed an automatic synchronization that checks every 15 minutes whether the text has been translated. If it has, the CMS automatically pulls it back.
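The synchronization step described above boils down to a polling loop. The following Python sketch illustrates the pattern only; it is not the module's code. The function name `poll_for_translation`, the `fetch` callable and the `max_attempts` limit are assumptions made for the example, and in a real Drupal installation this job would presumably be handled by the CMS's own scheduling rather than a blocking loop.

```python
import time

POLL_INTERVAL_SECONDS = 15 * 60  # the 15-minute interval mentioned in the text


def poll_for_translation(fetch, interval=POLL_INTERVAL_SECONDS,
                         max_attempts=None, sleep=time.sleep):
    """Call fetch() until it returns a translation, waiting between attempts.

    fetch is a hypothetical callable that returns the translated text,
    or None while the LSP has not finished yet.
    Returns the translation, or None if max_attempts is exhausted.
    """
    attempt = 0
    while True:
        attempt += 1
        result = fetch()
        if result is not None:
            return result          # translation is ready: pull it back
        if max_attempts is not None and attempt >= max_attempts:
            return None            # give up after the configured number of tries
        sleep(interval)            # wait before asking the LSP again


if __name__ == "__main__":
    # Simulated LSP: not ready on the first two checks, ready on the third.
    answers = iter([None, None, "translated text"])
    # A no-op sleep keeps the demonstration instant.
    print(poll_for_translation(lambda: next(answers), sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting 15 minutes between checks.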

The MultilingualWeb-LT project, funded by the European Commission, runs until mid-2013, and we welcome any feedback from the community on our Drupal module mlw_lt.