Fanjian Translation
 

KanHan Technologies Limited

Introduction

Most people who read Chinese, including native speakers, share the misconception that the two systems of Chinese characters, Simplified Chinese ("SC") and Traditional Chinese ("TC"), correspond directly with each other, and that converting between them requires only simple code-to-code mapping. It actually takes much more than that, as we will see below.

Background

Before the establishment of the People's Republic of China in 1949, there was only one Chinese script, Traditional Chinese. In the 1950s the Communist government introduced a new script, named Simplified Chinese, which has been the official script of the PRC ever since. Among the SC characters, some are existing characters; others are simplified forms of traditional characters that were commonly believed to be too difficult to memorize. The latter total 2,244 characters according to the latest edition of the Comprehensive List of Simplified Characters, published in 1986.

Hong Kong, Macau, Taiwan and most overseas Chinese communities, however, still use Traditional Chinese, while the PRC and Singapore use Simplified Chinese.

The Complexities of Conversion

1. Many simplified characters are no longer recognizable from their traditional forms, e.g. SC - TC

2. In numerous cases, one Simplified character corresponds to two or more Traditional forms. Sometimes only one of these is correct; sometimes more than one may be acceptable, depending on the context:

SC Source   TC Target   Pinyin   Meaning      TC Example (English gloss)
发          發          fa1      emit         start off
发          髮          fa4      hair         hair
干          乾          gan1     dry          dry
干          幹          gan4     trunk        able, strong
干          干          gan1     intervene    interfere with
干          榦          gan4     tree trunk   central figure
面          麵          mian4    noodles      noodle soup
面          面          mian4    face         mask
后          後          hou4     after        day after tomorrow
后          后          hou4     queen        queen
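
The difference between plain code mapping and word-aware mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our engine: the tables `CHAR_MAP` and `WORD_MAP` and both function names are hypothetical, and a real system needs far larger tables.

```python
# Hypothetical one-to-one table: each SC character gets a single default TC form.
CHAR_MAP = {"发": "發", "头": "頭", "后": "後", "干": "幹", "汤": "湯"}

# Hypothetical word-level table: whole words resolve the ambiguous characters.
WORD_MAP = {"头发": "頭髮", "汤面": "湯麵", "王后": "王后", "干燥": "乾燥"}


def code_convert(text):
    """Naive conversion: map every character independently."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def word_convert(text, max_len=4):
    """Greedy longest-match conversion: try multi-character words first,
    then fall back to the one-to-one character table."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += n
                break
        else:
            # No word matched at this position: convert a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(code_convert("头发"), word_convert("头发"))
print(code_convert("王后"), word_convert("王后"))
```

With these toy tables the naive function picks the wrong form for "hair" and "queen", while the word-aware function picks the forms listed in the table above.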

3. Encoding: SC is encoded in GB2312-80 or GBK; TC is encoded in Big5. The two standards are not compatible, and each lacks numerous characters that the other contains:

GB Code   Big5 Code
BFB4      ACDD
(none)    BA7E
BBA5      A4AC
C170      (none)
BAF4      (none)
(none)    B36E
BCFE      A5F3
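
The coverage gap between the two standards can be checked with the codecs in Python's standard library. This is a minimal sketch; the helper name `can_encode` and the three sample characters are our own choices for illustration.

```python
def can_encode(ch, codec):
    """Return True if the character has a code point in the given encoding."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A character shared by both scripts, a simplified-only form,
# and its traditional counterpart.
for ch in ("看", "汉", "漢"):
    print(ch, "GB2312:", can_encode(ch, "gb2312"), "Big5:", can_encode(ch, "big5"))

# Even a character present in both standards gets different bytes in each.
print("看".encode("gb2312") != "看".encode("big5"))
```

GBK, which extends GB2312, also covers many traditional forms, so results differ if `"gbk"` is used in place of `"gb2312"`.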

4. Vocabulary: SC mainly, but not necessarily, follows the vocabulary usage of Mainland China, whereas TC follows that of Taiwan and Hong Kong.

Another common misconception is that one can switch a Chinese website from Traditional Chinese to Simplified Chinese, or vice versa, simply by choosing a different encoding in the browser. Inevitably, this only results in screenfuls of indecipherable symbols.
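
What happens when a browser is told to read a Simplified page as Big5 can be reproduced in a short Python sketch. The sample string is our own; the point is only that the same bytes are reinterpreted, not converted.

```python
text = "简体中文网页"           # a Simplified Chinese sample string
raw = text.encode("gb2312")    # the bytes a GB2312 page would send

# Switching the browser encoding is equivalent to decoding the same bytes as Big5.
# errors="replace" substitutes a placeholder for byte pairs Big5 cannot interpret.
garbled = raw.decode("big5", errors="replace")

print(garbled)                        # unrelated characters, not Traditional Chinese
print(raw.decode("gb2312") == text)   # the correct codec recovers the original
```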

Our Approach

In the paper "The Pitfalls and Complexities of Chinese to Chinese Conversion", Jack Halpern and Jouni Kerman of the CJK Dictionary Institute offer tremendous insight into the philosophy of Chinese-to-Chinese conversion. We are grateful for their work, humbly believe that we share their understanding, and try to evaluate our software against it.

In the context of Halpern and Kerman's work, our research found that Code and Orthographic conversion rules account for an accuracy level of up to 99%, because of the grammatical proximity of TC and SC. The accuracy can be boosted further, to more than 99.9%, by incorporating carefully selected Lexemic and Contextual rules. However, the extra accuracy comes at a huge cost: performance drops significantly, especially when sophisticated Contextual rules are in place.

In order to strike a good balance between performance and accuracy, we opted to forsake the Contextual and Lexemic approach. This development strategy has made our translation engine one of the most sophisticated, high-performance real-time conversion platforms between TC and SC available in the commercial marketplace. On a Pentium III/800 CPU, our translation engine consistently delivers a conversion rate exceeding 10,000 characters per second while maintaining an average accuracy of more than 95%.
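
The two figures quoted above, accuracy and characters per second, can be measured along the following lines. This sketch rests on assumptions of ours: accuracy is taken as the share of characters that match a hand-checked reference, and the function names, the toy converter and the sample text are all hypothetical rather than our actual test procedure.

```python
import time


def accuracy(converted, reference):
    """Share of positions where the converted text matches the reference."""
    if len(converted) != len(reference):
        raise ValueError("texts must have the same length")
    if not reference:
        return 1.0
    matches = sum(a == b for a, b in zip(converted, reference))
    return matches / len(reference)


def throughput(convert, text, repeats=100):
    """Characters converted per second, averaged over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        convert(text)
    elapsed = time.perf_counter() - start
    return len(text) * repeats / elapsed if elapsed > 0 else float("inf")


# Toy code-level converter built on str.translate (illustrative table only).
table = str.maketrans({"头": "頭", "发": "發"})


def convert(text):
    return text.translate(table)


print(accuracy(convert("头发"), "頭髮"))     # one of two characters is correct
print(throughput(convert, "头发" * 1000))    # machine-dependent figure
```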

Our Implementation

We implemented our translation engine in both C and Java. Both versions are currently in production use on platforms ranging from low-end Linux and Windows workstations to high-end Solaris and AIX servers. A COM version is also available for Windows-based applications, and an experimental PHP extension was even implemented at one point. The Orthographic conversion rules that govern multiple-character mapping and vocabulary substitution can be updated in the rule database files, which allows the engine to adapt to different contexts without recompilation.
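
The idea of keeping rules in data files rather than in code can be sketched as follows. The file format shown here (one tab-separated source and target pair per line, with `#` comments) is purely an assumption for illustration and is not the format of our rule database; the function names are likewise hypothetical, and the vocabulary pairs follow Taiwan usage.

```python
import io


def load_rules(lines):
    """Parse 'source<TAB>target' lines into a dict, skipping blanks and comments."""
    rules = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split("\t", 1)
        rules[source] = target
    return rules


def apply_rules(text, rules):
    """Greedy longest-match substitution; unmatched characters pass through."""
    max_len = max(map(len, rules), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Stand-in for a rule file on disk.
RULE_FILE = "# vocabulary substitution\n软件\t軟體\n信息\t資訊\n# character mapping\n头发\t頭髮\n"
rules = load_rules(io.StringIO(RULE_FILE))
print(apply_rules("软件信息", rules))
```

Editing a line of the rule file, for example mapping 软件 to 軟件 for a Hong Kong audience, changes the output the next time the rules are loaded, with no change to the code.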