A Pinyin Parser for MoinMoin

When writing Mandarin in Hanyu Pinyin, the tones can be represented either with numbers after each syllable or with a diacritical mark above the last vowel (in most cases). Thus "xiao3 long2 tang1 bao1" can also be written "xiǎo lóng tāng bāo".

I find pinyin with tone marks far more legible than pinyin with tone numbers. On the other hand, typing the tone marks is a real pain, at least on a US keyboard! "Wouldn't it be nice," I thought to myself, "if I could type pinyin with tone numbers and then convert it to pinyin with tone marks?"

I came across a solution on the internet: Pinyin Joe's Word macro, which cleverly converts pinyin with tone numbers into pinyin with diacritical marks. The macro performs the hard work of 1) figuring out which vowel the diacritical mark hovers over (silly rules), and 2) mapping the vowel-plus-mark combination to the appropriate Unicode code point. "Wouldn't it be swell," I mused, "if I could type pinyin with tone numbers into my wiki and then have it rendered as pinyin with tone marks?"
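The two steps described above can be sketched in a few lines of Python 3. This is only an illustration, not Pinyin Joe's macro: it assumes the commonly stated placement rule (an a or e takes the mark; in "ou" the o takes it; otherwise the last vowel does), and it builds the accented character by composing the vowel with a combining mark instead of looking it up in a table. The names `mark_index` and `add_tone` are made up for this sketch, and the input is assumed to be a single lowercase syllable ending in a tone digit.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # \u00fc is u-umlaut


def mark_index(syllable):
    """Return the index of the vowel that carries the tone mark, or None."""
    for vowel in "ae":              # rule 1: a or e always takes the mark
        if vowel in syllable:
            return syllable.index(vowel)
    if "ou" in syllable:            # rule 2: in "ou" the o takes the mark
        return syllable.index("ou")
    # rule 3: otherwise the last vowel takes the mark
    positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    return positions[-1] if positions else None


def add_tone(syllable_with_number):
    """Convert one syllable with a trailing tone digit, e.g. 'xiao3'."""
    base = syllable_with_number[:-1]
    tone = int(syllable_with_number[-1])
    base = base.replace("v", "\u00fc")  # 'v' stands in for u-umlaut
    i = mark_index(base)
    if i is None or tone not in TONE_MARKS:
        return base                 # e.g. neutral tone 5: just drop the digit
    # NFC normalization fuses vowel + combining mark into one code point.
    marked = unicodedata.normalize("NFC", base[i] + TONE_MARKS[tone])
    return base[:i] + marked + base[i + 1:]
```

With these definitions, `add_tone("xiao3")` puts the mark on the a, `add_tone("chou4")` on the o, and `add_tone("lv4")` on the ü.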

As it turns out, I actually know Pinyin Joe from a previous job, and he generously gave me permission to convert his macro into Python. Once I had the conversion in Python form, it was a simple matter to embed the converter into a MoinMoin parser (MoinMoin being the wiki engine underlying WikiPerdido). Now I can simply type "xiao3 chou4 chou4" in a pinyintones region and have it rendered as "xiǎo chòu chòu".


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """
    pinyintones.py

    A MoinMoin plugin parser to convert pinyin with tone numbers
    to pinyin with diacritical marks.

    Markup examples:
    To apply the parser to an entire page:

      #FORMAT pinyintones
      xiao3 chou4 chou4

    Or to apply to a display region:
      { { {#!pinyintones
      xiao3 chou4 chou4
      } } }

    Inspired by Pinyin Joe's Word macro (http://pinyinjoe.com)

    2007 Robert Yu http://www.robertyu.com
    """

    import codecs
    from MoinMoin.parser import text

    #
    # definitions
    # For the pinyin tone rules (which vowel?), see
    # http://www.pinyin.info/rules/where.html
    #
    # map (final) consonant+tone to tone+consonant
    mapConstTone2ToneConst={'n1':'1n',
                            'n2':'2n',
                            'n3':'3n',
                            'n4':'4n',
                            'ng1':'1ng',
                            'ng2':'2ng',
                            'ng3':'3ng',
                            'ng4':'4ng',
                            'r1':'1r',
                            'r2':'2r',
                            'r3':'3r',
                            'r4':'4r'}

    #
    # map vowel+vowel+tone to vowel+tone+vowel
    mapVowelVowelTone2VowelToneVowel={'ai1':'a1i',
                                      'ai2':'a2i',
                                      'ai3':'a3i',
                                      'ai4':'a4i',
                                      'ao1':'a1o',
                                      'ao2':'a2o',
                                      'ao3':'a3o',
                                      'ao4':'a4o',
                                      'ei1':'e1i',
                                      'ei2':'e2i',
                                      'ei3':'e3i',
                                      'ei4':'e4i',
                                      'ou1':'o1u',
                                      'ou2':'o2u',
                                      'ou3':'o3u',
                                      'ou4':'o4u'}

    # map vowel-number combination to unicode hex equivalent
    # ('v' stands in for u-umlaut)
    mapVowelTone2Unicode={'a1':u'\u0101',
                          'a2':u'\u00e1',
                          'a3':u'\u01ce',
                          'a4':u'\u00e0',
                          'e1':u'\u0113',
                          'e2':u'\u00e9',
                          'e3':u'\u011b',
                          'e4':u'\u00e8',
                          'i1':u'\u012b',
                          'i2':u'\u00ed',
                          'i3':u'\u01d0',
                          'i4':u'\u00ec',
                          'o1':u'\u014d',
                          'o2':u'\u00f3',
                          'o3':u'\u01d2',
                          'o4':u'\u00f2',
                          'u1':u'\u016b',
                          'u2':u'\u00fa',
                          'u3':u'\u01d4',
                          'u4':u'\u00f9',
                          # U+01D6 is lowercase u-umlaut with macron;
                          # U+01DB (used previously) is an uppercase letter
                          'v1':u'\u01d6',
                          'v2':u'\u01d8',
                          'v3':u'\u01da',
                          'v4':u'\u01dc'}

    def ConvertPinyinToneNumbers(lineIn):
        """
        Convert pinyin text with tone numbers to pinyin with diacritical marks
        over the appropriate vowel.

        In:  input text.  Must be unicode type.
        Out:  unicode copy of lineIn, tone numbers replaced with diacritical
              marks over the appropriate vowels

        For example:
        xiao3 long2 tang1 bao1 -> xiǎo lóng tāng bāo

        x=u'xiao3 long2 tang1 bao1'
        y=pinyintones.ConvertPinyinToneNumbers(x)
        """

        #
        # make sure input is unicode (Python 2 unicode type)
        assert isinstance(lineIn,unicode)
        lineOut=lineIn

        #
        # first transform:
        # move the tone number in front of a final n, ng, or r
        for (x,y) in mapConstTone2ToneConst.items():
            lineOut=lineOut.replace(x,y)

        #
        # second transform:
        # in ai, ao, ei, ou move the tone number behind the first vowel
        for (x,y) in mapVowelVowelTone2VowelToneVowel.items():
            lineOut=lineOut.replace(x,y)

        #
        # third transform:
        # replace each vowel+number pair with the accented character
        for (x,y) in mapVowelTone2Unicode.items():
            lineOut=lineOut.replace(x,y)

        return lineOut


    class Parser(text.Parser):
        """
        Override standard moinmoin parser by replacing pinyin tone numbers
        with diacritical marks, then call standard moinmoin parser
        """
        def format(self,formatter):
            #
            # replace tone numbers in raw text
            self.raw=ConvertPinyinToneNumbers(self.raw)

            #
            # call the standard parser
            text.Parser.format(self,formatter)


    if __name__=='__main__':
        #
        # test:
        # write mappings to tonemapper_out.txt
        #
        fh=open('tonemapper_out.txt','w')
        for (x,y) in mapVowelTone2Unicode.items():
            print ('%s\n' % x)
            # x is already a plain ASCII string; only y needs encoding
            fh.write('%s %s\n' % (x,y.encode('utf-8')))
        fh.close()

        print ('wrote tonemapper_out.txt\n')

        # test:
        # read and convert pinyin with tone numbers
        lineNumber=0
        fin=codecs.open('WithToneNumbers.txt',mode='r',encoding='utf-8')
        fout=codecs.open('out.txt',mode='w',encoding='utf-8')
        for lineIn in fin:
            lineNumber=lineNumber+1
            lineOut=ConvertPinyinToneNumbers(lineIn)
            fout.write('%d: %s' % (lineNumber, lineOut))

        fin.close()
        fout.close()
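
To try the three-pass replacement idea outside MoinMoin, here is a compact Python 3 sketch. It is an assumption-laden rewrite for experimentation rather than the plugin itself: the names `PASS1`, `PASS2`, `PASS3`, and `convert` are invented here, and the tables are generated (the third one via Unicode NFC composition) instead of being typed out by hand.

```python
import unicodedata

TONES = "1234"
# Combining macron, acute, caron, grave for tones 1-4.
MARKS = dict(zip(TONES, "\u0304\u0301\u030c\u0300"))

# Pass 1: move the tone number in front of a final n, ng, or r.
PASS1 = {c + t: t + c for c in ("n", "ng", "r") for t in TONES}

# Pass 2: in ai, ao, ei, ou the first vowel carries the tone.
PASS2 = {v + t: v[0] + t + v[1]
         for v in ("ai", "ao", "ei", "ou") for t in TONES}

# Pass 3: vowel + tone number -> single accented character.
# 'v' stands in for u-umlaut (\u00fc).
PASS3 = {
    v + t: unicodedata.normalize(
        "NFC", ("\u00fc" if v == "v" else v) + MARKS[t])
    for v in "aeiouv" for t in TONES
}


def convert(text):
    """Apply the three substring-replacement passes in order."""
    for table in (PASS1, PASS2, PASS3):
        for old, new in table.items():
            text = text.replace(old, new)
    return text
```

Calling `convert("xiao3 long2 tang1 bao1")` yields "xiǎo lóng tāng bāo". Because every pass is a plain substring replacement, non-pinyin text is rewritten too (for instance, `convert("win32")` comes out as "wǐn2"), which is a good reason to apply the parser only to regions that contain nothing but pinyin.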



Pinyin Parser for MoinMoin (last edited 2009-01-17 05:56:26 by RobertYu)