A Pinyin Parser for MoinMoin
When writing Mandarin in Hanyu Pinyin, the tones can be represented either with numbers after each syllable or with a diacritical marker above the last vowel (in most cases). Thus:
別學懶惰蟲 (no study lazy bug) is written in pinyin with tone numbers as
bie2 xue2 lan3 duo4 chong2 or with diacritical tone markers as
bié xué lǎn duò chóng
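To make the tone-number notation concrete, here is a minimal Python 3 sketch that splits numbered pinyin into (syllable, tone) pairs. The names `SYLLABLE` and `split_tones` are hypothetical, chosen only for this illustration, and the pattern assumes lowercase-style syllables followed by a single digit from 1 to 4.

```python
import re

# Hypothetical pattern: a run of letters followed by one tone digit (1-4).
SYLLABLE = re.compile(r"([a-zü]+)([1-4])", re.IGNORECASE)


def split_tones(text):
    """Return a list of (syllable, tone) pairs found in numbered pinyin."""
    return [(m.group(1), int(m.group(2))) for m in SYLLABLE.finditer(text)]


if __name__ == "__main__":
    print(split_tones("bie2 xue2 lan3 duo4 chong2"))
```

Each pair carries everything needed to produce the diacritical form: the bare syllable and which of the four tone marks to apply.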
I find the pinyin with tone markers far more legible than pinyin with tone numbers. On the other hand, typing the tone markers is a real pain, at least when using a US keyboard!! "Wouldn't it be nice," I thought to myself, "if I could type pinyin with tone numbers and then convert it to pinyin with the tone markers?"
I came across a solution on the internet in Pinyin Joe's Word macro, which cleverly converts pinyin with tone numbers into pinyin with diacritical marks. The macro performs the hard work of 1) figuring out which vowel the diacritical mark hovers over (silly rules), and 2) mapping the vowel-marker combination to the appropriate Unicode code point. "Wouldn't it be swell," I mused, "if I could type pinyin with tone numbers into my wiki and then have it rendered as pinyin with tone markers?"
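The two steps can be sketched in a few lines of Python 3. This is not Pinyin Joe's macro and not the parser's own code, just an illustration under stated assumptions: the placement rule used is the commonly cited one (an a or e takes the mark; in "ou" the o takes it; otherwise the last vowel does), and the names `COMBINING`, `mark_index` and `place_tone` are made up for this example. Instead of a lookup table, it appends a Unicode combining mark and lets NFC normalization pick the precomposed code point.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # \u00fc is u with diaeresis


def mark_index(syllable):
    """Step 1: return the index of the vowel that carries the tone mark."""
    for vowel in "ae":
        if vowel in syllable:
            return syllable.index(vowel)
    if "ou" in syllable:
        return syllable.index("ou")
    # Otherwise the last vowel in the syllable takes the mark.
    for i in range(len(syllable) - 1, -1, -1):
        if syllable[i] in VOWELS:
            return i
    return None


def place_tone(syllable, tone):
    """Step 2: attach the mark and compose it into a single code point."""
    syllable = syllable.replace("v", "\u00fc")  # "v" is typed for u-diaeresis
    i = mark_index(syllable)
    if i is None or tone not in COMBINING:
        return syllable
    marked = syllable[: i + 1] + COMBINING[tone] + syllable[i + 1:]
    return unicodedata.normalize("NFC", marked)


if __name__ == "__main__":
    for syl, tone in [("hao", 3), ("xue", 2), ("gui", 4), ("lv", 4)]:
        print(syl + str(tone), "->", place_tone(syl, tone))
```

For instance, `place_tone("hao", 3)` gives "hǎo" and `place_tone("gui", 4)` gives "guì".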
As it turns out, I actually know Pinyin Joe from a previous job, and he generously gave me permission to convert his macro into Python. Once I had the conversion in Python form, it was a simple matter to embed the converter into a MoinMoin parser (MoinMoin is the wiki engine underlying WikiPerdido). Now I can simply type:
"hen3 hao3"
and have it rendered as
"hěn hǎo"
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
pinyintones.py

A MoinMoin plugin parser to convert pinyin with tone numbers
to pinyin with diacritical marks.

Markup examples:
To apply the parser to an entire page:

    #FORMAT pinyintones
    xiao3 chou4 chou4

Or to apply to a display region:
    { { {#!pinyintones
    xiao3 chou4 chou4
    } } }

Inspired by Pinyin Joe's Word macro (http://pinyinjoe.com)

2007 Robert Yu http://www.robertyu.com
"""

import codecs
from MoinMoin.parser import wiki

#
# definitions
# For the pinyin tone rules (which vowel?), see
# http://www.pinyin.info/rules/where.html
#
# map (final) consonant+tone to tone+consonant
mapConstTone2ToneConst={'n1':'1n',
                        'n2':'2n',
                        'n3':'3n',
                        'n4':'4n',
                        'ng1':'1ng',
                        'ng2':'2ng',
                        'ng3':'3ng',
                        'ng4':'4ng',
                        'r1':'1r',
                        'r2':'2r',
                        'r3':'3r',
                        'r4':'4r'}

#
# map vowel+vowel+tone to vowel+tone+vowel
mapVowelVowelTone2VowelToneVowel={'ai1':'a1i',
                                  'ai2':'a2i',
                                  'ai3':'a3i',
                                  'ai4':'a4i',
                                  'ao1':'a1o',
                                  'ao2':'a2o',
                                  'ao3':'a3o',
                                  'ao4':'a4o',
                                  'ei1':'e1i',
                                  'ei2':'e2i',
                                  'ei3':'e3i',
                                  'ei4':'e4i',
                                  'ou1':'o1u',
                                  'ou2':'o2u',
                                  'ou3':'o3u',
                                  'ou4':'o4u'}

# map vowel+tone-number combination to the equivalent Unicode character
# ('v' stands for u with diaeresis; U+01D6 is the lowercase first-tone form)
mapVowelTone2Unicode={'a1':u'\u0101',
                      'a2':u'\u00e1',
                      'a3':u'\u01ce',
                      'a4':u'\u00e0',
                      'e1':u'\u0113',
                      'e2':u'\u00e9',
                      'e3':u'\u011b',
                      'e4':u'\u00e8',
                      'i1':u'\u012b',
                      'i2':u'\u00ed',
                      'i3':u'\u01d0',
                      'i4':u'\u00ec',
                      'o1':u'\u014d',
                      'o2':u'\u00f3',
                      'o3':u'\u01d2',
                      'o4':u'\u00f2',
                      'u1':u'\u016b',
                      'u2':u'\u00fa',
                      'u3':u'\u01d4',
                      'u4':u'\u00f9',
                      'v1':u'\u01d6',
                      'v2':u'\u01d8',
                      'v3':u'\u01da',
                      'v4':u'\u01dc'}

def ConvertPinyinToneNumbers(lineIn):
    """
    Convert pinyin text with tone numbers to pinyin with diacritical marks
    over the appropriate vowel.

    In: input text. Must be unicode type.
    Out: unicode copy of lineIn, tone numbers replaced with diacritical marks
    over the appropriate vowels

    For example:
    xiao3 long2 tang1 bao1 -> xiǎo lóng tāng bāo

    x=u'xiao3 long2 tang1 bao1'
    y=pinyintones.ConvertPinyinToneNumbers(x)
    """

    #
    # make sure input is unicode
    assert isinstance(lineIn,unicode)
    lineOut=lineIn

    #
    # first transform: move the tone number in front of a final n, ng or r
    for (x,y) in mapConstTone2ToneConst.items():
        lineOut=lineOut.replace(x,y)

    #
    # second transform: move the tone number onto the first vowel of
    # ai, ao, ei and ou
    for (x,y) in mapVowelVowelTone2VowelToneVowel.items():
        lineOut=lineOut.replace(x,y)

    #
    # third transform: replace vowel+tone number with the accented vowel
    for (x,y) in mapVowelTone2Unicode.items():
        lineOut=lineOut.replace(x,y)

    return lineOut


class Parser(wiki.Parser):
    """
    Override standard moinmoin parser by replacing pinyin tone numbers
    with diacritical marks, then call standard moinmoin parser
    """
    def format(self,formatter):
        #
        # replace tone numbers in raw text
        self.raw=ConvertPinyinToneNumbers(self.raw)

        #
        # call the standard parser
        wiki.Parser.format(self,formatter)


if __name__=='__main__':
    #
    # test:
    # write mappings to tonemapper_out.txt
    #
    fh=open('tonemapper_out.txt','w')
    for (x,y) in mapVowelTone2Unicode.items():
        print ('%s\n' % x)
        fh.write('%s %s\n' % (x.encode('latin-1'),y.encode('utf-8')))
    fh.close()

    print 'write tonemapper_out.txt\n'

    # test:
    # read and convert pinyin with tone numbers
    lineNumber=0
    fin=codecs.open('WithToneNumbers.txt',mode='r',encoding='utf-8')
    fout=codecs.open('out.txt',mode='w',encoding='utf-8')
    for lineIn in fin:
        lineNumber=lineNumber+1
        lineOut=ConvertPinyinToneNumbers(lineIn)
        fout.write('%d: %s' % (lineNumber, lineOut))

    fin.close()
    fout.close()
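
To see why the three transforms run in that order, it can help to trace a single syllable through them. The following is a Python 3 sketch for illustration only (the listing above targets Python 2 and MoinMoin); the table and function names (`FINAL_CONSONANT`, `VOWEL_PAIR`, `VOWEL_TONE`, `trace`) are hypothetical, and the tables are rebuilt compactly here rather than copied from the parser.

```python
import unicodedata

TONES = "1234"
MARKS = ("\u0304", "\u0301", "\u030c", "\u0300")  # macron, acute, caron, grave

# Pass 1: final consonant + tone -> tone + final consonant (tang1 -> ta1ng).
FINAL_CONSONANT = {c + t: t + c for c in ("ng", "n", "r") for t in TONES}

# Pass 2: vowel pair + tone -> first vowel + tone + second vowel (ao3 -> a3o).
VOWEL_PAIR = {p + t: p[0] + t + p[1]
              for p in ("ai", "ao", "ei", "ou") for t in TONES}

# Pass 3: vowel + tone -> accented vowel; "v" stands for u with diaeresis.
VOWEL_TONE = {
    v + t: unicodedata.normalize("NFC", ("\u00fc" if v == "v" else v) + mark)
    for v in "aeiouv"
    for t, mark in zip(TONES, MARKS)
}


def trace(syllable):
    """Return the syllable after each of the three replacement passes."""
    stages = [syllable]
    for table in (FINAL_CONSONANT, VOWEL_PAIR, VOWEL_TONE):
        for old, new in table.items():
            syllable = syllable.replace(old, new)
        stages.append(syllable)
    return stages


if __name__ == "__main__":
    for s in ("tang1", "xiao3", "hen3"):
        print(" -> ".join(trace(s)))
```

The first two passes only shuffle the digit leftward until it sits directly after the vowel that should carry the mark, so that the third pass can be a plain vowel+digit lookup. Because every pass is a blind substring replacement over the whole text, any non-pinyin text containing sequences such as "n2" or "a1" would be rewritten too.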
