How can I generate only Simplified Chinese text strings using the Text String Generation APIs?

Jan 12, 2010 at 6:30 AM

How can I generate only Simplified Chinese text strings using the Text String Generation APIs?

Thanks in advance!

Jan 13, 2010 at 3:10 AM

Text String Generation is implemented against the Unicode 5.2 Character Code Charts (http://www.unicode.org/charts/). It supports selecting characters by chart name (an enum type) or by Unicode range. Unfortunately, the Unicode Standard unifies CJK ideographs into a single script (Han), so you need to provide the specific range for CHS to generate a string that contains Simplified Chinese text only. On the other hand, scripts used by minority languages in China, such as Yi, Tibetan, and Tai Le, are defined as separate scripts.

The common block for Han is in the range U+4E00 to U+9FCB (to be safer, U+4E00 to U+9FAF, since some recent additions above U+9FAF are not really common). You may use the common block range if that suits your needs.

The current version of string generation supports only a single range. You may need to call the API multiple times with different range values and concatenate the resulting strings. We are planning to support multiple ranges in the future.

Regards,

Dennis

Jan 14, 2010 at 4:27 AM

Just to add some sample code:

// Exactly 20 code points drawn from the common Han range U+4E00 to U+9FAF.
StringProperties properties1 = new StringProperties();
properties1.MinNumberOfCodePoints = properties1.MaxNumberOfCodePoints = 20;
properties1.UnicodeRange = new UnicodeRange(0x4E00, 0x9FAF);
// The second argument (1234) is the seed.
string str1 = StringFactory.GenerateRandomString(properties1, 1234);

// Between 5 and 10 code points, with the range selected by chart name (Yi Syllables).
StringProperties properties2 = new StringProperties();
properties2.MinNumberOfCodePoints = 5;
properties2.MaxNumberOfCodePoints = 10;
properties2.UnicodeRange = new UnicodeRange(UnicodeChart.YiSyllables);
string str2 = StringFactory.GenerateRandomString(properties2, 1234);
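
The workaround mentioned earlier in the thread (one call per range, then concatenate) could be sketched roughly as follows. This is only an illustration: the MultiRangeStrings class and its GenerateFromRanges method are hypothetical names, the Microsoft.Test.Text namespace is an assumption, and only the StringProperties, UnicodeRange and StringFactory calls already shown in this thread are used.

```csharp
using System.Text;
using Microsoft.Test.Text; // assumed namespace of the string generation types

// Hypothetical helper, not part of the library.
public static class MultiRangeStrings
{
    // Generates codePointsPerRange code points from each range and
    // concatenates the pieces in the order the ranges are given.
    public static string GenerateFromRanges(int codePointsPerRange, int seed, params UnicodeRange[] ranges)
    {
        StringBuilder result = new StringBuilder();
        foreach (UnicodeRange range in ranges)
        {
            StringProperties properties = new StringProperties();
            properties.MinNumberOfCodePoints = properties.MaxNumberOfCodePoints = codePointsPerRange;
            properties.UnicodeRange = range;
            // One API call per range, since only a single range is supported per call.
            result.Append(StringFactory.GenerateRandomString(properties, seed));
        }
        return result.ToString();
    }
}
```

For example, MultiRangeStrings.GenerateFromRanges(10, 1234, new UnicodeRange(0x4E00, 0x9FAF), new UnicodeRange(UnicodeChart.YiSyllables)) would return 10 code points from the common Han range followed by 10 Yi syllables. The pieces stay grouped by range; shuffling them together would be a separate step.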

Jan 14, 2010 at 4:40 AM

Thank you very much!

 

lwfwind

Jan 14, 2010 at 4:57 AM

Hi dennisd,
     I have 25 languages (ARA, CHS, CHT, CSY, DAN, DEU, ELL, ENG, ESN, FIN, FRA, HEB, HUN, ITA, JPN, KOR, NLD, NOR, PLK, PSE, PTB, PTG, RUS, SVE, TRK, using 3-letter language abbreviations). I don't think it is convenient to look up the Unicode range for each language.

    Do you agree?

regards,

lwfwind

 

Jan 14, 2010 at 6:49 AM

Good question. Yes, I agree with you. Characters that belong to a specific language are not always placed in a contiguous range, especially as new characters and scripts are added to the Unicode Standard. I'll ask Laurentiu, who has contributed to the Unicode Standard, to elaborate on this. I actually had the same idea when I was working on this :). If you take a look at Group.Ids in UnicodeRangeDatabase.cs, you'll see that I attempted to map each script to specific languages. I used strings similar to LCIDs instead (http://msdn.microsoft.com/en-us/library/0h88fahh(VS.85).aspx). This part has not been reviewed yet. If I can get it approved for a future release, I may be able to generate strings according to language ID(s).
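
Until something like that ships, a caller could keep their own lookup table from language abbreviation to code point ranges. The sketch below is purely illustrative: the LanguageRanges class is hypothetical and is not the Group.Ids data in UnicodeRangeDatabase.cs, only a few of the 25 languages are filled in, and the ranges are whole Unicode blocks, which may include unassigned code points.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup table, not part of the library.
public static class LanguageRanges
{
    // Language abbreviation -> list of { start, end } code point pairs (inclusive).
    private static readonly Dictionary<string, int[][]> Table = new Dictionary<string, int[][]>
    {
        { "ELL", new[] { new[] { 0x0370, 0x03FF } } }, // Greek and Coptic block
        { "RUS", new[] { new[] { 0x0400, 0x04FF } } }, // Cyrillic block
        { "HEB", new[] { new[] { 0x0590, 0x05FF } } }, // Hebrew block
        { "KOR", new[] { new[] { 0xAC00, 0xD7A3 } } }, // Hangul syllables
        // Japanese needs several ranges: Hiragana, Katakana, and the common Han range from this thread.
        { "JPN", new[] { new[] { 0x3040, 0x309F }, new[] { 0x30A0, 0x30FF }, new[] { 0x4E00, 0x9FAF } } },
    };

    // Returns the ranges registered for a language, or throws if none are known.
    public static int[][] GetRanges(string language)
    {
        int[][] ranges;
        if (!Table.TryGetValue(language, out ranges))
        {
            throw new ArgumentException("No ranges registered for " + language, "language");
        }
        return ranges;
    }
}
```

Each { start, end } pair returned by GetRanges could then be passed to new UnicodeRange(start, end), with one generation call per pair, as the current single-range API requires.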

Thanks,

Dennis

Jan 14, 2010 at 7:05 AM

Great, I'm looking forward to it, and I believe the API will be useful for our testing. Thank you for your great work!

regards,

lwfwind