## History

19.06.2013
- re-training of several languages
- changed to data serialization independent of 32/64 bit machines
- tokenization and word unification improved
- ITA added

29.06.2013
- outsym=SAMPA phoneme mapping added

17.07.2013
- HUN update

18.07.2013
- outsym=MAUS phoneme mapping added

06.08.2013
- POL postprocessing update

06.09.2013
- extended feature set and output (POS, morphological) for DEU

09.09.2013
- re-training of all G2P modules after alignment improvement

12.09.2013
- re-training of all G2P modules after alignment improvement

13.09.2013
- now differentiating between +/- connected text for POS tagging (ngram-length=3 for connected text, =0 for word lists)

17.09.2013
- re-training of all G2P modules
- g2p_wrapper switched from 'oform=bpfs' to 'oform=bpf'

17.10.2013
- improved extended feature extraction for DEU
- all modules re-trained

24.10.2013
- NZE added

31.10.2013
- NZE retrained

07.11.2013
- Bayes classifier wordstress model included
- AUS added

11.11.2013
- aligned output now combinable with word stress and syllabification
- extended table output (+POS, morphological analyses) for DEU and ENG
- word stress assignment on stem level included

12.11.2013
- AUS re-trained without h-drop

13.11.2013
- output table separator changed to semicolon

06.12.2013
- AUS, NZE re-trained

18.12.2013
- letter to number conversion for all languages

19.12.2013
- AUS re-trained

23.12.2013
- C4.5 classifier for syllabification trained

24.12.2013
- Maori words removed from NZE training lexicon
- AUS, NZE, NLD re-trained
- all word stress models re-trained

31.01.2014
- word stress modules re-trained

05.03.2014
- extended feature set and output (POS, morphological) for ENG
- spelling transcription changed for MAUS embedding, since single letters more often refer to interruptions in spontaneous speech transcripts. This means e.g. DEU 'b' is transcribed as /be:/ if not MAUS-embedded, but as /b/ if embedded.

06.03.2014
- improvements in word stress prediction

13.03.2014
- spelling transcription corrections

08.04.2014
- NZE re-trained
- update of ENG morphology module

15.04.2014
- extended TextGrid (tgext) output supported

12.05.2014
- RON added

14.05.2014
- SQI added

16.05.2014
- FRA and SLK added

27.05.2014
- SLK assimilation processes implemented

07.06.2014
- HUN postprocessing update

12.06.2014
- SLK re-trained, postprocessing update

11.07.2014
- in addition to pronunciation dictionaries, G2P mapping tables are now supported

09.08.2014
- ENG POS tagger re-trained

29.09.2014
- KAT added (mapping table)

07.10.2014
- EKK added (mapping table)

16.10.2014
- ENG-US added
- Arpabet-Sampa/Maus mapping

31.10.2014
- EKK corrections

14.11.2014
- EKK 3 quantity levels modeling

07.12.2014
- FIN added (mapping table)

23.02.2015
- extended text normalization for DEU and ENG
- HUN postprocessing improved

11.05.2015
- RUS added

12.05.2015
- ITA update

14.05.2015
- ITA update

15.05.2015
- text normalization update

18.05.2015
- text normalization update

19.05.2015
- RUS update

20.05.2015
- POS taggers for DEU and ENG re-trained
- DEU, ENG re-trained

21.05.2015
- text normalization update

23.05.2015
- HAT added (mapping table)

08.06.2015
- text normalization update

11.06.2015
- text normalization update

15.06.2015
- text normalization update

16.06.2015
- RON update
- GSW added (mapping table)

03.07.2015
- RUS update

23.07.2015
- text normalization update

24.08.2015
- GSW update

30.09.2015
- RUS update

05.10.2015
- FRA input lexicon corrected and models retrained

17.11.2015
- RUS update

10.12.2015
- GSW update

14.12.2015
- GSW table update

27.01.2016
- HAT mapping table update

20.02.2016
- hard-coded addlex update

26.02.2016
- GSW mapping table update

27.03.2016
- GSW* variants added as mapping tables
- inventory mapping table update
- README update

01.03.2016
- GSW mapping table adjustments

22.04.2016
- SPA added as mapping table + hard-coded word stress model

26.05.2016
- short error message

09.06.2016
- POL-MAUS mappings added

13.08.2016
- CAT added as mapping table

16.08.2016
- update of GSW* mapping tables

26.10.2016: v1.50
- retrained POL G2P on larger dictionary
- new POL phonemes x_j, g_j, k_j, p_j
- words are by default split at hyphens and each compound part is processed separately
- for nrm=yes, this hyphen-split is obligatory (to correctly pronounce acronyms encoded as U-S-A by the normalizer)
- 'embed=maus' now comes with "oform=bpfs", i.e. phonemes are blank-separated in the partitur file

24.11.2016: v1.51
- .par input: no ascii-utf8 conversion for annotation symbols in <> brackets, e.g. <"ah>
- '-com yes' flag now needs to be set separately from '-embed maus'
- hun-HU: overgeneralizations in regressive assimilation fixed

09.12.2016: v1.52
- Maltese mlt-MT added as mapping table

16.12.2016: v1.53
- fra-FR single letter mapped to glottal stop

18.01.2017: v1.54
- mlt-MT mapping table update

20.01.2017: v1.55
- eng-US improved for colloquial speech

30.01.2017: v1.56
- bugfix for option combination com=yes and nrm=yes: comments are now also output if nrm=yes
- com=yes: comments now detected next to punctuation also if not blank-separated
- fix in English text normalization (context-dependent expansion of 'am')
- KAN output: instead of empty transcription for output formats bpf(s)

15.02.2017: v1.57
- eus-ES added as mapping table
- README/help function update
- hun-HU G2P changed from decision tree classifier to mapping table
- hun-HU: double vowels are maintained, consonant geminate fix
- spa-ES: mapping table update

12.04.2017: v1.58
- eus-FR added as mapping table

13.04.2017: v1.59
- eus-ES, eus-FR mapping table update

24.04.2017: v1.60
- bugfix: allowing -nrm yes for TCF and TextGrid input
- number expansion for TCF
- Remarks: text normalization affects neither the original tokenization in TCF input nor the segmentation in TextGrid input (regardless of whether it is a word or a word sequence segmentation)
- bugfix: XML compliant output for TCF
  - stress and/or palatalization markers ' replaced by `&apos;`
  - comment brackets <> replaced by `&lt;` and `&gt;`
- bugfix: option -com yes for TextGrid input and BAS partiture output now yields correct TRN rows also for <>-annotations

12.05.2017: v1.61
- '-embed maus' without '-com yes' no longer leaves BAS partiture file specific annotations (e.g. , , ) unchanged, but treats them as text to be transcribed. All <*> annotations are kept only in combination with '-com yes'
- deu: /O6/ is now split like all other /6/ diphthongs for '-embed maus'
- nld: wrong /a/ phoneme removed
- eng-AU, cat-ES, gsw-CH-ZH, rus-RU: additional phoneme mapping to MAUS inventory for '-embed maus'

30.05.2017: v1.62
- new option 'limit' added to limit the number of input words. Default: exp(200)

30.06.2017: v1.63
- Romanian training data orthography corrected (ŞşŢţ replaced by ȘșȚț) and G2P re-trained
- bugfix: optional par header lines are no longer moved to the body in the output

16.07.2017: v1.64
- language 'und' (undetermined) added that requires a user-defined mapping file; remark: user-defined mapping cannot be combined with syllabification, since any phoneme inventory can be used, for which sonority is unknown. However, if X-Sampa is used, the pho2syl service can syllabify the bpf output with the lng parameter set to 'und'.
- option 'imap' added, by which users can provide their own mapping file
- option 'lowercase' (yes|no) added. It is 'yes' for all languages but 'und'. For 'und' the user can specify whether upper- and lowercase letters should be distinguished in the mapping table. Again the default is 'lowercase=yes', assuming that the mapping table consists of lowercase letters only and that the input text should be converted accordingly before the mapping.

18.07.2017: v1.65
- table-based conversion now also supports phoneme-to-phoneme mappings
- new German letter ẞ can be processed

19.07.2017: v1.66
- table-based conversion can now also deal with P2P without preceding G2P

03.08.2017: v1.67
- aus-AU added as mapping table
- bugfix: SAO line kept in par header
- bugfix for bpf input (class 4 tiers) and tg output: the first interval was previously overwritten by a pause if it did not start at time 0. Now it is kept together with the preceding pause

28.08.2017: v1.68
- jpn-JP added as mapping table
- jpn tokenization powered by Atilika Inc http://atilika.com/

29.08.2017: v1.69
- bugfix in embedding of jpn tokenizer

18.09.2017: v1.70
- rus: mapping table adjusted to SAMPA inventory (/h tS/ removed)

21.09.2017: v1.71
- jpn: all /N NN/ replaced by /N\ NN\/ in mapping table

04.10.2017: v1.72
- bugfix of word-POS alignment error caused by multiple <*>-comments in a row (yielded errors for oform=ext* or featset=extended)

09.10.2017: v1.73
- BAS partiture input: header elements are now identified only by their position between LHD and LBD. That implies: files without headers are processed. Files with erroneous headers that do not contain an LBD element are not processed.

07.11.2017: v1.74
- nor-NO added (cannot distinguish between accent 1 and 2)

17.11.2017: v1.75
- guf-AU and gup-AU added as mapping tables

23.11.2017: v1.76
- nor-NO: non-official sampa symbols retroflex s and E removed from output (mapped to s and e, respectively)
- gup-AU, aus-AU: new mapping w;w added

08.12.2017: v1.77
- spa-ES mapping table update (removed phoneme: 4, new phonemes: S, ts)

19.01.2018: v1.78
- spa-ES syllabification improvement (by improved sonority ordering)
- spa-ES mapping table update to correctly transcribe words of 'Merengue' type
- standard digit-word mapping now also applied for bpf input format

07.02.2018: v1.79
- spa-ES changes in sonority ordering to improve syllabification

12.11.2018: v1.80
- deu-LU added

25.11.2018: v1.81
- deu-LU renamed to ltz-LU, improving ltz-LU, new phoneme d_0 for article d'

10.12.2018: v1.82
- swe-SE added
- new option -verb: verbosity level. If set to 0, warnings are not displayed. Default: 1

13.12.2018: v1.83
- swe-SE u0 mapped to u_0 for embed=maus or outsym=maus-sampa

02.01.2019: v1.84
- '#' in input is no longer treated as a comment marker but removed without expansion

13.01.2019: v1.85
- modified embedding of the external Japanese tokenizer kuromoji-0.7.7 to work with Java 10

20.02.2019: v1.86
- sqi-SQ: /e/, /I/, and /o/ replaced by /E/, /i/, and /O/, respectively
- minor word stress corrections
- models re-trained after lexicon corrections

22.03.2019: v1.87
- German number expansion: "einhundertnull" bug fixed.
- Text normalization: quotation treatment next to punctuation, e.g. " 'quote'. " fixed (previously the right quotation mark was kept as an apostrophe).

28.04.2019: v1.88
- Afrikaans afr-ZA added

14.06.2019: v1.89
- Hungarian syllabification improved (max_onset_length=1 heuristics)
- bugfix: elisions in user-defined g2p mapping file no longer output "1"

08.08.2019: v1.90
- Georgian (kat) mapping table correction: ტ;t_>

06.10.2019: v1.91
- Thai (tha) added using the Python tltk package

14.10.2019: v1.92
- Thai (tha): corrected sampa/ipa mapping; expansion of reduplication tokens

08.12.2019: v1.93
- Thai (tha): added 1 to all tone indices
- tone output preserved also for embed=maus

13.12.2019: v1.94
- Czech (cze) added as mapping table

14.12.2019: v1.95
- Luxembourgish (ltz): /2/ in output mapped to /2:/

15.12.2019: v1.96
- missing tha-sampa ipa mappings added

26.12.2019: v1.97
- transitive sonority assignment to sampa-maus mappings
- re-ordering of Thai sonority table

27.12.2019: v1.98
- adjustments and removal of SAMPA-MAUS mappings for syllable nucleus consonants

10.01.2019: v1.99
- ltz, fra, ita: clitic articles now treated as separate tokens

23.01.2019: v1.100
- tha: digit to numeral conversion integrated

02.02.2019: v1.101
- tha: keep-comment option enabled

09.02.2019: v1.102
- cze: updated mapping table

12.02.2019: v1.103
- tha: tone 8 support

21.03.2020: v1.104
- tokenizer can now cope with narrow quotation marks, i.e. tokens like "nechjis'o'eri'i" are no longer split into two parts

28.03.2020: v1.105
- Icelandic isl-IS added

13.04.2020: v1.106
- g2p map: "+" is no longer considered a phoneme separator but can itself be defined as (part of) a phoneme symbol

25.04.2020: v1.107
- ltz: syllabification and word stress model trained

01.05.2020: v1.108
- ltz: syllabification and word stress model update
- embed maus now also allows for stress and syllable boundary assignment

09.05.2020: v1.109
- eng-AU: spelled "o" now mapped to /@U/ and "r" to /6:/. New phonemes added /6: {I/
- ltz: added word stress and syllable markers to g2p_addlex entries

17.05.2020: v1.110
- ltz: sonority table update
- word stress: stress probability of syllables with reduced vowel set to 0 (exception: ltz, ron)

18.05.2020: v1.111
- acronym spelling can now be overwritten by user-defined exception dictionary

26.05.2020: v1.112
- tha: call of tha_g2p.py updated

14.07.2020: v1.113
- underscore directly attached to words now treated as punctuation, eng "well" pronunciation fixed.

15.07.2020: v1.114
- eng "no" pronunciation fixed.

18.09.2020: v1.115
- text normalization bugfix: an ellipsis "..." no longer erroneously triggers the "mixed" word type, which was expanded into spelled-out letters.

22.09.2020: v1.116
- jpn-JP: geminate mapping update for embed=maus (/pp_j/ becomes /p_jp_j/, etc.)

15.01.2020: v1.117
- rus-RU: letter o now mapped to /o/

17.02.2020: v1.118
- sqi-SQ: for embed=maus /D\/ is now mapped to /d_j/

16.04.2021: v1.119
- fas-IR: added as mapping table

17.07.2021: v1.120
- fra-FR: syllabification improved (preventing stop consonant clusters at syllable onsets)

05.10.2021: v1.121
- und (undefined): increased support for mapping of non-letter characters (". , $ ' 1", etc.). Behavior: text is tokenized at whitespace. Punctuation marks and digits are no longer treated differently from letters, i.e. if such symbols are defined in the mapping table, they will be mapped according to these definitions. If they are not defined, they do not appear in the phonetic transcription, following the NULL-mapping logic for undefined strings. Special treatment of strings is still supported.

26.10.2021: v1.122
- arb: Modern Standard Arabic added. Internally it is treated as language "undefined" with a pre-stored imap table. Functionality is therefore limited: only whitespace tokenization, no syllabification, no word stress.

29.10.2021: v1.123
- und/arb: TextGrid item tokenization (triggered by certain romanization characters) switched off.

30.10.2021: v1.124
- und: TextGrid empty items ignored

03.03.2022: v1.125
- aus: G2P, SYL, and wordstress model retraining after dictionary corrections

11.01.2023: v1.126
- tha: bugfix; now first trying g2p_tha.py call from PuFileIO:ideo_tha_chunk()

23.01.2023: v1.127
- PuNumbersMultiling.dig2str_ml() now returns the number if conversion is not defined for the input language. Previously, an error was thrown.

30.05.2023: v1.128
- nor: retroflex s for is now transcribed as /S/ instead of /s/

29.06.2023: v1.129
- g2p based on mapping table can now be used in combination with exception dictionary (all dictionaries were ignored before)

20.06.2024: v1.130
- g2p_addlex.txt updated

## Training data

# G2P

Language;Source;Size (K)
afr-ZA;NCHLT-inlang pronunciation dictionaries;15.5
aus-AU;table;--
cat-ES;table;--
cze-CZ;table;--
deu-DE;Phonolex Core;62.7
eng-GB;Celex;52.2
eng-AU;UNISYN_1_3 (corrected by mq.edu.au);119.4
eng-NZ;UNISYN_1_3;114.6
eng-US;cmudict;123.5
eus-ES;table;--
eus-FR;table;--
fas-IR;table;--
fin-FI;table (Festival);--
fra-FR;espeak;398.3
gsw-CH;table;--
gsw-CH-SG;table;--
gsw-CH-GR;table;--
gsw-CH-ZH;table;--
gsw-CH-BE;table;--
gsw-CH-BS;table;--
guf-AU;table;--
gup-AU;table;--
hat-HT;table;--
hun-HU;-1.56: Kornai corpus;14.5
hun-HU;1.57-: mapping table;--
isl-IS;Pronunciation Dictionary for Icelandic (https://clarin.is/en/resources/prondict/);182.5
ita-IT;Festival;41.0
kat-GE;table;--
ltz-LU;lexicon provided by Peter Gilles;307
mlt-MT;table;--
nld-NL;Celex;118.0
nor-NO;https://github.com/stts-se/wikispeech-lexdata;639.3
pol-PL;CLARIN-PL-STUDIO;63.5
ron-RO;espeak;150.5
rus-RU;table;--
slk-SK;espeak;181.1
spa-ES;table;--
sqi-SQ;espeak;507.6
swe-SE;Leksikalsk database for svensk;36036.0
tha-TH;tltk package;--

# Part of speech

Language;Source;Size (K)
deu-*;ECIMC1;382.4
eng-*;PENN;1200.9