Check out the file HISTORY.ITERATIVE for the development history of the iterative MAUS technique.
Check the development status table at the end of this document for the current development status of the individual supported languages.
Also check out PARAM./README for details about the relationship between phoneme inventory filters, rule set and HMM corpus.

HISTORY

05.03.03 : Re-engineering the MAUS software: maus produces exactly one segmentation from a NIST file. See disclaimers in USAGE.

06.03.03 : Tested the possibility of tee-words in HTK. A 'tee-word' is a word that allows the jump from the virtual first node of the first HMM to the virtual last node of the last model. This would be helpful to insert 'dummy' words into the lattice that occur in the output but do not consume any frames. In our case this would be the '#' model indicating a word boundary.
Modified the '#' model to be a tee-model; HVite issues a warning about a tee-word but does the alignment all right (at least for the test sentence!). The rec file then contains segments with duration zero indicating the word boundary. If MINPAUSLEN in maus is set to 1, maus even detects small pauses of 1 frame length.
Verified the results on one test sentence -> ok
Used a larger sentence -> ok
mau2TextGrid converts the MAU tier into a praat TextGrid; built into maus as an option -> ok
Version 1.0 maus

07.03.03 : maus integrated in MkVMPar -> ok
Test on VM1.1 -> Error in DICT : 'word' 'i' had HMM 'i' which does not exist. Interestingly, this error occurs only if the lattice contains the 'word' 'i'; that means that HVite does not parse the entire dictionary on start-up.
Fixed DICT and tested the error turns -> ok
Test of V2.1 -> some errors
Test of VM15.1 -> handling of other codings and other languages missing; Version 1.1 should do that
Test of VM15.1 -> ok

17.03.03 : Version 1.3 maus : Silence intervals shorter than 3 frames between words are deleted because they are not perceivable.
If the following phoneme is a plosive, the silence is added to the plosive; if not, the silence is spread equally onto both segments.
CLEAN is constrained to its own files, that is, the process will not delete any files other than those it has created in $TEMP. However, it might happen that two instances of maus are working on equally named files. Therefore the semaphore check is still in action.
Test of all German VM volumes on linux35:/scratch/PARTEST -> ok
Lots of bug fixes -> ok

18.03.03 : Version 1.4 maus : New option PARAM allows to select a parameter set (default is $SOURCE/PARAM with the statistical rule set for German) -> ok
New parameter set PARAM.MAN with phonological rule set -> ok

19.03.03 : Verified that the pause handling works -> ok
Set the minimum length of pauses between words to 50 msec.

14.08.03 : /y/ inserted as allowed symbol in KANINV and GRAPHINV. DICT maps /y/ to /y:/. Reason: New conventions for the canonical pronunciation also allow /y/; therefore it occurs in the lexicon.

21.08.03 : 1.6 maus : Added possibility to use WAV signal files; the input file must have the extension 'wav' or 'WAV'. This is simply done by sox: converting WAV into SPH (NIST).

22.08.03 : 1.7 maus : Added option 'CANONLY=yes' that causes the script to perform a simple forced alignment to the signal without using the MAUS method.
Added options WEIGHT and INSPROB, which are essentially the HVite options -s and -p passed through. Since we do not know yet which values might be optimal, we stick to the theory and set the defaults to s: 1.0 and p: 0.0.

12.09.03 : 1.8 maus : Optimized the parameters WEIGHT and INSPROB to 7.0 and 0.0 respectively (see comments in maus for details)

15.09.03 : 1.9 maus : Added option allowresamp=yes. If set to 'yes', maus will try to re-sample input signals that are not 16 kHz using polyphase of sox.

09.12.03 : 1.11 maus : Bug fix. In the TextGrid output all interval indices were set to '1'.
Strangely enough this error did not show up when loading the TextGrid into praat ...

20.01.04 : 1.0 maus.corpus -> ok

21.01.04 : 1.12 maus : Bug fix in maus. When an unknown coding was found in the signal file, the CLEAN semaphore was not removed from the cache. -> ok
Changed cache handling: all temporary files written by maus are prefixed by the process id of maus, and at the end all files with that id are removed if option CLEAN is set to 1. The semaphore is then no longer necessary -> ok

22.01.04 : 1.13 maus : changed help message output. To get a help message simply type in 'maus'.
1.0 maus.corpus : changed help message output. To get a help message simply type in 'maus.corpus'.

23.01.04 : 1.14 maus : maus may now also process BPF files that have no KAN tier but have an ORT tier. This works only if 'create_kan' and 'mk_pron' are installed.

29.01.04 : 1.15 maus : MMF can now be defined from the command line and does not need to be in dir PARAM any more

30.01.04 : /a:~/ inserted as allowed symbol in KANINV and GRAPHINV. DICT maps /a:~/ to /a:/. Reason: New conventions for Standard German Pronunciation allow /a:~/; therefore it may show up in BPF files or lexica.

06.04.04 : 1.17 maus, kan2mlf.awk : New options STARTWORD and ENDWORD allow to select only a portion of the input BPF file.

08.04.04 : 1.18 maus : parameter set KANINVENTAR, GRAPHINVENTAR and DICT extended by 'foreign' phonemes that might appear in German when foreign words are uttered. These phonemes are mapped to their nearest German symbols for HMM modelling but passed as is to the segmentation output. Therefore a /T/ in the input will be modelled internally by /s/ but shows up in the output as label /T/ again.
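The handling of 'foreign' phonemes described in 1.18 can be illustrated with a minimal Python sketch. It is not the code used in maus (the real mapping lives in the parameter files and awk scripts); the table FOREIGN_TO_INTERNAL and both function names are hypothetical, only the /T/ -> /s/ pair is taken from the entry above, and the sketch assumes a plain forced alignment with exactly one segment per input symbol.

```python
# Hypothetical excerpt of the foreign-phoneme table; only /T/ -> /s/ comes from the text.
FOREIGN_TO_INTERNAL = {"T": "s"}


def to_internal(symbols):
    """Map input symbols to the symbols used for HMM modelling."""
    return [FOREIGN_TO_INTERNAL.get(sym, sym) for sym in symbols]


def relabel_output(input_symbols, aligned_segments):
    """Put the original input labels back onto the aligned segments.

    Assumes one aligned segment per input symbol (plain forced alignment),
    so segments and input symbols can be paired by position.
    """
    if len(input_symbols) != len(aligned_segments):
        raise ValueError("segment count does not match symbol count")
    return [
        (begin, end, original)
        for original, (begin, end, _internal) in zip(input_symbols, aligned_segments)
    ]


symbols = ["T", "I", "N"]
internal = to_internal(symbols)  # labels seen by the aligner
# Stand-in for an alignment result: (begin, end, internal label) per segment.
aligned = [(0, 800, internal[0]), (800, 1400, internal[1]), (1400, 2200, internal[2])]
output = relabel_output(symbols, aligned)  # /T/ shows up in the output again
```

With pronunciation variants (deleted or inserted segments) the positional pairing no longer holds, so a real implementation has to track the mapping per segment.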
26.08.04 : 1.19 maus : Since the histograms over segment boundary deviations show a rather distinctive shift of the MAUS boundaries by 10 msec towards an earlier position (that is: the MAUS boundaries are 10 msec too early), we introduce a new option MAUSSHIFT (default: 10) that shifts the MAUS boundaries by the given number of msec. (This involves changes in maus.iter, maus.corpus, maus and rec2mau.awk)
Also changed rec2mau.awk so that plosives are recognized by their first char only. That way plosive labels like /k_s/ are treated correctly if there is a preceding inter-word silence that will be spread. (/k_s/ denotes the silence interval of a /k/ plosive)

12.07.06 : 1.20 adapted word_var-2.0 to the current Linux distribution SuSE 9.0. Sources and necessary libraries are in ./word_var
To compile a new binary issue:
    make word_var
    make install
Note that this is still a dynamically linked binary.

19.07.06 : 1.21 adapted word_var to be a statically linked binary.
    cd ./word_var.src
    make word_var
    make install

17.04.07 : 1.22 added subdir 'ipkclib' containing the header ipkclib.h and the library libipkclib.a for compilation on operating systems other than Linux. Added some hints about that in the documentation files.

08.06.08 : 1.24 added option INSORTTIER=no
If set to 'yes', option OUTFORMAT is set to 'TextGrid' and the input is read from a BPF, maus will try to identify either an ORT tier or, if that fails, a KAN tier (must be there as input!) and write an additional interval section into the TextGrid file containing the word segmentation based on the underlying MAUS segmentation. The tier is called either 'ORT:' or 'KAN:' respectively; it contains unlabelled segments where MAUS labelled a silence interval, and for each word a segment labelled with either the orthography or the canonical transcript. If set to 'no' the regular TextGrid output with one interval section is produced.
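The word tier described in 1.24 can be derived from the phonetic segmentation roughly as in the following Python sketch. It is an illustration only, not the awk code shipped with maus. Assumptions: each MAU segment is given as a tuple (start, duration, word index, label) with word index -1 for silence, loosely modelled on the BPF MAU tier; a segment is taken to end at start + duration; the function name word_tier and the sample data are hypothetical.

```python
def word_tier(mau_segments, words):
    """Collapse phonetic segments into word intervals.

    mau_segments: list of (start, duration, word_index, label), in temporal order;
                  word_index -1 marks a silence interval (assumed convention).
    words:        dict mapping word_index to the word label (orthography or
                  canonical transcript).
    Returns a list of (begin, end, text); silence intervals get an empty label.
    """
    intervals = []  # entries are (begin, end, text, word_index)
    for start, duration, word_index, _label in mau_segments:
        end = start + duration
        same_word = (
            intervals
            and word_index >= 0
            and intervals[-1][3] == word_index
        )
        if same_word:
            # Extend the interval of the current word up to this segment's end.
            begin, _old_end, text, idx = intervals[-1]
            intervals[-1] = (begin, end, text, idx)
        else:
            text = "" if word_index < 0 else words[word_index]
            intervals.append((start, end, text, word_index))
    return [(begin, end, text) for begin, end, text, _idx in intervals]


# Hypothetical example data: silence, two words, silence.
mau = [
    (0, 100, -1, "<p:>"),
    (100, 50, 0, "h"),
    (150, 60, 0, "a"),
    (210, 40, 1, "d"),
    (250, 30, -1, "<p:>"),
]
tier = word_tier(mau, {0: "ha", 1: "d"})
```

Because every word interval starts and ends on a boundary of the underlying phonetic segments, the word tier and the MAU tier stay aligned.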
09.06.08 : 1.25 added option USETRN=no
If set to 'yes' maus will search the input BPF for a TRN tier that segments the utterance within the recording. maus will cut out the segment and run the MAUS segmentation only within this cut-out segment. Afterwards the offset and final cut-off are re-calculated into the final mau or TextGrid file.
1.26 added option INSKANTEXTGRID=no
See INSORTTEXTGRID respectively. If both options are set to 'yes', first the orthographic tier, then the canonical transcript tier are exported to the TextGrid file. If the source BPF file does not contain an ORT tier, two canonical transcript tiers are exported.

12.06.08 : 1.27 : changed behaviour of MINPAUSLEN: If both adjacent segments of a deleted inter-word silence are plosives, the deleted interval is spread equally to both plosives (before that only the word-initial plosive was enlarged). If the word-final segment is a plosive, the deleted interval is added totally to that final plosive (before, the interval was spread equally to word-final plosive and word-initial non-phoneme)

28.06.08 : 1.3 maus.corpus : If the option OUTDIR is set to '#APPEND#' the script will insert the resulting mau tier into the source BPFs. This requires maus version 2.0 or higher.

02.07.08 : 2.0 : re-engineered the handling of times and sampling rates: The overall behaviour of maus is the same as before with one important exception: the timing information in the temporary or final mau output files is not based on the model sampling rate any more, but is always based on the sampling rate of the input signal file. Therefore scripts that post-process the mau output and use input signals with a sampling rate other than the HMM sampling rate must be adapted to this new behaviour.
Furthermore the HMM sampling rate is no longer fixed but is read from the HCopy config file PRECONFIGNIST; that way other sampling rates can be used in the HMM and maus will automatically adapt to that (the user still has to take care that the config file PRECONFIGNIST matches the HMMs used!)
The new version makes it easier to include mau output directly into source BPF files that are not sampled at the HMM sampling rate, as is done for instance in maus.corpus if you use the option OUTDIR=#APPEND#.

23.07.08 : 2.3 maus : Some bug fixes and the option RULESET added

05.08.08 : 1.5 maus.corpus : changed option BPFDIR from REQ to OPT. If BPFDIR is empty the script will search for the BPF in the same location as the signal file.

13.08.08 : 2.4 maus : fixed some problems with the location of intermediate signal files: now they are all in the TEMP area and will be cleaned up after maus. Before that you might have found '..._trim.nis' or '..._resamp.nis' files in the location of the input signal file after running maus.
Fixed a small problem in mau2TextGridORT.awk : if the KAN tier of the input BPF contained secondary lexical stress markers ("), a corrupt TextGrid was created by maus. Now these markers are simply deleted, until I find out how to use '"' within a praat label

13.08.08 : 1.6 maus.corpus : introduced option CREATETRN=no|yes|force

01.11.08 : 2.5 maus : introduced options INSORTTEXTGRID INSKANTEXTGRID

25.02.09 : 2.6 maus : Bug fix : version 2.5 did not work with other PARAM sets; 'dummy.rul' was default set to 'dummy'
Inserted a fixed locale LANG = en_US.UTF-8 because scripts called by maus will produce output with floating point numbers formatted with a comma instead of a dot if the locale of the environment is for instance de_DE.

02.03.09 : 2.7 maus : An optional word-internal silence interval '' is now allowed in the input. To force MAUS to model a silence interval the symbol '<' should be used.
The symbols '#', '&' and '' all model silence intervals that can be of zero length or can be deleted if shorter than a threshold defined by option MINPAUSLEN.

08.03.09 : 1.7 maus.corpus : fixed minor bug : if option CREATETRN=yes and one single BPF in the corpus already had a TRN tier, then the option was deactivated for the rest of the corpus. Now the rest will be checked as before.

11.03.09 : 2.8 maus : the mapping scripts kan2mlf.awk (which maps the canonical input phoneme string to the MAUS internal phoneme set) and rec2mau.awk (which maps the internal phoneme set back to the input phoneme set) depend on the sets stored in the PARAM dir. For consistency, the scripts have therefore been moved to the PARAM set of files and can be adapted individually to different sets.
To summarize: If a new language set PARAM.LANGUAGE is defined, do the following:
- copy the standard German set PARAM to PARAM.LANGUAGE
- within the new set adapt the following files:
    KANINVENTAR (the set of phonemes used in the canonical input and MAUS output)
    kan2mlf.awk (the mapping script from KANINVENTAR to GRAPHINVENTAR)
    rec2mau.awk (the reverse mapping)
- in case that you add/change/delete models in the HMM set you'll also need to adapt the following files:
    GRAPHINVENTAR (the set of MAUS internal phonemes)
    HMMINVENTAR (the set of used HMMs)
    DICT (the mapping from GRAPHINVENTAR to HMMINVENTAR)
    MMF.mmf (the HTK HMM set that matches HMMINVENTAR)
A good example of such a new language set is PARAM.HUNGARIAN.

16.03.09 : 2.9 maus : Audio input ALAW raw 8kHz (extensions al, AL, dea, DEA) allowed now. ALAW samples are converted to PCM/16kHz.

30.03.09 : 2.10 maus : TextGrid files created with MAUS missed the line 'item []:' in the header. Although praat didn't seem to notice, other programs like Emu need this redundant entry for some reason. The new maus version does create this header entry.
21.04.09 : 2.11 maus : praat has a bug that causes boundaries to be dysfunctional (they can't be moved, for instance) if the float numbers for segment end and following segment begin are not exactly the same. We changed the TextGrid export so that end and begin are always exactly the same number.

03.06.09 : PARAM.HUNGARIAN : added virtual model 'geminate /t/' modelled by /t/

28.07.09 : Added provisional support for English PARAM.ENGLISH (see README there)
Warning: maus will issue warnings if used with this param set and the German rule set (default), but these warnings can be ignored.
Re-structured and extended EXAMPLES dir according to supported languages

29.07.09 : 2.12 maus : Re-worked provisional support for Hungarian

05.08.09 : 2.13 maus : added output into Emu format files: OUTFORMAT=emu

07.12.09 : 1.9 maus.corpus : bug fix when maus.corpus was called with a file list SLIST=.. that contained no path information; improved security for temporary files handled by different instances of maus.corpus on the same host; removed a 'bug' that caused maus.corpus to write the resulting files into the location of BPFDIR=... instead of the location of the signal files if OUTDIR=... was not set.

22.01.10 : 2.14 maus : bug fix: when called without BPF but with USETRN=yes an error was issued. Now the option USETRN=yes is ignored in that case.
2.14 maus, 1.10 maus.corpus : changed the method to define the rule set; the default link RULESET.rul in PARAM was removed. The default rule set is now the statistical rule set rml-0.95.rul.

01.07.10 : 2.15 maus : a call with options INSORTTEXTGRID or INSKANTEXTGRID and without option BPF was not handled gracefully

13.07.10 : 2.16 maus : in rare cases the temporal boundaries in a TextGrid result file may differ slightly between the MAU and the ORT|KAN tiers. This does not bother praat, but it does bother other programs that for instance read the TextGrid to build a hierarchy (e.g. Emu). This behaviour was fixed in this version.
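One way to guarantee exactly matching boundaries, as required by 2.11 and 2.16, is to compute each boundary time only once and reuse the same float as the end of one interval and the begin of the next. The following Python sketch shows this idea under stated assumptions; it is not the export code of maus, and the function name textgrid_intervals and the (start sample, length in samples, label) input layout are hypothetical.

```python
def textgrid_intervals(segments, sample_rate):
    """Convert consecutive segments to (xmin, xmax, label) intervals.

    segments:    list of (start_sample, length_samples, label) in temporal order
                 (assumed layout).
    sample_rate: sampling rate of the signal in Hz.

    The begin of every interval after the first is copied from the end of the
    preceding interval, so both numbers are bit-identical even if the input
    segments overlap or leave a gap of a sample.
    """
    intervals = []
    previous_end = None
    for start, length, label in segments:
        xmin = start / sample_rate if previous_end is None else previous_end
        xmax = (start + length) / sample_rate
        intervals.append((xmin, xmax, label))
        previous_end = xmax
    return intervals


# Hypothetical input with a one-sample overlap and a one-sample gap.
example = textgrid_intervals(
    [(0, 1601, "<p:>"), (1600, 799, "a"), (2400, 1000, "b")],
    sample_rate=16000,
)
```

Applying the same function to the word tier only helps if its boundaries are taken from the same already computed numbers; recomputing them separately is what can produce the slight differences mentioned in 2.16.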
27.10.10 : 2.17 maus : moved ITALIAN from BETA to RELEASED

20.11.10 : 2.18 maus : fixed some minor bugs in the export to Emu files; forced all outputs to be strictly consecutive segments. Although this is not necessary according to the BPF, we found that many import routines require this.

24.11.10 : 2.19 maus : fixed some problems when creating Emu output within maus.corpus: if option OUTFORMAT=emu is selected, maus will create 2 files *.hlb and *.phonetic with the same name as the signal file, either in the dir of a given file name in option OUT or, if OUT is not given, in the dir of the signal file. Note that the actual filename in OUT will be discarded.

15.03.11 : 2.20 maus : fixed minor bug: when USETRN=yes and KANSTR!="" an error occurred because the script looks for a BPF to read the TRN

26.04.11 : 2.21 maus : maus issues a warning if Emu files are produced containing non-Emu-conform SAM-PA labels such as '{'.

08.07.11 : added simple script 'txt2par' to create BPF input files for the usage in maus.corpus. Simply provide TXT files with the same name as the sound files, with one word per line, the orthography in the first column and the transcript (SAM-PA) in the second column.

15.12.11 : 2.22 maus : technical changes that do not change the functionality: all references to developer paths removed.

20.12.11 : 2.23 maus : technical changes that do not change the functionality: when called with CANONLY=yes, the script does not call word_var-2.0 but uses the simple HVite aligner instead; bug fix in par2emu: temporary files were not deleted on error exit; input BPF is filtered for '\r' (DOS files); sox resampling re-formulated so that no automatic dithering takes place (the dithering took place if input signals had a higher sampling rate than 16000Hz; this caused a small amount of white noise to be added to the signal, which in turn caused maus to produce randomly fluctuating segmentation results.)
added functionality check CHECK/maus.check

24.01.12 : 2.24 maus : changed default setting for allowresamp to 'yes'

27.01.12 : 2.25 maus : changed behaviour of options PARAM and RULESET: if the option has no path before the file name, the script not only checks in the local directory but also in the SOURCE directory for the given file. This way for instance the language can be changed by simply setting PARAM=PARAM.HUNGARIAN wherever maus is called.

23.02.12 : 2.26 maus : added option PRINTINV

28.02.12 : 2.27 maus : added an additional error report to stdout and a definite error exit code 1 for the case that the sub-script kan2mlf issues an error (before 2.27 only an error message was printed to stderr, but the main script continued.)

05.03.12 : 2.28 maus : added a better error message for a missing rule file (not distributed!)

08.03.12 : 2.29 maus : added webservice options to the documentation, so that users of the webservice based help function can associate the options.

21.03.12 : 1.13 maus.corpus : option PARAM is searched for in SOURCE, if not found (same behaviour as in maus!)

10.04.12 : 2.30 maus : added option value OUTFORMAT=mau-append

12.04.12 : 2.31 maus : added option value OUTFORMAT=EMU

29.05.12 : 2.32 maus German : the symbol /Q/ will still be accepted as a glottal stop, but so will /?/ (SAM-PA), and maus will now always produce a /?/ in the output regardless of whether the input was /Q/ or /?/.

25.06.12 : 2.33 maus : added Dutch language (German rule set, German HMM)
bug fix: caused by an HTK bug, the last segment could in rare cases have a negative length.
26.06.12 : 2.34 maus : fixed some inconsistencies in the PARAM dirs: now every language should per default use the best suited rule set, with the exception of PORTUGUESE.BETA which should be used with the option RULESET=regeln9.nrul

05.07.12 : 1.14 maus.corpus : bug fix: if the option OUTDIR was set to a different location than the location of the signal files, result files already present in the location of the signal file were overwritten (because maus writes into that location by default).

06.07.12 : 1.12 maus.iter : multiple bug fixes related to problems when phoneme symbols contain curly brackets '{}'. It still does not work with backslashes in the phoneme symbol name, e.g. 'r\'. We do not have a solution for such phoneme sets yet. The only case where this has happened until now was Australian English, where we map the (only) /r/ allophone /r\/ to /r/ and do not map it back in the output!

11.10.12 : 2.35 maus : added language Australian English (rule set and HMM trained on a subset (5421 samples) of AUSTALK)

18.10.12 : 2.36 maus, 1.15 maus.corpus : added option LANGUAGE=iso639 that overrides PARAM

25.10.12 : 2.37 maus : added option '--version', improved some help texts

07.11.12 : 2.38 maus : stricter handling of boolean parameters: a boolean option such as USETRN=yes can only handle the following values: '0,1,yes,no,true,false' and their capitalized variants (e.g. 'TRUE'). All other values cause an error exit 1.

08.11.12 : 2.39 maus : added special language mode LANGUAGE='sampa' which allows the language independent segmentation of arbitrary inputs coded in SAM-PA.

16.11.12 : 2.40 maus : added CLIPS trained rule sets and HMMs for Italian

19.11.12 : 2.41 maus : bug fix: if the rule set allows the complete deletion of a word, this was only represented correctly in the BPF (mau) output. TextGrid is now supported in such a way that the deleted word does not appear in the ORT and KAN tiers any more.
emu and EMU output are also supported: the word still appears in the word/cano tier but owns no segment in the phonetic tier.

09.01.13 : 2.42 maus : LaTeX Umlauts in the BPF input tier ORT are transcoded to UTF-8 in the TextGrid output (the former coding was ISO8859). The reason for this is that praat cannot handle LaTeX Umlaut encoding in label names gracefully. Emu or mau outputs are not transcoded, that is e.g. an ISO8859 or LaTeX encoded input is passed as such to the Emu output files.
The options STARTWORD and ENDWORD do not work properly with Emu output; therefore an error message is issued from this version on if these options are selected together with OUTFORMAT=emu|EMU.
maus checks the command line for unknown options and terminates with an ERROR message if it finds one.
If OUTFORMAT is set to emu|EMU and the input BPF does not contain a SAM entry, maus adds the sampling rate of the signal file to the input BPF.

09.01.13 : 1.0 maus.trn : a simple script to exemplify the combined usage of the maus options START/ENDWORD and USETRN: given an input BPF with a chunk segmentation coded in TRN entries (see format definition BPF TRN), this script repeatedly calls maus with partial segmentations within a chunk of the input signal, and concatenates the results into the input BPF. Works only with mau output.

11.01.13 : 1.16 maus.corpus : checks the command line for unknown options and terminates with an ERROR message if it finds one.

15.01.13 : 2.43 maus : option OUTFORMAT=emu|EMU : if a BPF file named like the input signal file is in the location of the signal file, but is not the intended BPF input, maus transformed this file into the Emu result output instead of the intended BPF input + newly created MAU tier. This is a very rare case, but if it happens, from this version on a warning is issued and the BPF in the location of the signal file is overwritten.
25.01.13 : 2.44 maus : bugs in Dutch parameter set : glottal stop in input caused error, misleading warnings about dysfunctional rules -> rules removed.

28.01.13 : 2.45 maus : introduced language specific default options. These are stored in a CSH script called 'DEFAULTS' in the parameter dir (e.g. PARAM.ITALIAN). Language specific defaults are used if no option is given on the command line. If no language specific default is given in the DEFAULTS file, the global default defined in the maus script is used (e.g. WEIGHT = 7.0).

06.02.13 : 2.46 maus : Fixed bug in SAMPA inventory: some geminates were not defined

08.02.13 : 1.2 maus.web : introducing a new wrapper to the package that replaces the locally installed maus script by calling the new CLARIN WebMAUS service instead. By using 'maus.web' instead of 'maus' you can use the maus package without any local installation (and the hassle that comes with that; see INSTALL in this directory). Simply replace the 'maus' calls in all scripts by 'maus.web'. maus.web validates on the standard benchmark (see CHECK/...). Use CHECK/maus.check.web to verify that on your computer.

14.03.13 : 2.47 maus : WAV input files with bit resolution other than 16 and more than 1 channel are automatically converted to 16bit, mono.

18.03.13 : all scripts : replaced '$?' by '$status' and 'gawk' by 'awk' to be compatible with different Linux installations (e.g. Ubuntu)

18.03.13 : 2.48 maus : introduced option USETRN=force: a pre-segmentation to cut off leading and trailing silence is done with the helper wav2trn; if the helper is not installed a WARNING will be issued and the script proceeds without pre-segmentation.

21.03.13 : 2.49 maus : adapted English parameter set to Australian English set (former: cloned German set). Parameter set SAMPA: restricted HMM set source to languages that have trained HMM; set up complete benchmark for all SAMPA symbols.
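The lookup order for option values introduced with the language specific defaults in 2.45 (command line first, then the DEFAULTS file of the parameter set, then the global default in maus) can be sketched in Python as follows. This is an illustration only: maus itself is a shell script and reads DEFAULTS as a CSH script; the function name resolve_option and the dictionaries are hypothetical, and only the global default WEIGHT = 7.0 is taken from the entry above.

```python
# Global defaults as defined in the main script; WEIGHT = 7.0 is the example
# given in the 2.45 entry, everything else here is hypothetical.
GLOBAL_DEFAULTS = {"WEIGHT": "7.0"}


def resolve_option(name, command_line, language_defaults, global_defaults=None):
    """Return the effective value of an option.

    Precedence: command line > language specific DEFAULTS > global default.
    """
    if global_defaults is None:
        global_defaults = GLOBAL_DEFAULTS
    for source in (command_line, language_defaults, global_defaults):
        if name in source:
            return source[name]
    raise KeyError("no value defined for option " + name)


# Hypothetical language specific default and command line value.
from_global = resolve_option("WEIGHT", {}, {})
from_language = resolve_option("WEIGHT", {}, {"WEIGHT": "5.0"})
from_command_line = resolve_option("WEIGHT", {"WEIGHT": "1.0"}, {"WEIGHT": "5.0"})
```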
22.03.13 : 2.50 maus : introduced chunk segmentation: if the TRN tier of the input BPF contains a chunk segmentation (as defined in the BPF format), maus will recognize this and perform a chunk segmentation using the helper maus.trn. This works only with OUTFORMAT=mau-append, that is, the results are overwritten in a MAU tier of the input BPF.

25.03.13 : 2.51 maus : added helper par2TextGrid, a general tool to transform BPF (MAU,[SAM,ORT,KAN]) into a 1-3 layer TextGrid file. Chunk segmentation mode extended to all output formats.
1.3 maus.trn : extended output formats to mau|TextGrid|emu|EMU to make chunk segmentation mode fully compatible with maus. Some restrictions still apply:
- overlapping chunks cannot be processed for TextGrid|emu|EMU output because these formats do not allow segments with negative time.
- emu|EMU output requires that the tiers KAN and TRN in the input BPF are matched; other formats tolerate a partial TRN (covering only a subset of the KAN tier).

11.04.13 : 2.53 maus : the script checks whether the loaded rule file is a dummy file (named 'dummy.rul'), which indicates that no valid rule set exists for the selected language. If the option CANONLY=false, i.e. the script should use a rule set, a WARNING message is issued to prevent involuntary usage of a dummy rule set.

03.06.13 : 2.54 maus : added special Hungarian SAM-PA symbols /J-/ and /c/ to SAM-PA parameter set.

03.07.13 : 2.55 maus :
- Removed arbitrary inter-word silence model /&/ from MAUS inventories, because it interferes with the SAMPA vowel /&/.
- Changed HMM names in ITALIAN of sub-phonemic segments '*cl' and '*rl' to '*_cl' and '*_rl' to conform with KANINVENTAR. This has no effect on normal operation but simplifies the automatic generation of the language independent set SAMPA.
- Reduced experimental SAMPA set ESTONIAN to Wells definition of Estonian SAM-PA + extra diphthongs + extra French/English sounds.
- Removed SAMPA symbols from KANINVENTAR that had the diacritic for nasalisation BEFORE lengthening, e.g. /a~:/; from now on only the following order of diacritics will be supported: lengthening (:) -> nasalisation (~) -> palatalisation (_j) -> aspiration (_h). E.g. /a:~_h/ would be allowed, but not /a:_h~/, /a~_h:/ etc.
- Reworked the SAMPA language set completely: the set should now cover all known SAMPA symbols derived from Wells SAMPA page, German and English wikipedia. The full definition of basic SAMPA symbols is in PARAM.SAMPA/SAMPA.dia; language specific extensions are defined in SAMPA.dia (e.g. diacritics). SUPERHMM.* still defines all trained HMM that MAUS knows about. SAMPA.map maps SAMPA symbols that have no trained HMM to existing HMMs. The script mk_set creates the complete set anew (e.g. after adding HMMs to SUPERHMM)
- created a parallel UTF-8 table KANINVENTAR.inv in each language parameter set that describes the used SAMPA symbols of that language in more detail; this table is output if the option PRINTINV=true.
- created a list of plosives that are handled specifically at word boundaries (see script PARAM./rec2mau.awk). At the moment all these language specific lists are linked to the list in PARAM.SAMPA/PLOSIVES. So, if a plosive is added to the latter, all languages (if they use this plosive) will treat it specially at word boundaries.

08.07.13 : 2.56 maus : added language POLISH (iso639-3: pol): SAMPA set as defined by Wells 1996; cloned HMM from German, Italian and Australian models; no rules. Completed IPA column (3) in KANINVENTAR.inv set descriptions.

09.07.13 : 2.57 maus : added new option OUTIPA (boolean); if set, maus will use UTF-8 IPA symbols in all segmental output tiers instead of SAM-PA.

15.07.13 : 2.58 maus : Missing vowel /1/ in SAMPA set; added SAMPA symbols /s`,z`,g_j,p_j,x_j,ts`,dz`/ to Polish SAMPA set.

18.07.13 : 2.59 maus : Added symbols /ddz,ddz_cl,ddz_rl/ to SAMPA set and to Italian set.
2.60 maus : Added lots of symbols to Hungarian set to satisfy different users.

21.08.13 : 2.61 maus : non-human noise model was missing in some languages - fixed.

03.09.13 : 2.62 maus : Added SAM-PA symbols /d_j/, /i~/ and /u~/ to LANGUAGE sampa set.

04.09.13 : 2.63 maus : maus.learn may create insertion rules that double phonetic symbols and that are not handled gracefully by word_var; inserted a warning in maus.learn and removed such nonsense rules from the Italian rule sets.

09.09.13 : 2.64 maus : KAN tier output in TextGrids contained only the first SAMPA symbol if used with LANGUAGE=sampa (SAMPA symbols are separated by a blank in the KAN tier!) (the same error happened with multiple label entries in the ORT tier) -> fixed.

20.09.13 : 2.65 maus : changed the WEIGHT option for LANGUAGE=eng to 1.0 (the same as LANGUAGE=aus) so that eng and aus deliver exactly the same results (this makes sense since PARAM.ENGLISH is identical to PARAM.AUSTRALIAN for now). Adapted maus.web to the new JSON format returned by the webservices.

22.10.13 : 2.66 maus, 1.6 maus.trn : Bug fix: helper maus.trn did not pass through errors detected by its helper maus; an error in the segmentation of a single chunk was therefore only reported to stdout, but the exit code of maus remained 0, although this resulted in a corrupt output file.

15.11.13 : 2.67 maus : added 6 more diphthongs to AUSTRALIANENGLISH set

28.11.13 : 1.9 maus.web : changed to XML results format

05.12.13 : 1.17 maus.corpus : added option MMF to overwrite usage of default HMM set in $PARAM/MMF.mmf (we need that for maus.iter!)
2.68 maus : added language New Zealand English (LANGUAGE=nze)

19.12.13 : changed default value for option WEIGHT in LANGUAGE=aus|nze|eng from 1.0 to 5.0 because users report excessive application of unlikely rules.

23.12.13 : 2.69 maus : added a new (non-default) rule set rml.AUSTRALIAN.20131223.rul for better consistency with the g2p -lng aus method by U. Reichel which is used in WebMAUS Basic.
Basically the canonical pronunciation for the rule learning algorithm is now produced by the g2p method instead of being manually encoded. Whether this is a better way remains to be seen; at the moment the quality of the g2p method is rather poor, probably because of the UNILEX input. The rule training can be repeated almost automatically in case the g2p method improves in the future or we get more transcribed material. I noticed a lot of 'certainty rules', i.e. rules that are always observed for a given context in the data (which indicates that the phonological encoding deviates from the phonetic encoding, e.g. /V/ is always realized as [6]), and some 'reverse phonological rules', i.e. rules that describe a reversed reduction process, e.g. /Sn/>/S@n/ (which indicates that the g2p output is too phonetic). Nevertheless the usage of this (non-default) rule set might improve results when using maus in conjunction with g2p on Australian English data, for instance in WebMAUS Basic. To apply this optional rule set use the option RULESET=rml.AUSTRALIAN.20131223.rul.

22.01.14 : extended helper tool par2TextGrid so that 'shared phonemes' are possible (= a phonetic segment that is assigned to two words), and to handle arbitrary phonetic tiers, not just MAU tiers.

06.03.14 : maus.trn 1.7 : bug fix: with OUTFORMAT=TextGrid the SAMPLERATE was determined incorrectly from the signal file if it was not given by a SAM entry in the input BPF.

24.03.14 : maus 2.70 : changed HVite option FORCEOUT=F to prevent partial results; HVite now exits with 1 and the error message 'no tokens survived to final node' is displayed.

24.04.14 : maus 2.71 : added simple check for the KAN tier having at least 3 columns

14.05.14 : maus 2.72 : added the following symbols to the SAMPA phoneme set: NN ww Q: I: U: Y: required by Swiss German.
02.06.14 : maus 2.73 : added option NOINITIALFINALSILENCE=no : if set to 'yes', maus will suppress the automatic modelling of initial/final silence intervals

27.06.14 : maus 2.74 : added language Georgian (kat): basic alignment

24.07.14 : maus 2.75 : bug fix: if option ENDWORD was set to 0 for languages eng,aus,nze,por,pol,nld the script erroneously tried to set ENDWORD to 999999.

04.08.14 : maus 2.76 : bug fix: in rare cases the resulting segmentation was not exactly consecutive, which caused praat to mistreat TextGrid output produced by maus. Now all output formats should always produce exactly matching segmental boundaries.

27.08.14 : maus.corpus 1.18 : added option OUTIPA to pass on to maus; removed default setting of MMF, because it interferes with the setting of LANGUAGE: if MMF is not set as an option on the command line, the maus script will set it to the correct HMM set depending on the setting of either PARAM or LANGUAGE (LANGUAGE overwrites PARAM!); if MMF is set on the command line, maus will use that MMF, ignoring the LANGUAGE or PARAM setting.

05.09.14 : maus 2.77 : echo of command line call is now restricted to verbose level > 0 (before 2.77 the echo was independent of verbose level)

05.09.14 : maus.trn 1.18 : echo of command line call is now restricted to verbose level > 0 (before 1.18 the echo was independent of verbose level)

06.10.14 : maus 2.78 : editing in comments; added option value OUTFORMAT=legacyEMU identical to OUTFORMAT=EMU.

06.10.14 : maus 2.79 : extended Hungarian phoneme set by /zz,ZZ,NN,FF,xx,dd_j,xx_j/

09.10.14 : maus 2.80 : added language AMERICANENGLISH (eng-US) with HMM training and pronunciation rule set training based on AUSTALK.
added rfc5646 language codes to option LANGUAGE (old iso639 pseudo codes including 'sampa' are retained for backward compatibility until further notice).
added option INFORMAT=bpf|bpf-sampa to replace iso639 pseudo code 'sampa'.
10.10.14 : maus 2.81 : replaced eng-US rule set by set trained on TIMIT.
15.12.14 : maus 2.82 : trained HMM Estonian on BABEL/PhED (HMMLEARN/ESTONIAN); trained Estonian rule sets on PhED, part SKK0; fixed some errors in conjunction with OUTFORMAT=legacyEMU in maus.web, maus.trn and CHECK scripts; currently maus and maus.trn still support EMU as well as legacyEMU; maus.web only supports legacyEMU.
23.12.14 : maus 2.83 : bug fix in default rule set for Estonian; it is still unclear whether this bug may reappear in other input contexts; this might require some more tweaking...
08.01.15 : maus 2.84 : added provisional LANGUAGE=fin-FI; cloned HMM; no rule set; since no SAM-PA is defined for Finnish, we use the festival SAM-PA set.
02.02.15 : maus 2.85 : added error reporting if sox signal file conversion to 16bit PCM, 1 channel fails.
06.02.15 : maus 2.86 : added option MODUS = 'standard'; if set to 'bigram', maus runs a free phone recognition without BPF input on the signal. For this a phone bigram lattice (option LATBIGRAM) and a compatible mapping table from the symbols used in the bigram to the HMMs in the HMM set (option DICTBIGRAM) must be present (defaults are PARAM./DICT.bigram LAT.bigram). OUTFORMAT is restricted to mau and TextGrid (tier MAU only!). Note that WEIGHT influences the impact of the bigram.
09.02.15 : maus 2.88 : deprecated option CANONLY=true; now implemented as MODUS=align; for backwards compatibility CANONLY=true still works (if MODUS is not set), but a warning is issued. Option MODUS=bigram overrules CANONLY (as before).
23.02.15 : maus 2.90 : Trained Hungarian HMM (45) on the BEA corpus (min 100 instances per class); the remaining 21 models stay cloned models from the German HMM set. (Also replaced former cloned HMM for Hungarian in SUPERHHH set). Trained a rule set for Hungarian on BEA corpus fragment (approx.
16000 annotated words), prune=20, smoothing (3200 rules). The rules probably reflect mainly systematic differences between the phonological coding (G2P output) and the BEA transcription rules, e.g. G2P often predicts 'd_j' but BEA consistently uses 'J-' in segmentation etc. To do: discuss systematic differences and possibly improve G2P for Hungarian, then run rule set training again on BEA.
25.02.15 : maus 2.91 : Cleaning up silence modelling: since G2P now allows passing of <...> in the transcription, it is possible to insert <nib> and <usb> directly into the txt input where noise or human noise should be enforced. Since in many languages was modelled as optional HMM (t-model) we harmonize the usage of the silence HMM < as non-optional and # as optional silence model for all languages:
           Fixed bug in AMERICANENGLISH : the optional inter-word silence model '#' was not modelled by an optional HMM '' (T-model). Now '#' is truly optional.
           Fixed bug in DUTCH,ESTONIAN,FINNISH,GEORGIAN : only the optional silence model was applied even for 'real' silence intervals '', '<' and '>'.
           Fixed bug in NEWZEALANDENGLISH,POLISH,PORTUGUESE.EUROPE,SAMPA,SPANISH : was optional T-model, now a real silence model.
02.03.15 : maus 2.92 : Technical change: moved rec2mau.awk from PARAM to SOURCE, since it does not need any language specific programming any more.
03.03.15 : maus 2.93 : Removed symbols N and J- from HUNGARIAN phoneme set; fixed buggy phone alignment in BEA corpus; re-training of HMM and pronunciation model HUNGARIAN
09.03.15 : maus 2.94 : Trained Georgian HMM on corpus provided by Zakharia Pourtskhvanidze; 28 phonemes trained, 24 symbols cloned (mostly needed for foreign words)
11.03.15 : maus 2.95 : Changed default rule set for eng-AU to the newer set of Dec 2013 (phonology derived by G2P), since the former rule set of Oct 2012 contained some very strange rules that are probably caused by a faulty pronunciation dictionary for eng-AU we used at that time.
Added MINNI service for ita-IT, eng-US, eng-AU
12.03.15 : maus 2.96 : Added MINNI service for hun-HU, ekk-EE
16.03.15 : maus 2.97 : Bug fix eng-* : the rule set used for eng-GB, eng-AU and eng-NZ (= default rule set of eng-AU) caused a severe internal error in the program word_var when a rule was applied whose context probabilities add up to 1.0 or more. This is a general weakness caused by the fact that MAUS uses only a float mantissa of 7 digits when calculating log probs or probabilities. The same effect might be observed in other languages as well (very rare though). To fix this problem, the algorithm to learn the rules, maus.learn, now subtracts a DISCOUNT value of 0.000001 from each conditional probability log(P(...)), to make sure that the sum of probs is always less than 1.0. The default rule set for eng-AU (and other eng-* that point to it) is now trained on the AUSTALK corpus (95 speakers, 59 sentences each), with pruning set to 20, no smoothing, and unlikely rules removed manually.
17.03.15 : maus 2.98 : Added new option OUTSYMBOL=sampa|ipa|manner|place to map phonetic symbols in output (default: SAM-PA) to IPA (UTF-8) or IPA manner (vowel, plosive etc.) or IPA place of articulation (bilabial, dental, etc.). The mapping is derived from tables PARAM.SAMPA/SAMPA.inv and SAMPA.dia from columns 3 (ipa), 7 (manner) and 8 (place). Note that applying OUTSYMBOL!=sampa causes non-standard output in combination with OUTFORMAT=mau|mau-append (BPF output), since BPF tier MAUS is only defined for SAM-PA. Deprecated option OUTIPA (still functional, but superseded by OUTSYMBOL!=sampa).
18.03.15 : maus 2.99 : re-calculated German phonotactic bigram model (MODUS=bigram) using DARPA backoff bigram language modelling with default discounting (HTK HLStats). The former bigram model was based on a non-discounted, non-backoff bigram model, which caused a large proportion of bigrams to be effectively impossible (prob = 0).
The corrected bigram is now produced with exactly the same parameters and methods as in the other languages. Bug fix: in MODUS=bigram numerical SAM-PA symbols in output had a leading 'P' - fixed
23.03.15 : maus 2.100 : added option value OUTFORMAT=par|PAR as aliases to option value OUTFORMAT=mau-append; this is merely done because most users are not familiar with the BPF tier concept.
           added option value OUTFORMAT=csv : this is equivalent to option value 'mau' (default), but the default output file name gets the extension 'csv' instead of 'mau'; this should ease the use of simple table output of maus in spreadsheet software.
10.04.15 : maus 2.101 : extended phoneme set of Finnish by /d/ and /d:/. added converter mausbpf2emuR from MAUS output BPF (OUTFORMAT=par) *.par to emu DB *_annot.json file. added wrapper mausbpfDB2emuRDB to create complete emu DB from MAUS output BPF collection.
22.04.15 : maus 2.102 : bug fix: in some places 'cvs' instead of 'csv' was coded.
24.04.15 : maus 2.103 : in mausbpf2emuDB incompatible level names (to legacyEMU) were used, fixed.
24.04.15 : maus 2.104 : added HMMs for eng-GB; PARAM.ENGLISH (which was a fake pointing to AUSTRALIANENGLISH) is now obsolete; new is PARAM.BRITISHENGLISH
27.04.15 : maus 2.105 : language specific options (defined in PARAM./DEFAULTS) are read if the option value is 'default' or the empty string. Set global value for WEIGHT to 1.0 (was 7.0).
29.04.15 : maus.trn 1.10 : check for missing KAN/TRN tier, negative times or negative word numbers in TRN tier before starting processing and issue proper error messages
06.05.15 : maus 2.106 : eng-GB : new rule set trained on AIX-MARSEC corpus with prune=10 and nosmooth
13.05.15 : maus.trn 1.11 : added pre-test to check TRN entries for impossibly short chunks, i.e. chunks that contain more phonemes than fit into the speech signal assuming that each phoneme has a minimum duration of 20msec. In that case maus.trn throws an error before starting the segmentation.
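The pre-test described for maus.trn 1.11 can be sketched as follows. This is only a minimal Python illustration of the stated criterion, not the actual maus.trn code; the function name, its arguments and the 16 kHz default sample rate are assumptions made for the example.

```python
def chunk_is_impossibly_short(n_phonemes, chunk_samples,
                              sample_rate=16000, min_phone_ms=20):
    """Return True if n_phonemes cannot fit into a chunk of chunk_samples.

    Hypothetical helper: assumes every phoneme needs at least
    min_phone_ms milliseconds, as stated in the maus.trn 1.11 entry.
    """
    # Chunk duration in milliseconds.
    chunk_ms = 1000.0 * chunk_samples / sample_rate
    # The chunk is impossible if the minimum total phoneme duration
    # exceeds the available signal duration.
    return n_phonemes * min_phone_ms > chunk_ms
```

With these assumptions a chunk of one second holds at most 50 phonemes; a caller would report an error for such a chunk before starting the segmentation.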
08.06.15 : maus 2.107 : fixed a very rare bug: if the signal is really bad, the Viterbi aligner may skip an entire word if the word is composed of just one phoneme. This leads to a gap in the word order of the output BPF which is formally ok, but most tools (including par2TextGrid) expect a consecutive order of word numbers. Hence the TextGrid output might be wrong in such a case. Fixed by changing par2TextGrid to calculate the number of words from the segments and not from the maximum link number.
23.06.15 : maus 2.108 : added language Swiss German (gsw-CH); HMM partially (40/84) trained on ETH Zuerich corpora (thanks Volker Dellwo); missing phonemes cloned from other languages; no pronunciation model.
26.06.15 : maus 2.109 : added MODUS=bigram (MINNI) to LANGUAGE gsw-CH based on phonetic segmentations in TEVOID etc corpora of ETH Zuerich.
01.07.15 : maus 2.110 : set default RULESET for LANGUAGE gsw-CH to a phonological rule set SwissGerman.nrul that reflects possible effects caused by Swiss German dialects other than Zuerich; since the rule set has no probabilities, all variants have the same probability (experimental)
07.07.15 : maus 2.111 : Bug fix in LANGUAGE=kat-GE : input files *.par with /ts_>/ or /tS_>/ caused an empty result due to a mismatch in the Georgian PARAM set (DICT)
13.07.15 : maus 2.112 : follow-up bug to version 2.107 : there was another bug in one of the AWK helpers of par2TextGrid, causing par2TextGrid to stop at a word that has no phoneme assigned -> fixed
22.07.15 : maus 2.113 : extended gsw-CH HMM set by a few additional trainable symbols and a virtual symbol /kx/; updated MODUS=bigram service as well.
01.09.15 : maus 2.114 : Swiss German (Dieth): deleted pronunciation rule *-e-# > *-@-# because this is now the default pronunciation produced by the Dieth variant of G2P.
11.09.15 : maus 2.115 : Bug fix - when using chunk segmentation mode (USETRN=true with more than one TRN tier in input BPF) and option NOINITIALFINALSILENCE=true, the resulting segmentation was corrupt in all OUTFORMATS - fixed.
02.10.15 : maus 2.116 : added language support Russian rus-RU (thanks to Daniil Kocharov and Alexander Belyy)
19.10.15 : maus 2.117 : added language support French fra-FR (thanks to Nina Pörner & Uwe Reichel)
15.12.15 : maus 2.118 : gsw-CH added pronunciation rules -{-r>-E-r and -{:-r>-E:-r (thanks to Hanna Ruch, University of Zurich)
18.12.15 : maus.corpus 1.19 : bug fix: USETRN=force was not passed to the maus script and caused an error
           maus 2.119 : KAN tier may contain optional white spaces in regular languages (before: only required in LANGUAGE=sampa); this allows users to use KAN strings in input BPF that were created with separated phonemic symbols. Test phase only for LANGUAGE=deu.
12.01.16 : maus 2.120 : bug fix in ITALIAN : due to an unsorted phoneme inventory in GRAPHINVENTAR the rule application was buggy; instead of using a replacement rule such as #,s,a>#,ts,a where a word-initial /s/ is replaced by the affricate /ts/, a /t/ was inserted before /s/.
13.01.16 : maus 2.121 : extended optional white spaces (see 2.119) to all languages. MODUS=bigram support (MINNI) for LANGUAGE=fra-FR.
22.01.16 : maus.trn 1.13 : changed error message for impossibly short chunk: now the starting sample of the chunk is reported.
29.01.16 : maus.trn 1.14 : changed minimum estimated duration per phone HMM to 30msec for the impossibly short chunk check. This causes fewer HVite errors where no result is calculated because the signal does not fit into the pronunciation model (which is an awkward error message!)
18.02.16 : maus.trn 1.15 : bug fix: if option OUTFORMAT=par|mau-append and option OUT= was set to the input BPF (effectively the same as leaving OUT empty), the input BPF was incomplete and contained only the MAU tier -> fixed
29.02.16 : maus 2.122 : added virtual symbol /{:u/ to SAMPA language set and gsw-CH language set
02.03.16 : maus 2.123 : added virtual symbols /A:/ /Ai/ and clone /6.deu/ to gsw-CH language set
03.03.16 : maus 2.124 : added LANGUAGE option values gsw-CH-BE ... gsw-CH-ZH, all pointing to PARAM.SWISSGERMAN
03.03.16 : maus 2.125 : bug fix: option PRINTINV did not work for LANGUAGE=gsw-CH*
25.04.16 : maus 2.126 : extended par2TextGrid helper for handling syllabic tiers (such as MAS) created by webservice Pho2Syl
28.04.16 : maus 2.127 : introduced new OUTFORMAT=emuR producing an Emu compatible *_annot.json file
29.04.16 : maus 2.128 : bug fix in MODUS=bigram: leading/trailing segments '!ENTER'/'!EXIT' are now correctly labelled as ''.
27.07.16 : maus 2.129 : changed deprecated options sox -s -2 into sox -e signed-integer -b 16 to avoid sox warnings
03.08.16 : maus 2.130 : due to a very nasty bug in the UNIX job control the command 'cut' cannot be used reliably in scripts called in parallel (as we do on the webMAUS server). To avoid these problems all usage of 'cut' is replaced by 'awk' in the maus script and all helper scripts
04.08.16 : maus 2.131 : follow-up to 2.130 : set interpreter from /bin/csh to /bin/tcsh because we found that in Ubuntu the bug does not appear in the tcsh, only in the csh (?)
31.08.16 : maus 2.132 : helper mausbpfDB2emuRDB creates emuDB in directory named _emuDB instead of ; the ZIP file name remains the same .zip
01.09.16 : maus 2.133 : pol-PL replaced cloned HMM set by (partially) trained HMM to CLARIN-PL-STUDIO corpus (thanks to Danijel Korzinek); added pol-PL MINNI support. added eng-UK MINNI support.
19.09.16 : maus 2.134 : added SAMPA symbol /pS_j/ as clone of /tS_j/ to Russian SAMPA set, added MINNI service for language rus-RU
22.09.16 : maus 2.135 : bug fix : when running with USETRN=force (pre-segmentation enforced) and with input signals that in fact have energy to the very last sample, the script issued a misleading warning from the sox trim operation that had no effect on the (valid) output -> misleading warning removed.
10.10.16 : maus 2.136 : added option RELAXMINDUR=false; when set, maus relaxes the minimum duration per segment to 10msec for short/lax vowels and consonants, and to 20msec for other vowels and diphthongs; note that this modus is operational and often leads to impossibly short vowel and glottal segments; however for investigations that target a certain consonant class only, setting this option might prevent the ceiling effect in the measured duration distribution at 30msec.
           added option BPFTHRESHOLD=10000; if a BPF input file contains more KAN: lines than this threshold, the script exits with an error message, because it is unlikely that the script will return a reasonable result in a manageable time (caused by the quadratic increase of processing time with length).
           added option GETBPFTHRESHOLD=FALSE; if set, the script will return a single number BPFTHRESHOLD to stdout.
25.10.16 : maus 2.137 : set BPFTHRESHOLD=3000 after consultations with Nina Pörner.
27.10.16 : maus 2.138 : added BPF=file.csv input; file.csv is a two-column, ';'-separated spreadsheet CSV table with UTF-8 orthography in the 1st and pronunciation encoding in the 2nd column; extensions other than par|PAR|csv|CSV are not accepted any more.
07.11.16 : maus 2.139 : the pre-validation on BPFTHRESHOLD (see 2.136) prevented large BPF input files with chunk segmentation from being processed (since the *total* number of words in KAN was validated).
We changed this so that each chunk is pre-validated individually: when USETRN=true and at least one chunk in the BPF input file has more than BPFTHRESHOLD words, an ERROR is thrown.
10.11.16 : maus 2.140 : helper mausbpfDB2emuRDB extended to accept *_annot.json instead of *.par as input; this allows building an emuDB based on already created *_annot.json files.
22.11.16 : maus 2.141 : changed PARAM dirs naming and structure: a language specific parameter dir is now named 'PARAM.rfc5646' (e.g. 'PARAM.eng-AU') or 'PARAM.iso639-3' (e.g. 'PARAM.eng'); the latter are usually just copies of a rfc5646 directory (e.g. 'eng' is a copy of 'eng-GB'). LANGUAGE codes 'aus' and 'nze' are not supported any longer; 'sampa' is used for the language independent mode.
01.12.16 : maus 2.142 : language pol-PL: re-built HMMs and statistical model (LAT) for MODUS=bigram processing, because the training corpus has been improved. Trained rule set from CLARIN-PL Studio corpus: since the corpus was transcribed half-automatically, it is not quite clear whether the learned rules really model processes in Polish or rather systematic differences between the phonological form produced by G2P and the way the corpus has been segmented; however the rule sets look quite reasonable. Setting the default rule set to POLISH.smooth.prune20.rul (531 rules derived from 136 basic rules with minimum occurrence of 20); other available rule sets are: POLISH.smooth.prune5.rul POLISH.nosmooth.prune5.rul POLISH.nosmooth.prune20.rul POLISH.smooth.prune50.rul POLISH.nosmooth.prune50.rul
           chunker 0.1 : new service 'Chunker' added to the MAUS package (thanks to Nina Poerner); the tool is called by the command 'chunker'; the software, benchmarks and data reside in the subdir 'Chunker'. Added alias 'emuDB' for OUTFORMAT=emuR.
13.12.16 : maus 2.143 : bug fix: maus reported an error when a word was modelled by a single '' in the BPF input, KAN tier.
In fact since version 2.90 all languages model '' not as a T-model any more, so a single '' is allowed. What is not allowed is a word modelled by a single '#' or '&' model, since these are skippable (optional) T-models of silence which can only be used within words.
14.12.16 : chunker 0.2 : bug fixes: the word-based recognition did not use the trained bigram but rather a uni-gram model, which led to very poor chunk segmentations e.g. in French; the signal was not re-sampled to 16kHz before ASR (leading to slightly decreased ASR rates); the energy feature of the HTK ASR frontend was not normalized (leading to very bad ASR rates on weak signals); added new method based on a factor automaton ('force', experimental).
15.12.16 : maus 2.144 : added LANGUAGE Maltese support; only forced alignment using cloned HMMs; SAM-PA set defined by Ruben van de Vijver.
16.12.16 : maus 2.145 : experimental feature (LANGUAGE=deu-DE only!): unknown tags '<...>' are modeled by non-optional silence; this allows passing arbitrary tags to the ORT/KAN tier output.
           chunker 0.3 : bug fix: input and output file can be the same.
           chunker 0.4 : bug fix: error codes were not passed through (always 0)
22.12.16 : maus.trn 1.17 : bug fix: could not process into OUTFORMAT=emuR|emuDB (*_annot.json). removed BPFTHRESHOLD=9999999999 in maus.trn internal MAUS calls.
03.01.17 : maus 3.0 : major upgrade
  * moved this file and other documentation into sub-dir 'DOCU'.
  * bug fix fin-FI : the SAMPA symbol /d:/ was wrongly defined as IPA /b:/; HMM set (link to SUPERHMM.mmf) missed HMMs that were used in the DICT mapping (D.use).
  * added language Catalan cat-ES
  * new design rules regarding HMM sets and phonemic/phonetic symbols for all languages
    1. KANINVENTAR (allowed SAM-PA symbols in KAN input) *must* contain the symbols '' '' '' '<' '>'; if a language requires SAMPA /P/ (labiodental approximant) use the alternate symbol /v\/.
    2.
GRAPHINVENTAR (symbol set for internal processing) *must* contain all symbols of KANINVENTAR and the symbol '#'; symbols with leading numerals *must* be masked with 'P' (e.g. /P6/, /P2:I/); symbols with trailing '\' *must* be replaced by symbols with trailing '-' (e.g. /r\/ -> /r-/); GRAPHINVENTAR *must not* contain the symbol 'P'.
    3. DICT (mapping from symbols to HMM) *must* contain the mappings # # # < < > > or > < (in case we have only one non-optional silence HMM, see point 4). All symbols in 1st column *must* match GRAPHINVENTAR; all symbols in 2nd column *must* match HMMINVENTAR (and therefore MMF.mmf). HMM names (2nd column) can be chosen arbitrarily (e.g. a:.deu-AT)
    4. MMF.mmf (and HMMINVENTAR) *must* contain
       - an optional silence HMM (t-model) named '#'
       - two non-optional silence HMMs named '<' and '>', or one non-optional silence HMM named '<'
    The helper scripts kan2mlf.awk and rec2mau.awk are responsible for the mapping from KANINVENTAR (input) to GRAPHINVENTAR, and for the mapping from HMMINVENTAR to phonetic output. The helper script check_param_sets can be used to check all language sets for accordance with these rules.
  * added the possibility to use backslash in input symbols (e.g. /h\/). Up to now MAUS did not accept backslash, and languages that require the usage of X-SAMPA symbols (e.g. /J\/) were only accepted as /J-/, requiring for instance G2P to map these symbols for MAUS input. This caused changes to individual languages:
    eng-US now accepts /h\/ (X-SAMPA) instead of /h-/ (MAUS internal symbol) (both!)
    hun-HU now accepts /J\/ (X-SAMPA) instead of /J-/ (MAUS internal symbol) (both!)
    eng-NZ now accepts /r/ and /r\/ as input (modelled by the same acoustic NZE model, though)
    eng-AU now accepts /r/, /R/ and /r\/ as input (modelled by the same acoustic AE model, though)
    in analogy all X-SAMPA symbols in the language-independent set with trailing backslash are now recognized by MAUS (and the old form with '-'!).
  * added chunker benchmark deu-DE short to maus benchmark CHECK/maus.checklist.
  * added new option INSYMBOL=sampa|ipa that allows IPA symbols instead of SAMPA/X-SAMPA in input files.
  * deprecated option INFORMAT.
  * BPFTHRESHOLD is now compared to the number of KAN lines and, if USETRN=true, to the number of word links in a single TRN line; this allows correct pre-validation of chunks in maus.trn, and the BPFTHRESHOLD=9999999999 in maus.trn internal MAUS calls can be removed.
  * re-worked and harmonized (across languages) the modelling of silence and noise:
    Automatic modelling: Maus will automatically insert optional silence models (HMM '#') between words (see option MINPAUSLEN) and output these as 'detached' silence segments '' (with word number -1) if they exceed MINPAUSLEN times 10msec. The same is true for utterance initial/final silence, but these are modelled non-optional (HMMs '<' and '>'), and therefore have a minimum length; to suppress this use NOINITIALFINALSILENCE=true.
    Manual modelling: Intra-word silence intervals can be modelled by inserting the symbols '' (optional silence) or '<' (enforced silence) in the canonical input string ('#' in the phonological input will be ignored because in some phonological forms it marks a compound boundary! This is not the case for option KANSTR, though!); e.g. /ba:nhof/ will model an optional silence interval between /n/ and /h/; in the MAUS output these models appear as '' segments (or do not appear at all). Intra-word silence intervals are always linked to the word number in which they appear. If an optional '' is the only symbol within a word, it will be modelled by a non-optional silence model (HMM '<') because HTK cannot model words that consist only of a t-model; it will appear as a single segment '' linked to that 'silence word'. It is allowed to model a 'silence word' as // (where '...'
is an arbitrary string without blanks, but not one of 'usb' or 'nib') in the KAN input tier; both will model a non-optional silence model and both will produce a '' in the phonetic output that has a word link, and the 'word' appears as a numbered word in the ORT/KAN tiers (see TAGS PASSING below).
    To summarize: ('#' symbolize word boundaries here, '<' '>' utterance begin/end)
    KAN input    MODEL              ORT/KAN OUTPUT   MAU OUTPUT
    ##           non-human noise    ''               segment // with word number
    ##           human noise        ''               segment // with word number
    #<...>#      silence word       '<...>'          segment // with word number
    #......#     non-human noise    '......'         segment /....../ with word number
    #......#     human noise        '......'         segment /....../ with word number
    #...<...#    non-optional sil   '...<...'        segment /....../ with word number
    #......#     optional sil       '......'         segment /....../ with word number or deleted
    #            (word boundary)    -                segment // with word number -1 or deleted
    <            (initial sil)      -                segment // with word number -1
    >            (final sil)        -                segment // with word number -1
    (the last three lines are not possible inputs, but are modelled automatically!)
  * added tags passing feature: unknown tags '<...>' given as words (not embedded in other symbols!) in the input KAN tier are modeled by non-optional silence; this allows passing arbitrary tags to the ORT/KAN tier output of MAUS, e.g. a speaker ID etc. To pass such tags through G2P from the orthographic input use the g2p.pl option -com yes.
03.01.17 : maus 3.1 : suppressed warnings from helper programs, unless debug level is v > 0
04.01.17 : maus 3.2 : added MINNI service for cat-ES
05.01.17 : maus 3.3 : replaced provisional parameter set for spa-ES by new set trained on GLISSANDO News corpus (thanks to Juan Maria Garrido for providing GLISSANDO and Bernhard Jackl for the MAUS training).
added MINNI service for spa-ES
05.01.17 : maus 3.4 : fixed KANINVENTAR.inv for languages spa and cat, added some clones to the spa-ES and cat-ES HMM set that are more suitable
           chunker 0.5 :
           - Fixed a bug that led to a seg fault in cases where the KAN tier's first entry is a tag or pause.
           - Moved this History section to a separate file HISTORY.
           - Help page now lists all available languages.
           - Set MAXNUMTHREADS option default to 1 in master.config.
13.01.17 : maus 3.5 : [internal: setting SOURCE variable automatically; this requires the scripts to reside in the installation dir; symbolic links to these scripts work fine; changed the pre-installation script mk_distribution and the makefile in dist:/share/local/sources/ips_utils/makefile]
19.01.17 : maus 3.6, maus.trn 1.18 : if option USETRN was set to true and the input BPF contained a TRN tier with more than 2 lines, the script did not remove temporary files in TEMP (CLEAN=0 instead of default CLEAN=1).
           maus 3.7 : revised version of mlt-MT phonetic symbols set (now 70) to work with revised version g2p.pl 1.54. Patch of wrong HMM entry in PARAM.rus-RU/DICT.bigram that prevented the usage of MINNI modus for Russian. Added option value 'bpf' for OUTFORMAT as a synonym for 'par' to conform to BALLOON services (that use 'bpf').
26.01.17 : maus 3.8 : changed top level name in emuDB output (*_annot.json) from 'utterance' to 'bundle' to conform to EMU-SMDB nomenclature.
27.01.17 : maus 3.9 : bug fix: if OUTFORMAT=emuDB and the input BPF contained blank separated phonetic symbols in the KAN tier ('bpfs' style), then in the output *_annot.json file the label 'cano' contained only the first phonetic symbol of the KAN tier -> fixed.
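As an illustration of design rule 2 from the maus 3.0 entry (masking of input symbols for GRAPHINVENTAR), the two substitutions can be sketched as below. This is a hedged Python sketch only; the real mapping is done by the helper kan2mlf.awk, and the function name here is hypothetical.

```python
def kan_to_graph_symbol(symbol):
    """Mask one KANINVENTAR symbol for internal (GRAPHINVENTAR) use.

    Hypothetical helper illustrating the two rules stated in the
    maus 3.0 entry; not the actual kan2mlf.awk implementation.
    """
    # Rule: symbols with a leading numeral are masked with 'P',
    # e.g. /6/ -> /P6/, /2:I/ -> /P2:I/.
    if symbol[:1].isdigit():
        symbol = "P" + symbol
    # Rule: a trailing backslash is replaced by a trailing '-',
    # e.g. /r\/ -> /r-/.
    if symbol.endswith("\\"):
        symbol = symbol[:-1] + "-"
    return symbol
```

Because masked symbols start with 'P', the rule that GRAPHINVENTAR must not contain a plain 'P' keeps the masking unambiguous.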
03.02.17 : maus 3.10 : bug fix in LANGUAGE=eng-AU and MODUS=bigram : MINNI service did not work, caused by a buggy setting in DICT.bigram -> fixed
07.02.17 : maus 3.11 : there were complaints that the last segment delivered by maus does not end exactly at the end of the signal file; although this is not a requirement for all annotation formats that maus produces, we implemented a fix, so that the last segment delivered by MAUS always ends exactly with the signal.
16.02.17 : maus 3.11 : changed level/attribute names in emuDB output: 'word' -> 'ORT', 'cano' -> 'KAN', 'phonetic' -> 'MAU'; the idea is that names that consist of three capital letters are syntactically defined in the BPF standard, see: http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html#Partitur
24.02.17 : chunker 0.10 (see Chunker/HISTORY)
           maus 3.12 : added missing SAMPA symbol /N/ for hun-HU
28.02.17 : maus 3.13 : patched missing input symbol /4/ for spa-ES
14.03.17 : maus 3.14 : bug fix in LANGUAGE=mlt-MT : the phonemes /r/ /tts/ and /hh/ did not work in input.
10.04.17 : maus.pipe 1.0 : wrapper for 'pipeline' services, WARNING: Alpha.
13.04.17 : maus.trn 1.19 : check and removal of temporary files improved; added missing OUTFORMAT=bpf
13.04.17 : maus.pipe 1.2 : restructured tool; added pipe 'G2P_CHUNKER_MAUS'
17.04.17 : maus.pipe 1.3 : enforce CHUNKER parameter insymbols='sampa' for pipe 'G2P_CHUNKER_MAUS'; added PHO2SYL pipe service
           maus.pipe 1.4 : extended to all possible pipes; re-worked CMDI option descriptions; improved error reporting about missing options
18.04.17 : maus 3.15 : bug fix: the script attempted to remove temporary files in $TEMP if they already existed; this failed if the temp file is owned by a different user, even if the temp file has full rights (666); now the script does not remove the temp file but simply tries to overwrite it (which always works for 666 rights).
18.04.17 : maus.pipe 1.5 : several bug fixes; changed temporary file handling
19.04.17 : maus.pipe 1.6 : removed pipe CHUNKPREP_G2P; default test for rate on rate=1
20.04.17 : maus.pipe 1.7 : added pre-check of mandatory parameter OUTFORMAT; the user now gets an error message *before* the pipe starts if the last service does not support OUTFORMAT; that way the user does not have to wait until the end of the pipe to get an error.
21.04.17 : maus 3.16 : added language Basque eus eus-ES eus-FR (all the same model): clone model
21.04.17 : maus.pipe 1.8 : bug fix: if PHO2SYL is asked to produce a TextGrid (which it can do), the maus.pipe script issues an error -> fixed.
26.04.17 : maus.pipe 1.9 : fixed several small bugs concerning error reporting; added pre-check for compatibility of TEXT input file extension to first service in pipeline; synchronized options tier_G2P (tgitem) and tier_CHUNKPREP (tier) into a common option InputTierName
27.04.17 : maus.pipe 1.10 : deprecated parameter 'rate'; sample rate is now determined from input SIGNAL; since we only allow pipelines that require a SIGNAL, this is a convenient way to get rid of a parameter most users don't understand anyway.
01.05.17 : maus 3.17 : extended helper mausbpfDB2emuDB so that *.wav|nis|nist|sph can be input instead of only *.wav; this can be relevant in webservices that batch process a mixture of *.wav and NIST/SPHERE signal files.
08.05.17 : maus 3.18 : suppress warning that occurs if the file given in parameter OUT coincidentally is the same file that MAUS uses internally to create a *_annot.json file before moving it to OUT; extended BPF-to-emuR conversion (scripts mausbpfDB2emuDB, mausbpf2emuR) to handle optional syllable tiers KAS/MAS; this is not necessary for conversion of maus output, but maus.pipe utilizes this conversion for the pipeline final PHO2SYL service and the web API uses it to build the emuDB ZIP after batch processing; option emuRtemplate of script mausbpfDB2emuDB replaced by option emuRDBname= (the old option is recognized for backward compatibility, though); Config template is now in-document; removed template from package.
11.05.17 : maus 3.19 : changed the way MAUS parses the phonological input (KAN: tier): up to now we assumed that the KAN tier is encoded in SAMPA, which implies left-right parsable phoneme sets. We also allowed blank separated KAN strings, where each phoneme symbol is already separated by a blank, to allow language independent SAMPA; technically we deleted all blanks from the input and then parsed left-right. With the advent of more and more languages that do not have a SAMPA definition (e.g. eng-SC) and at the same time do not allow left-right parsing, we decided to make this somewhat sloppy convention more strict: if the KAN tier contains a 4th column (the concatenated KAN string is usually in the third), we assume that the phonological input is blank separated, and we do not try to left-right parse it, but simply check whether each blank separated symbol is part of the language phoneme set. Consequently, some languages in MAUS can only be processed with blank separated input (but this is not a major issue, since G2P outputs blank separated KAN tiers as default anyway).
Faulty encodings where the KAN input is basically not blank separated but for some reason there are two 'words' in the KAN tier line (such as Italian 'sedia' /sedj a/) are not supported any longer.
11.05.17 : maus 3.20 : added a primitive 'used options protocol' for OUTFORMAT=par|mau-append|legacyEmu|emu|EMU|emuR|emuDB : in case of some form of BPF output, a header entry 'MAO:' is added to the BPF header containing a list of named value pairs 'OPTION=value'; in case of a legacy Emu file, the options are added as labels to the top level bundle; in case of an emuDB _annot.json file a list of named value pairs 'OPTION=value' is stored in a level attribute 'MAO' to level 'bundle'.
12.05.17 : maus 3.21 : bug fix: the deu-DE phoneme set was lacking the 4 affricates /ts,tS,dZ,pf/ (which are in the German SAMPA definition by Wells); the reason this never caused a problem is that the sloppy way MAUS parsed the KAN input string (before 3.19) simply split those into single phonemes. Now we added those 4 affricates to the deu-DE HMM set.
18.05.17 : maus 3.22 : added some filters to remove LF from the input BPF file to avoid unexpected effects when these are transferred to output formats other than BPF.
31.05.17 : maus 3.23 : bug fix in mausbpf2emuDB : blanks were deleted in label strings before transferring to _annot.json, now white space sequences are condensed to a single blank in the label string; backslashes were not masked in _annot.json files, now they are masked as required. Some languages had incomplete KANINVENTAR.inv lists (phones with trailing backslashes were missing, this did not affect MAUS processing). Extended helper mausbpf2emuR so that a flat hierarchy (only phonetic tier as from MINNI output) is processed correctly; extended mausbpfDB2emuDB to handle flat hierarchies and added schema check for DBConfig. Added maus OUTFORMAT=bpf|par|emuDB|emuR for MODUS=bigram (minimal header with SAM entry), so that MINNI can produce emuDB like other services.
Languages mlt, eus, spa had some minor problems with optional silence modelling. 01.06.17 : maus.pipe 1.13 : added pipeline MINNI_PHO2SYL 02.06.17 : maus 3.24 : MODUS=bigram : until this version an input BPF was ignored. Now if a BPF is given, the MINNI result is added to the input BPF (possibly replacing an existing MAU tier) as in the other modi. This works only for OUTFORMAT=par|bpf|mau-append. Added option createDBConfigOnly= to helper mausbpfDB2emuDB; if set, the script only writes the DBConfig.json file and exits. 07.06.17 : maus.pipe 1.14 : G2P option 'com' was not implemented 09.06.17 : maus 3.25 : changed validators for _annot.json and _DBconfig.json in helpers mausbpf2emuR and mausbpfDB2emuDB to more stable servers and set option validate=true as default. bug: temporary dirs $TEMP/$PID__BPFDIR were not cleaned up -> fixed maus.pipe 1.15 : PIPE=MINNI_PHO2SYL : the MAS tier was not converted into the _annot.json file correctly; currently maus will pass on existing tiers in the input BPF and just add/replace the MAU tier; therefore in a PIPE (which requires a TEXT input) where the BPF input already contains ORT/KAN etc. tiers, the result will be a BPF that has partly hierarchical tiers and partly not (MAS from MINNI and consequently a MAS without links); this poses a problem when converting to emuDB output; in this version MAU and MAS without links will be converted but any other tiers in input BPF will not. 22.06.17 : maus.pipe 1.16 : MAUS option INSYMBOL was missing; this caused the PIPE=MAUS_PHO2SYL to report unknown phoneme symbols in input when INSYMBOL=ipa 22.06.17 : maus 3.26 : changed OUTFORMAT=csv : instead of simply producing the MAU tier in a file with extension '*.csv', now a real CSV spreadsheet file with ';' separated 6 columns is created: ORT;KAN;MAU;TOKEN;BEGIN;DURATION Note that the structure is fixed, even if a column is empty (e.g.
in MODUS=bigram where no info regarding ORT or KAN are in the output); segments with token number -1 (not linked) have empty fields ORT and KAN. 23.06.17 : callHavenOnDemandASR 1.1 : wrapper to perform automatic transcription via HPE ASR callHavenOnDemandASR 1.2 : APIKey can be given as empty string on command line, then the internal default APIKey is used; introduced check of signal length less than 30min; bug in file size check: 1MB > 1GB. 27.06.17 : maus 3.27 : added acoustic models for nld-NL and nld-BE based on phonetic transcripts from the CGN corpus (NL + VL). callHavenOnDemandASR 1.3 : longer files did not receive results due to bug in polling routine: fixed Note: each poll costs API units; therefore I build in some 30sec delays to avoid larger costs. 28.06.17 : maus 3.28 : added basic language support for Romanian ron-RO 28.06.17 : maus 3.29 : fixed missing three lines in PARAM.ron-RO/KANINVENTAR.inv 30.06.17 : maus.trn 1.21 : changed code (same functionality) to tolerate input files with hashes ('#') in the file name 04.07.17 : maus 3.30 : added full MAUS and MINNI service for nld-NL based on CGN corpus added acoustic modelling and MINNI service for nld-BE based on CGN corpus 05.07.17 : maus.pipe 1.17 : added mapping file maus.pipe.G2P.mapping that allows pipes with mixed language settings: If MAUS supports a language, but BALLOON does not, the alternate language for BALLOON tools is read from this mapping file maus 3.31 : added full support for nld-BE: pronunciation model is trained with reference to nld-NL, so that systematic deviations from nld-NL to nld-BE are covered by MAUS; call G2P with lng = nld-NL for nld-BE 14.07.17 : maus 3.32 : added basic support (forced alignment) for Australian Aboriginal Languages (aus-AU) Unsolved problem: the input symbol 'r\`' cannot be processed; therefore input BPF containing this symbol fail. 
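The fixed 6-column layout of OUTFORMAT=csv introduced in maus 3.26 can be sketched as follows. This is only an illustration under assumptions: the function name, the input data structures and the absence of a header row are hypothetical, and the real conversion in maus may differ in details.

```python
import csv
import io

# Column order as documented for maus 3.26
COLUMNS = ["ORT", "KAN", "MAU", "TOKEN", "BEGIN", "DURATION"]


def segments_to_csv(segments, words):
    """Render MAU segments as ';'-separated rows.

    segments: list of (begin, duration, token, label) tuples (assumed layout)
    words:    dict mapping token number -> (ort, kan)
    Segments with token -1 (not linked) get empty ORT and KAN fields,
    but the number of columns stays fixed.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    for begin, duration, token, label in segments:
        ort, kan = words.get(token, ("", "")) if token >= 0 else ("", "")
        writer.writerow([ort, kan, label, token, begin, duration])
    return buffer.getvalue()


segments = [(0, 1599, -1, "<p:>"), (1600, 799, 0, "h")]
words = {0: ("hallo", "halo:")}
print(segments_to_csv(segments, words), end="")
```

In MODUS=bigram, where no ORT or KAN information is available, the `words` mapping would simply be empty and every row would start with two empty fields.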
16.07.17 : maus 3.33 : changed utterance initial/final silence modelling: models are now optional silence models (HMM #...); added explicit silence model '<p>' to replace the awkward usage of '<' as non-optional silence model; retain option NOINITIALFINALSILENCE=true to suppress even the optional models. 18.07.17 : maus 3.34 : fixed bug in rec2mau.awk that caused zero length initial/final silence intervals *not* to be suppressed (they had minimum length of 1 frame); now zero length silence intervals are suppressed. 19.07.17 : runASR 1.1 : added OUTFORMAT=emuDB 20.07.17 : maus 3.35 : added LANGUAGE=eng-SC; re-calculated cloned HMM for por-PT, set schwa-elision rule set as default for por-PT 21.07.17 : maus.pipe 1.18 : changed handling of 'mixed language pipes': up to now only G2P changed its language option depending on the mapping in maus.pipe.G2P.mapping. But since the same problem can happen in the other direction, i.e. the pipe is called with gsw-CH-BE but MAUS knows only gsw-CH, we added the mapping maus.pipe.MAUS.mapping. 26.07.17 : maus 3.36 : extended database Pan_AUS for aus-AU acoustic training; enabled MODUS=bigram (MINNI) for aus-AU the default MAUSSHIFT=10 value was set to 0; the value of 10msec shift (which was used for most languages) contradicts our newest findings (see BA thesis of B. Jackl 2017, LMU Munich) that the systematic shift of MAUS segment boundaries is caused by a bias in the training material of the acoustic model of MAUS; the value 10msec is therefore only valid for the German MAUS set, but not for most other languages (which were trained on other language data); therefore starting with this version only the German (10), Catalonian (-4) and Spanish (-4) have specific MAUSSHIFT values, all other languages use the default of 0. Helper mausbpfDB2emuDB extended to handle *_annot.json and *.par files that contain only an ORT tier as delivered by runASR.
28.07.17 : maus 3.37 : re-training of acoustic model of aus-AU on PanAUS 0.5.1 03.08.17 : maus 3.38 : bug-fix: in the (probably rare) case that maus is called with USETRN=force and the pre-segmentation estimates a TRNOFFSET=0 an initial silence segment with negative duration -1 was created. 16.08.17 : maus 3.39 : bug fix in mausbpf2emuR : MAS tiers in recordings with only one word were not processed bug fix in maus : some temporary files were created without chmod 666 and therefore not removable 18.08.17 : maus.pipe 1.20 : added G2P option imap; if set, lng=und is set automatically without changing LANGUAGE 28.08.17 : maus 3.40 : enabled the disabled WARNING, if INS{ORT|KAN}TEXTGRID=true but has no effect 31.08.17 : maus.pipe 1.21 : check input TEXT file if empty before starting the PIPE to avoid confusing ERROR messages 07.09.17 : maus.pipe 1.22 : bug in LANGUAGE mapping caused LANGUAGE=gsw-CH to fail -> fixed (and patched 1.21) maus 3.41 : disabled WARNING 'options INS***TEXTGRID have no effect' 09.09.17 : maus 3.42 : added basic alignment service for language nor-NO based on corpus 'NB Tale' (thanks to Johanna Cronenberg) maus 3.43 : added MINNI service to nor-NO 12.09.17 : maus 3.44 : added basic service for jpn-JP based on CSJ corpus 13.09.17 : maus 3.45 : added MINNI service for jpn-JP based on CSJ corpus with merged sub-phonemic plosives (i.e. MINNI does not recognize *_cl and *_rl) 24.09.17 : maus 3.46 : Japanese phonemes with embedded backslash '\' were not handled correctly; the conversion from X-SAMPA (KANINVENTAR) to internal symbol set (GRAPHINVENTAR) now replaces all backslashes by '-', not only trailing backslashes, e.g.
'N\N\' becomes internally 'N-N-' 28.09.17 : maus 4.0 : major update due to several internal re-codings; new features are: - introduce video processing: unknown extensions are treated as video input, audiotrack is extracted from video (using ffmpeg) and processed as input, if possible, the original sampling rate of the audiotrack is being used, otherwise output is based on 16000Hz sampling rate. - maus reports all ERRORS and WARNINGS now to stderr instead of stdout - internal: conversion to NIST deprecated; all internal processing now based on RIFF WAVE maus.pipe 2.0 : major update due to several internal re-codings; reports all ERRORS and WARNINGS now to stderr instead of stdout maus.trn 1.22 : reports all ERRORS and WARNINGS now to stderr instead of stdout par2Textgrid 1.3 : reports all ERRORS and WARNINGS now to stderr instead of stdout 09.10.17 : maus 4.1 : some bug fixes caused by internal re-coding, added WARNING when signals with less than 16kHz are processed. 11.10.17 : maus.pipe 2.2 : removed buggy video conversion: all services now process video on their own 17.10.17 : maus.pipe 2.3 : added ASR option 'diarization' 19.10.17 : maus.pipe 2.4 : added quota pre-check for pipes with MAUS 26.10.17 : maus 4.2 : added symbols /O:, I:, 6:/ to language aus-AU 01.11.17 : maus 4.3 : re-coded mausbpfDB2emuDB, mausbpf2emuR plus helpers: code is now generic, so that all combinations of BPF tiers are transformed 03.11.17 : maus 4.4 : bug in mausbpfDB2emuDB: *_annot files were not analysed correctly, fixed; extended OUTFORMAT=csv by a 7th column carrying the speaker diarization (if in input BPF, otherwise column SPEAKER is empty). 
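The backslash handling changed in maus 3.46 (all backslashes of an X-SAMPA symbol are replaced by '-', not only trailing ones) can be sketched in a few lines. The function names are hypothetical, and the real KANINVENTAR to GRAPHINVENTAR conversion may apply further maskings that are not shown here.

```python
def to_internal_symbol(xsampa):
    """Behaviour from maus 3.46 on: replace every backslash by '-'."""
    return xsampa.replace("\\", "-")


def to_internal_symbol_before_3_46(xsampa):
    """Earlier behaviour as described in the entry: only trailing backslashes."""
    stripped = xsampa.rstrip("\\")
    return stripped + "-" * (len(xsampa) - len(stripped))


symbol = "N\\N\\"  # the X-SAMPA string N\N\ written as a Python literal
print(to_internal_symbol(symbol))              # N-N-
print(to_internal_symbol_before_3_46(symbol))  # N\N-
```

The old variant leaves the embedded backslash in place, which is the case the 3.46 entry reports as not handled correctly.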
06.11.17 : maus.pipe 2.5 : bug in emuR output of service PHO2SYL, fixed; changed default G2P option to '-com yes' 06.11.17 : maus 4.5 : bug in helper mausbpf2emuR : input BPF without class 4 but class1mult BPF tiers caused a syntax error in output, fixed 09.11.17 : maus.pipe 2.6 : changed module PHO2SYL: depending on BPF input from the pipe a syllabification of KAN (-> KAS) or a syllabification of MAU|SAP|PHO (in that order, first found is used) or both are performed; the (senseless) option 'phontier_PHO2SYL' is now obsolete in maus.pipe, but still accepted by the script maus.pipe 2.7 : bug in OUTFORMAT=TextGrid and PIPE=*_PHO2SYL fixed 10.11.17 : maus.pipe 2.8 : changed temporary file storage to unique file names 17.11.17 : maus 4.6 : removed misleading Swiss German variants gsw-CH-* from LANGUAGE set; simplified LANGUAGE to PARAM dir mapping (less maintenance required, only the PARAM dirs define which languages are supported (as with chunker btw)); adapted maus.pipe for MAUS and CHUNKER processing accordingly. 21.11.17 : maus.pipe 2.9 : added checks for file type and existence of TEXT, RULESET and imap before starting the pipe; TEXT input is ignored for pipes that do not require TEXT input and a WARNING is issued. 22.11.17 : maus 4.7 : added pronunciation model nor-NO based on NB Tale corpus 27.11.17 : maus.pipe 2.10 : bug when called without TEXT argument: wrong ERROR message, fixed. 07.12.17 : maus 4.8 : new improved version of language spa-ES : the GLISSANDO corpus re-labelled, acoustic and pronunciation models re-trained. 08.12.17 : maus.pipe 2.11 : added pipes CHUNKER_MAUS and CHUNKER_MAUS_PHO2SYL 01.02.18 : maus 4.9 : bug in LANGUAGE=spa-ES : the acoustic model used a non-optional silence model for optional inter-word silence modelling instead of an optional silence model; this caused very bad segmentation results for spa-ES; this error is probably relevant only for maus version 4.8 (7. Dec 2017 - 30.
Jan 2018) 02.02.18 : maus 4.10 : added cross check for RULESET extension (rul|nrul) vs. types (statistical|phonological) maus 4.11 : added option PRESEG to replace deprecated USETRN=force; that way pre-segmentation can be applied to chunks (USETRN=true PRESEG=true); until next rollout USETRN=force still works; for a short period (2.2.-16.2.18) there was a bug in this version that in very rare cases caused the service WebMAUS Basic to crash; this was patched without new version on the 16.2.18, 09:30 03.02.18 : maus.pipe 2.12 : added MAUS option PRESEG (default is false) 26.02.18 : maus.pipe 2.13 : added correct handling of PHO2SYL -lng 'und' option when pipe has either LANGUAGE=sampa or the G2P service uses an imap 30.09.18 : maus 4.13, maus.trn 2.1 : maus now correctly distinguishes between a proper single TRN entry with word number list and an improper TRN as output by wav2trn. There is a problem with the eng-US rule sets trained on TIMIT: it seems that the rule sets with pruning = 5 contain so called 'replacement rules' with a ln() = -0.000001 that effectively always apply to the left-hand context of the rule. word_var-2.0 sometimes crashes, when such a context appears in the input. Since I could not figure out what the problem is (the rules look perfectly normal), I replaced the ln() = -0.000001 by a lower probability, and then the error vanished. A lower prob. than 1.0 for a rule makes sense anyway, since the acoustics should in the end decide whether the replacement is applicable. Changed both rule sets with pruning = 5 accordingly; copies of the old versions are retained in files *.20181002 03.10.18 : maus 4.14, maus.trn 2.2, mausbpfDB2emuRDB, par2emu : made file names of temporary files unique; there have been problems with temporary files that were left by debugging on the server; this fix should solve this problem in the future. 10.10.18 : maus 4.15 : fixed INSYMBOL=ipa : when the input IPA contained symbols that are not actually IPA (e.g.
'I' instead of 'ɪ'), the script simply ignored these symbols so that the output missed a phoneme. From this version on maus issues an error as soon as any symbol appears in the canonical input that is not defined by the mapping tables IPATABLE1 and IPATABLE2. 07.11.18 : maus 4.17 : added LANGUAGE=tha-TH; basic forced alignment added 'c_h' 'ts\' 'ts\_h' to PARAM.SAMPA/PLOSIVES 09.11.18 : maus 4.18 : added WARNING for the case that a phonetic symbol in the output cannot be mapped to ipa, manner or place (option OUTSYMBOL) because of missing information in IPATABLES. added MINNI for tha-TH based on phonemic transcripts in LOTUS; changed handling of phonological input: 11.11.18 : maus 4.19 : trailing tone markers ('..._1' - '..._5'; e.g. in Thai) are deleted from input, since MAUS does not differentiate between tones. re-worked OUTSYMBOL=place tables, KANINVENTAR.inv tables, aggregated new HMM to SUPERHMMs 13.11.18 : maus 4.20 : added language deu-LU by extending and cloning deu-DE inventar and HMM set; the extension was based on the phoneme set of Peter Gilles, but all deu-DE symbols are still maintained.
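The tone marker handling of maus 4.19 (trailing '_1' to '_5' are dropped from the input symbols) amounts to a small string operation. The following sketch uses a hypothetical function name and assumes that the marker is always the last part of the symbol; it does not cover later changes to tone handling.

```python
import re

# Trailing tone markers _1 ... _5 as described for maus 4.19
TONE_MARKER = re.compile(r"_[1-5]$")


def strip_tone_marker(symbol):
    """Remove a trailing tone marker, leave all other diacritics untouched."""
    return TONE_MARKER.sub("", symbol)


for symbol in ["a:_3", "k", "d_0"]:
    print(symbol, "->", strip_tone_marker(symbol))
```

Diacritics such as '_0' are outside the range 1 to 5 and therefore remain part of the symbol.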
14.11.18 : maus 4.21 : changed language code deu-LU to ltz-LU; updated PARAM.SAMPA for new symbols in ltz-LU 15.11.18 : maus 4.22 : set a default phonological rule set for tha-TH that allows replacement of canonical /r/ by /l/ in any context, and replaced the Thai /r/ HMM by the Italian /r/ HMM 18.11.18 : maus.pipe 3.0 : added service SUBTITLE and changed structure : partial pipes ..._MAUS[_SUBTITLE][_PHO2SYL] are seen as one building block because SUBTITLE and PHO2SYL only appear in pipes that contain MAUS and always after MAUS; this makes the code of maus.pipe much shorter and better maintainable 22.11.18 : maus 4.23 : added (default) phonological rule set ltz-LU_manualRules.nrul kindly provided by Peter Gilles, University of Luxembourg; ltz-LU is now using this rule set instead of the statistical rule set of German; in case you want to use the German rule set use option RULESET=deu-DE_rml-0.95.rul. 23.11.18 : maus 4.24 : ltz-LU : added phoneme /d_0/ (devoiced alveolar plosive) 28.11.18 : maus 4.25 : ltz-LU : added/corrected rules in the phonological rule set ltz-LU_manualRules.nrul 08.12.18 : maus 5.0 : major update because output format CSV has changed (not backwards compatible!) OUTFORMAT=csv has been extended from a 6-column table to a 11-column table; CSV now contains data from the BPF tiers ORT,KAN,TRO,KAS,SPK,MAS,MAU,TRN; conversion is now performed by external helper mausbpf2csv which is part of the MAUS distribution (and can be used as a conversion tool on its own) maus.pipe 4.0 : major update because output format CSV has changed (not backwards compatible!); enabled emuDB|emuR|csv output for pipes ending on ..._SUBTITLE; enabled csv output for pipes ending on ..._PHO2SYL; now almost all output formats are possible for almost all pipes.
12.12.19 : maus.pipe 4.2 : added '-verb 0' to G2P service; bug fix: some pipes reported ERROR but returned exit 0 - fixed; bug fix: G2P reported a WARNING because it got an empty -imap option - fixed 13.12.18 : maus 5.1 : added LANGUAGE=swe-SE (cloned from Norwegian, no pronunciation model); bug fix: if the last label in the MAUS result started with a '{' and the last segment needed correction, maus terminated with a shell ERROR -> fixed 16.12.18 : maus 5.2 : added Albanian LANGUAGE=sqi-AL: mainly cloned from Hungarian 27.12.18 : maus.pipe 4.3 : changed LANGUAGE mapping in modules (internal) callGoogleASR 2.3 : fixed the way quotas are printed maus 5.3 : added missing optional silence '' to language spa-ES 02.01.19 : maus 5.4 : disabled WARNING that the signal is extracted from a video input because this in combination with maus.trn produces very long WARNING output. 04.01.19 : maus 5.5 : fixed a bug in option INSYMBOL=ipa : in some rare cases a wrong IPA->SAMPA mapping was applied which caused an 'unknown symbol' ERROR; option INSYMBOL=ipa : KAN tier is passed to output as IPA (was transformed into SAMPA in earlier versions); input MP4 with more than one soundtrack caused ERROR: now the default soundtrack is selected, if multiple soundtracks, the script checks whether LANGUAGE matches the default soundtrack and gives a WARNING when mismatch; maus.pipe 4.4 : input MP4 with more than one soundtrack caused ERROR; now the default soundtrack is selected 07.01.19 : maus 5.6 : the pre-processed SIGNAL is now passed onto maus.trn, not the original SIGNAL; this avoids that the maus calls in maus.trn repeat e.g. the extraction of a soundtrack from video input over and over again. maus.pipe 4.5 : a video input is not passed through the pipe as video any more but rather as the default soundtrack (extracted by ffmpeg); this avoids that services in the pipe extract the soundtrack over and over again, and - even worse - might extract different tracks.
14.01.19 : maus 5.7 : using installed HTK tools instead of copies in the distribution SOURCE dir 20.01.19 : maus.pipe 4.6 : bug fix in PHO2SYL language mapping 27.01.19 : maus 5.8 : added SAMPA /X\/ (uvular fricative) to SAMPA inventory 01.02.19 : maus.pipe 4.7 : changed CHUNKER call so that signals with capital extensions (e.g. '.WAV') are accepted by chunker 19.02.19 : maus.pipe 5.0 : internal re-organisation of sources, functionality the same 06.03.19 : maus.pipe 5.1 : added module ANONYMIZER 09.03.19 : maus.pipe 5.2 : changed module SUBTITLE so that in case no original transcript is given via the TEXT input to the pipe, the transcript is either recovered from TRN (CHUNKPREP in PIPE) or from TRL|TR2|TRS (input to the PIPE is BPF) tier(s) or - if everything fails - from the ORT tier. If there is a module ANONYMIZER before SUBTITLE in the pipe, the original/recovered transcript is anonymized according to the list in ATERMS before passing it to SUBTITLE. 11.03.19 : par2Textgrid 1.5 : all existing BPF tiers (ORT,KAN,MAU|SAP|PHO|IPA,MAS,TRN) in input are converted by default; no WARNING if a tier is not present; added TRN tier 11.03.19 : maus 5.9 : OUTFORMAT=TextGrid : all BPF input tiers are passed on to par2TextGrid 13.03.19 : maus.pipe 5.3 : added video input support for AVI and FLV maus 5.10 : bug in helper par2TextGrid 1.5 caused errors in TextGrid TRN tier 14.03.19 : maus.pipe 5.4 : fixed bug in SUBTITLE module when recovering original transcript from BPF input 26.03.19 : par2TextGrid 2.1 : complete re-write of par2TextGrid; this is a non-backwards compatible update! par2TextGrid is now a generally usable tool to convert most types of BPF files into standard praat TextGrid. All BPF tiers that are currently supported are converted automatically; for backwards-compatibility the options INSORTTEXTGRID=false and INSKANTEXTGRID=false are still recognized and the tiers are suppressed accordingly. The main new feature is that BPF with parallel class 4 time layers (e.g.
MAU and SAP and WOR in the same file) are now possible: the output TextGrid then contains blocks of intrinsically synchronous layers that are all derived from one class 4 BPF tier, e.g. if the input BPF contains SAP and MAU and ORT and KAN, the TextGrid will have the layers ORT-SAP, KAN-SAP, SAP, ORT-MAU, KAN-MAU, and MAU. maus.pipe 5.5 : extended OUTFORMAT support using par2TextGrid 2.1 in pipes that end on CHUNKER, PHO2SYL and SUBTITLE. 28.03.19 : maus.pipe 5.6 : removed original transcript recovery from TR* BPF tiers: only ORT is used! Some minor bug fixes in mausbpf2emuR and mausbpf2csv. 20.04.19 : maus.pipe 6.0 : major update introducing option 'Keep everything' (KEEP=true): output a ZIP archive instead of the normal output of the last service; this ZIP then contains not only the output of the pipe, but also the input data (marked in the file name with '_INPUT'), a _README.txt describing the input file names and pipeline options, as well as intermediate results of the pipeline that would otherwise be lost, because they cannot be passed through the rest of the pipe (e.g. an anonymized version of the input video produced by ANONYMIZER; intermediate results are marked in the file name with '_', e.g. if the input was 'Signal1.mp4' then an intermediate result produced by the ANONYMIZER service would be named 'Signal1_ANONYMIZER.mp4').
25.04.19 : maus.pipe 6.1 : added option '--list-pipes' (the name says everything) maus 5.12 : removed (German) g2p fall-back in maus; improved video processing; removed some out-of-date WARNINGS 27.04.19 : maus 5.13 : added phonemic symbols /Nm/, /kp/ (= double articulated), /e_r/, /o_r/ (raised) to (X-)SAMPA phoneme inventory 04.05.19 : maus 5.14, maus.trn 3.0 : re-worked maus.trn to process chunks in parallel maus.pipe 6.2 : bug fix : pipes starting with CHUNKER crashed 06.05.19 : maus.trn 3.1 : bug fix : the pre-screening for chunks too short to be processed did not work properly for KAN tiers with blank-separated SAMPA strings - fixed Crashed sub maus jobs were not handled gracefully (just wait for time-out) - fixed Chunks that could not be processed are now labelled (in the MAU tier) as 'chunkNotProcessed>' in segments of length 10 samples at the beginning of the chunk (word segments are accordingly mapped to these very short segments, but the word labelling stays as in the input, as does the TRN tier). Sub maus jobs that do not die cause the service to wait for time-out (currently 1600sec) maus.trn 3.2 : improved checking of sub maus jobs, when very many sub jobs crash, the main process might 'hang' forever, because the number of still running jobs (that were in fact crashed) was determined incorrectly - fixed Changed minimum average phone duration limit for pre-screening chunks in maus.trn to 40msec (was 30msec).
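The pre-screening of chunks by average phone duration (limit 40msec as of maus.trn 3.2) can be pictured with the sketch below. How maus.trn actually counts phones and measures the chunk length is not documented here, so the function name, the arguments and the 16 kHz default are assumptions for illustration.

```python
def chunk_too_short(chunk_samples, n_phones, samplerate=16000,
                    min_avg_phone_ms=40.0):
    """Return True if the chunk cannot plausibly hold its phoneme string.

    chunk_samples:    chunk duration in samples (e.g. from the TRN tier)
    n_phones:         number of canonical phonemes (KAN) inside the chunk
    min_avg_phone_ms: screening limit; 40 ms in maus.trn 3.2, 30 ms before
    """
    if n_phones == 0:
        return False  # nothing to align, nothing to reject
    average_ms = chunk_samples / samplerate * 1000.0 / n_phones
    return average_ms < min_avg_phone_ms


# A 1 second chunk with 30 phones averages about 33 ms per phone
print(chunk_too_short(16000, 30))
print(chunk_too_short(16000, 30, min_avg_phone_ms=30.0))
```

The second call shows the effect of the older 30msec limit on the same chunk.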
07.05.19 : maus.trn 3.3 : improved pre-screening : when RELAXMINDUR is set, the pre-screening average phone duration is set to 10msec; improved ERROR message of screening; 08.05.19 : maus.trn 3.4 : improved multi-threading and added option MULTITHREADING=true maus 5.15 : set default statistical rule set for eng-SC to rml.prune50smooth.rul, which is a robust set of rules (must see at least 50 occurrences of a rule); the rule sets with lesser pruning thresholds (10,20) are faulty; added option MULTITHREADING=true maus.trn 3.5 : built-in time-out for forking jobs (2h) in case that more sub jobs 'hang' than MAXFORK (which could result in an indefinite 'hang') maus 5.16 : added language afr-ZA, forced alignment only 11.05.19 : maus.pipe 6.3 : when a pipe has no TEXT input, the SUBTITLE service reconstructs the transcript from the ORT tier, but it can be that MAUS processed only a part of the ORT tier, if a subset was defined in TRN and USETRN=true was set. Starting from this version the reconstruction is then constrained to the ORT subset as defined in the TRN tier. 12.05.19 : maus.pipe 6.4 : changed G2P -oform (output format) to 'bpfs' (KAN tiers contain blank-separated phoneme symbols) 17.05.19 : maus 5.17 : changed deu-DE default MAUSSHIFT from 10.0 to 7.13 after re-validation 18.05.19 : maus 5.18 : introduced flexible frame rate for segment boundaries (option TARGETRATE); TARGETRATE is default 100000 units of 100nsec (= 10msec, backwards compatible), but can be reduced to minimum 10000 (= 1msec) framerate, if for instance segmental analysis requires more fine grained quantization. Note though that increasing the frame rate *does not* improve average MAUS accuracy (tested on German VM benchmark only!) nor improve the boundary deviation histogram.
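The unit arithmetic behind option TARGETRATE (maus 5.18) is easy to get wrong, so here is a small worked sketch. Only the unit of 100nsec, the default of 100000 and the minimum of 10000 come from the entry above; the function names and the 16 kHz example are assumptions.

```python
UNITS_PER_SECOND = 10_000_000  # one TARGETRATE unit is 100 nsec
DEFAULT_TARGETRATE = 100_000   # = 10 msec frame shift
MIN_TARGETRATE = 10_000        # = 1 msec frame shift


def frame_shift_ms(targetrate=DEFAULT_TARGETRATE):
    """Frame shift in milliseconds for a TARGETRATE given in 100 nsec units."""
    if targetrate < MIN_TARGETRATE:
        raise ValueError("TARGETRATE below the documented minimum of 10000")
    return targetrate / 10_000


def frames_to_samples(n_frames, targetrate, samplerate):
    """Convert a frame count into a sample offset using integer arithmetic."""
    return n_frames * targetrate * samplerate // UNITS_PER_SECOND


print(frame_shift_ms())                       # 10.0
print(frame_shift_ms(10_000))                 # 1.0
print(frames_to_samples(1, 100_000, 16_000))  # 160
print(frames_to_samples(1, 10_000, 16_000))   # 16
```

At 16 kHz the default setting quantizes boundaries to steps of 160 samples, the minimum setting to steps of 16 samples.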
28.05.19 : maus 5.19 : Albanian: phoneme /4/ was missing -> fixed 04.05.19 : maus.pipe 6.5 : added MAUS option TARGETRATE 06.06.19 : maus.pipe 6.6 : emuDB output file in KEEP ZIP had the wrong extension '._annot.json' -> fixed 10.06.19 : maus.pipe 6.7 : replaced media file pre-processing by a call to 'audioEnhance'; this enables pipes to process MP3 input, other bit resolutions than 16bit, multi-channel files. 11.06.19 : maus.pipe 6.8 : wrong content in KEEP=true ZIP output, if ANONYMIZER is last service -> fixed 11.06.19 : maus 5.20 : bug fix: with the introduction of AUDIOENHANCE in maus.pipe, maus did not insert the correct bundle name in emuR output *_annot.json when in a pipeline -> fixed 13.06.19 : maus 5.21 : added a number of X-SAMPA symbols to LANGUAGE=sampa on request of the DoReCo project 19.06.19 : maus 5.22 : added X-SAMPA /dz\/ to LANGUAGE=sampa maus.pipe 6.9 : changed options INSORTTEXTGRID and INSKANTEXTGRID to 'true' 24.06.19 : maus 5.23 : added X-SAMPA /@e/ /@:e/ to LANGUAGE=sampa maus.pipe 6.10 : added AudioEnhance option NOISEPROFILE 30.06.19 : maus 5.24 : added X-SAMPA /J_+/ /@`/ /z=/ to LANGUAGE=sampa 10.07.19 : maus 5.25 : bug fix : when using a 'nrul' set and the LANGUAGE contains X-SAMPA symbols that contain '\', the rules are not used correctly, i.e. symbols with '\' can be in the output although no rules are predicting them; from this version on X-SAMPA symbols that contain '\' are removed from the symbol set before calling the variant generator; this does not change any processing of X-SAMPA symbols, since rules containing such symbols cannot be used in 'nrul' sets anyway (because of '-' being the context separator) 11.07.19 : maus 5.26 : transcription tags of the form '<...>' that are passed through G2P (-com yes) and are modelled as explicit silence ('<p>') so that they are passed through maus as well, may now contain the characters '<>' in the tag string, e.g. '<tag>'. maus.trn 3.6 : bug fix : when running in chunk segmentation mode (USETRN=true) and NOINITIALFINALSILENCE=true a silence interval was inserted at the end of some chunks; from this version on the final segment always fits to the end of the chunk 13.07.19 : maus 5.27 : added X-SAMPA symbols for DoReCo maus 5.28 maus.trn 3.7 : bug fix : silence intervals of sample length 1 between consecutive chunks removed 14.07.19 : maus 5.29 maus.trn 3.8 : bug fix : the pre-check for impossibly short chunks did not consider reduced frame rate (via option TARGETRATE) 15.07.19 : patch in maus 5.29 maus.trn 3.8 : wav2trn calculates duration *2 samples* too high! maus.pipe 6.12 : option TARGETRATE was not passed on to module MAUS maus.trn 3.9 : changed length of dummy MAU segments of non-processed chunks so that they cover the complete chunk (and are therefore better visible in sound editors) 27.07.19 : maus.pipe 6.13 : added ASR option ACCESSCODE 30.08.19 : maus 5.30 : added 8 new X-SAMPA symbols to LANGUAGE=sampa 06.09.19 : maus 5.31 : bug fix in Georgian phoneme set: model /c_>/ was missing leading to an ERROR when in BPF input 23.09.19 : maus 5.32 : added 7 new X-SAMPA symbols to LANGUAGE=sampa 01.10.19 : maus 5.33 : integrated output format conversion using annotConv 03.10.19 : maus.pipe 7.0 : pipes with ASR are no longer supported without AAI authentication; LANGUAGE: added variants of English and Spanish that are supported by ASR module and mapped these to supported English and Spanish variants in modules G2P_PHO2SYL and MAUS 08.10.19 : mausbpf2emuR 5.33 : added BPF class 2 tier support (SPD IPA) mausbpf2csv 5.33 : added BPF class 2 tier support (SPD) 11.10.19 : annotConv 1.3 : added fallback option SAMPLERATE (just needed if the sample rate cannot be determined from the input BPF) mausbpf2eaf 1.3 : enabled SPD conversion as singular tier 14.10.19 : annotConv 1.4
: bug fix : exit code 0 after ERROR in EAF fixed 21.10.19 : maus.pipe 7.1 : allow OUTFORMAT=eaf for pipes ending with MAUS or ANONYMIZER; this is just a preliminary fix; the next step will be the introduction of annotConv into maus.pipe 30.10.19 : maus.pipe 8.0 : replaced all output conversions by annotConv; stream-lined list of recognized output format descriptions: deprecated emuR, PAR, BPF, textgrid, tg, TG, CSV, EAF mau-append 05.11.19 : maus 5.34 : re-enabled already deprecated option INSORTTEXTGRID and INSKANTEXTGRID after user complaints 07.11.19 : maus 5.35 : added option ADDSEGPROB=false; if set the frame-normalized natural log Viterbi likelihood is appended to the phonetic symbol in the MAU tier (separated by blank) Note that setting this option will break the BPF standard, and must not be used in a pipeline in which the MAUS result is processed further (e.g. PHO2SYL). maus.pipe 8.1 : added option ADDSEGPROB=false for pipes that end on MAUS 11.11.19 : maus.pipe 8.2 : fixed typo in variable name 'OUTPFORMAT'; fixed missing extension handling in KEEP=true. 24.11.19 : maus 5.36 : added HMM '' to German set (was a link to '') 30.11.19 : maus 5.37 : bug fix: if both option INSORTTEXTGRID and INSKANTEXTGRID were set, the results other than TextGrid caused an ERROR. 09.12.19 : maus 5.38 : added tone markers to tha-TH processing; syllable nuclei carrying a tone marker '..._1 - ..._5' are processed as nuclei without markers, but the tone marker is carried over to the MAUS output. 13.12.19 : maus 5.39 : added missing phoneme /@:/ to tha-TH phoneme inventar maus.pipe 8.3 : added a list ASRNONTOKENLANGUAGES that defines languages like tha-TH for which the ASR does not deliver word-tokenized output but rather the complete utterance in one string; for these the following G2P module will pass the 'word' through the usual word tokenization (iform txt).
17.12.19 : maus 5.40 : extended the KANINVENTAR.inv list by the language specific README; fixed missing KANINVENTAR.inv tables for Swiss dialects gsw-CH-**; fixed non-masked numericals in GRAPHINVENTAR/DICT in the languages sampa, ron-RO, ltz-LU and jpn-JP. maus.pipe 8.4 : added jpn-JP to ASRNONTOKENIZEDLANGUAGES, and since Google and Watson behave differently on jpn-JP made the decision to extract txt from the ASR BPF more strict: only if the BPF contains really only one 'word' (= the total utterance), the text is extracted and passed on to G2P as iform txt 23.12.19 : maus 5.41 : added /G/ to phoneme set of ltz-LU 26.12.19 : maus 5.42 : fixed buggy entry in tha-TH KANINVENTAR.inv; fixed OUTSYMBOL=ipa|manner|place for tha-TH tone markers '..._1' etc. are treated in IPA as in SAMPA. 27.12.19 : maus 5.43 : added syllabic variants l= m= n= to LANGUAGEs eng-AU and eng-NZ to be conform with pho2syl service; added syllabic variants l=` to LANGUAGE nor-NO to be conform with pho2syl service. maus.pipe 8.5 : added new option '-embed maus' to all pho2syl_wrapper.pl calls 28.12.19 : maus.pipe 8.6 : enabled the usage of option OUTSYMBOL=sampa|ipa|x-sampa|maus-sampa|arpabet for PIPEs with last service PHO2SYL 01.01.20 : maus.pipe 8.7 : fixed SUBTITLE problem with LANGUAGEs jpn-JP and tha-TH: subtitle texts are now taken from the word-tokenization instead of the input TEXT; this has the disadvantage that subtitles have no punctuation and are possibly in another script than TEXT, but at least we get a usable result.
09.01.20 : maus 5.44 : bug fix : '0' was not masked with 'P' in ltz-LU and swe-SE bug fix: /d_0/ was not in acoustic model of ltz-LU bug fix: the default phonological pronunciation model of ltz-LU prevented the phonemes /s\/ and /z\/ to be passed to the MAU output; switched to default forced alignment 17.01.20 : maus 5.45 : extended tha-TH phoneme set by un-lengthened schwa /@/ maus.pipe 8.8 : bug fix: the extraction of ASR results for tha-TH did not work because for instance Google ASR tokenizes the ASR result between digits and does not put the total result in one string; fixed this by extracting the complete ORT layer from the ASR BPF result file and concatenating this into a txt file which is then passed on to G2P. 25.01.20 : maus.pipe 8.10 : enabled G2P option 'syl=yes'; when set the KAN tier will contain '.' syllable boundaries and G2P outsym maus-sampa is switched to sampa maus 5.47 : enable KAN tier input with syllable markers '.' (which are ignored by MAUS) 03.02.20 : maus 5.47 : added 8 new phoneme symbols (for language Sanzhi Dargwa) to language independent phoneme set 04.02.20 : maus 5.48 : added closure only phonemes t_cl, p_cl and k_cl (clones from Italian) to the tha-TH phoneme set (experimental) 05.02.20 : maus 5.49 : added new phoneme symbol dZ_j to language independent phoneme set 06.02.20 : maus 5.50 : added 'error tone _8' to tha-TH phoneme set 08.02.20 : maus.pipe 8.11 : bug fix : LANGUAGE=sampa was not translated to -lng und in CHUNKPREP module 19.03.20 : maus 5.51 : bug fix in tha-TH : tone variants of schwa /@/ and closure models were missing 25.03.20 : maus.trn 3.10 : disabled pre-screening of chunk lengths; some users requested a marking in the MAU tier rather than a full ERROR.
30.03.20 : maus 5.52 : added language Icelandic isl-IS
           maus 5.53 : adjusted isl-IS phoneme mapping
02.04.20 : maus 5.54 : added phoneme symbol 'q_h' to language independent set; added Icelandic phonemes to language independent set
11.04.20 : maus 5.55 : bug fix: X-SAMPA diacritic 'advanced' /_+/ was removed from KAN input because function words are (sometimes) marked with a trailing '+' in KAN. Now only a trailing '+' without a preceding '_' is removed.
14.04.20 : maus 5.56 : added /4/ to the inventory of eng-AU
15.04.20 : maus 5.57 : eng-AU : mapped /4/ to eng-US /4/
30.04.20 : maus 5.58 : added phoneme symbols /i_?\/ and /x:/ to language independent set
01.05.20 : maus.pipe 8.12 : adapted to new G2P 1.108: -embed maus no longer disables the options syl and stress; added G2P option stress=no
           maus 5.59 : bug fix : special characters in KAN tier ".#'\"+" were not suppressed in blank-separated KAN strings.
10.05.20 : maus 5.60 : changed language specific IPATABLE = KANINVENTAR.inv to PARAM.SAMPA/KANINVENTAR.inv
           maus.pipe 8.14 : bug fix in maus.pipe.MAUS : KAN tiers with stress markers were not translated to IPA correctly
16.05.20 : maus 5.61 : added phoneme symbol /s_>/ (alveolar ejective fricative) to language independent set
18.05.20 : maus 5.62 : changed IPATABLE back to language specific table KANINVENTAR.inv because the language independent table (see version 5.60) led to ambiguous mappings (e.g.
           IPA u: -> SAMPA uu)
19.05.20 : maus.pipe 8.15 : bug fix: using INSYMBOL=ipa and OUTSYMBOL=ipa in parallel led to a mapping ERROR; changed IPATABLE back to language specific table KANINVENTAR.inv
20.05.20 : maus 5.63 : added LANGUAGE=und as alias for LANGUAGE=sampa; the following symbols are ignored in IPA input (KAN) but passed on to the KAN output: ˈˌ.#"'+
21.05.20 : maus.pipe 8.16 : enabled chunker option 'maus'
23.05.20 : maus.pipe 8.17 : enabled option USEREMAIL
30.05.20 : maus.pipe 8.19 : integrated textEnhance service; added new option USEAUDIOENHANCE
31.05.20 : maus.pipe 8.20 : set default for LEFT_BRACKET = "#" because the interface cannot pass "#" as value
04.06.20 : maus.pipe 8.21 : adapted textEnhance call with option '--infile'; fixed icsiarg parser so that value '{}' can be passed to the script
11.06.20 : maus 5.65 : bug fix in MAU IPA output mapping
           maus.pipe 8.22 : bug fix in KAN IPA output mapping
13.06.20 : maus 5.66 : added phoneme symbols S_> and x_> to language independent phoneme set
15.06.20 : maus.pipe 8.23 : modified SUBTITLE module to produce subtitles based on the TRO tier in the BPF input, if a TRO tier is present; this makes sense if another module (e.g. ASR) in the pipe produces original text with punctuation, which would be lost if we created subtitles based only on the ORT tier. Bug fix: textEnhance was applied to non-txt input to G2P
02.07.20 : maus 5.67 : added audioEnhance pre-processing on all input media formats except *.wav
10.07.20 : maus.pipe 8.24 : allow odt doc docx pdf rtf as input formats
16.07.20 : maus.pipe 8.25 : add G2P option 'except=exceptionDictionary'
28.07.20 : maus 5.68 : added check for non-ASCII characters in RULESET file (not allowed: ERROR)
30.07.20 : maus 5.69 : added 51 new X-SAMPA symbols to the Language Independent Set (LANGUAGE=sampa); mostly clicks.
24.08.20 : maus 5.70 : added 3 new X-SAMPA symbols to the Language Independent Set (LANGUAGE=sampa): r\=` ts\: s\:
29.08.20 : maus 5.71 : inserted a leading '%' as comment marker in all lines of KANINVENTAR.inv that are not part of the CSV
29.08.20 : maus.pipe 8.26 : fixed bug : MAUS RULESET was not saved in KEEP dir when MAUS was last service in pipe
31.08.20 : maus 5.72 : fixed KANINVENTAR.inv tables: some lines had trailing TABs
07.09.20 : maus.pipe 8.27 : fixed bug : KEEP=true did not work for PIPE=ASR_... types; TEXTENHANCE output copy in KEEP ZIP had the wrong extension '.wav' (now it is '.txt')
11.09.20 : maus 5.73 : set BPFTHRESHOLD=3000 to test
19.09.20 : maus.pipe 8.28 : bug: non-TXT input was copied to TEXTENHANCE output in KEEP although no textEnhance was used
08.10.20 : maus 5.74 : added three new symbols to language independent set: ou y2 ie (Dolgan language)
15.10.20 : maus 5.75 : added three new geminate symbols to language independent set: J\J\ d`d` g_wg_w
21.10.20 : maus.pipe 8.29 : option BRACKETS={} did not work ('{}' were deleted in option) -> fixed
23.10.20 : maus 5.76 : expanded the SAMPA encoding of geminated consonants (which is ambiguous) to all possible forms to ease the use of Language Independent mode, e.g. ddS, dSdS and d:S are all the same model
24.10.20 : maus.pipe 8.30 : added SUBTITLE option value OUTFORMAT='vtt'
01.11.20 : maus 5.77 : fixed PARAM.SAMPA/mk_set with 'set noglob' command (no bug, just better code!)
05.11.20 : maus 5.78 : added 4 new symbols to language independent set: 1~ 1:~ h~ j~ (Texistepec Popoluca)
11.11.20 : maus.pipe 8.31 : delete '\t' and '\n' from reconstructed transcript from TRO before passing it to subtitle
16.11.20 : maus 5.79 : added 8th column with minimum duration to phoneme table of 'Language Independent' set
18.11.20 : maus 5.80 : added option RELAXMINDURTHREE: like RELAXMINDUR this option causes the HMM models to be set to a lower minimum duration, here 3 states in each model (= 30msec for standard frame rate); note that setting this option might ease the analysis of segment lengths since there is a uniform lower bound of 30msec for each phoneme class, but it will also deteriorate the segmental accuracy of maus, since the constraints for longer segments such as affricates are waived.
19.11.20 : maus 5.81 : bug fix in sqi-AL (Albanian) phoneme set: /c/ was missing
20.11.20 : maus.pipe 8.32 : added new maus option RELAXMINDURTHREE
09.12.20 : maus.pipe 9.0 : added new pipeline 'ASR_SUBTITLE'; this pipeline will only work properly with ASR services that produce a WOR tier (a word alignment)
10.12.20 : maus 5.82 : bug fixes in Language Independent phone set table KANINVENTAR.inv
04.01.21 : maus 5.83 : added symbols s\_h @\ 3\ G\ to Language Independent Set
11.01.21 : maus 5.84 : bug fix : options ADDSEGPROB and OUTSYMBOL were incompatible -> fixed
17.01.21 : maus 5.85 : added symbol r= to Language Independent Set; bug fix in mausbpf2emuR: ids in the *_annot.json had gaps or duplicates (which causes an error when loading the DB)
20.01.21 : maus 5.86 : added symbols l_t m_t n_t to Language Independent Set
04.02.21 : maus 5.87 : bug fix in mausbpf2emuR : links to special levels with zero length segments (PHO,SAM) and gaps were wrong
18.02.21 : maus 5.88 : renewed cloning in language por-PT to current SUPERHMM with preference language spa (was cloned from German)
21.02.21 : maus 5.90 : added language far-IR forced alignment and MINNI; no
           pronunciation model (yet)
24.02.21 : maus 5.91 : added symbol q_w to Language Independent Set
02.03.21 : maus.pipe 9.1 : bug fix in maus.pipe.ASR : runASR was called with corrupt arguments that caused exceed-quota codes not to be passed on
06.03.21 : maus.pipe 9.2 : bug fix : pipes ending with MAUS and SUBTITLE reported an error when called with OUTFORMAT=exb|tei although this works
10.03.21 : maus 5.92 : changed mapping of un-trained geminate phoneme symbols in the Language Independent Set (LANGUAGE=sampa) so that the variants 'x:' and 'xx' always point to the same HMM
23.03.21 : maus 5.93 : re-designed phoneme mapping in eus-ES|FR to SUPERHMM with preferred language 'spa'
24.03.21 : maus 5.94 : re-designed phoneme mapping in mlt-MT to SUPERHMM with preferred language 'ita'
27.03.21 : maus.pipe 9.3 : removed OUTSYMBOL=arpabet,maus-sampa because PHO2SYL never supported these options; renamed OUTSYMBOL=sampa to OUTSYMBOL=x-sampa because the so-called 'maus-sampa' (or 'sampa' in MAUS terms) is basically X-SAMPA with the exception of Norwegian/Icelandic where retroflex consonants can be encoded as 'rX' instead of 'X_r' (MAUS and pho2syl_wrapper are both called with OUTSYMBOL=sampa / -outsym sampa in this case!); this is mainly done to resolve the often misleading option names in pipeline services.
31.03.21 : maus 5.95 : re-designed phoneme mapping in ron-RO to SUPERHMM with preferred language 'ita'
01.04.21 : maus.pipe 9.4 : added option 'UTTERANCELEVEL'
07.04.21 : maus 5.96 : bug when input is CSV : output BPF contains no valid SAM: header entry which leads to errors in AnnotConv conversions -> fixed
16.04.21 : maus 5.97 : added symbol d:` to Language Independent Set
17.04.21 : maus 5.98 : language far-IR changed phonemic symbol /X/ to /x/
02.05.21 : maus 5.99 : added HMMs from language fas-IR to SUPERHMM set
           introduced DEFMODUS to avoid WARNING on LANGUAGEs that have no rule set
06.05.21 : maus 5.100 : added dos2unix/mac2unix conversion of BPF input file
07.05.21 : maus.pipe 9.5 : added (MAUS) MODUS value 'default'
09.05.21 : maus 5.101 : added symbols K\ @\: 8: to Language Independent Set
27.05.21 : maus 5.102 : minor bug fix in an ERROR output regarding IPA to SAMPA mapping of input
07.08.21 : maus 5.103 : added symbols j_0 w_0 ?m ?n ?N ?j ?w to Language Independent Set (824)
22.08.21 : maus 5.104 : added language Taiwanese Min Nan (nan-TW) based on NYJU corpus (Prof. Ho-hsien)
03.09.21 : maus 5.105 : bug fix : ADDSEGPROB=true did not work for USETRN=true, i.e.
           all pipelines that contain a CHUNKER module
13.09.21 : maus 5.106 : added symbols t_d_w J\_w to Language Independent Set (826); updated nan-TW phoneme inventory from Min Nan specific SAMPA to X-SAMPA
30.09.21 : maus 5.107 : added first version of (macro) language Arabic (arb); only forced alignment; added arb MINNI service; both services are based on Jalal Tamimi's spoken Arabic corpora with the Arabic varieties Bahrain, Saudi, Lebanese and Levante Arabic (6530 recordings, 225730 segments); added 29 new Arabic symbols to Language Independent Set (856)
05.10.21 : maus 5.108 : language nan-TW : added 7 Mandarin sounds to phoneme set: ts` ts`_h s` z` ou ei y
02.11.21 : maus 5.109 : added symbol R\ to Language Independent Set (857)
15.11.21 : maus.pipe 9.6 : added language mapping for module PHO2SYL arb -> und; this effectively means that pipes running with LANGUAGE=arb will execute PHO2SYL as 'und' (undefined X-SAMPA)
11.12.21 : maus.pipe 9.8 : bug in UTTERANCELEVEL=true : empty lines in input text caused corrupt BPF
13.12.21 : maus 5.110 : added symbol R_j to Language Independent Set (858)
17.12.21 : maus.pipe 9.9 : integrated 'SpeakDiar' service (backend 'speakDiar', SD)
08.01.22 : maus 5.111 : added 6 Mandarin phone models to nan-TW set to model code switching to Mandarin: @` u@ ts\ ts\_h s\ N
11.01.22 : maus.pipe 9.10 : added N-HANS options of AUDIOENHANCE: NHANS, neg, pos
11.02.22 : maus.pipe 9.11 : added language mapping deu-AT and deu-CH -> deu-DE for all tools (since these are new languages in the ASR module and could therefore be used in an 'ASR_...' pipeline)
14.02.22 : maus.pipe 10.0 : added pipelines 'SD_ASR_...'
25.02.22 : maus.pipe 10.1 : added runASR options speakMatchASR numberSpeakDiar TROSpeakerID
02.03.22 : maus.pipe 10.2 : bug in maus.pipe.SUBTITLE : text extraction from input TRO tier did not replace '\s' by blank etc. but simply deleted them
05.03.22 : maus 5.112 : removed 53 'replacement rules' (con. prob.
           0.9999999) from the default rule set of eng-AU because we now have a corrected G2P component for eng-AU
07.03.22 : maus.pipe 10.3 : pipe 'ASR_SUBTITLE' enforces diarization=true for Google ASR only (all other ASR services deliver word alignment without diarization anyway).
08.03.22 : maus.pipe 10.4 : option selectSpeaker="" (default) now causes 'SD_ASR_...' pipes to process all speakers detected by the SD module; speaker turns are labelled in the TRO tier if TROSpeakerID=true.
15.04.22 : maus 5.113 : added 20 South Saudi speakers to arb acoustic model training set
17.04.22 : maus 5.114 : added 20 South Saudi speakers to MINNI bigram model (509804 phonemes)
20.04.22 : maus 5.115 : added phonemes dZ_w, tS_w and tS_w_< to language independent set
25.04.22 : maus 5.116 : added phoneme X_> to language independent set
09.05.22 : maus 5.117 : added phonemes R_w, ttS_w, t:S_w to language independent set
13.06.22 : maus 5.118 : bug fix : in MODUS=bigram (MINNI mode) the option ADDSEGPROB (experimental) did not work
           changed output format for option ADDSEGPROB=true : the time-normalized Viterbi likelihood is appended to the phoneme symbol with a '|' in between, e.g. 'a:|-78.676543'
17.06.22 : maus 5.119 : bug fix in aus-AU set: the phoneme /U:/ was missing in the set
17.06.22 : maus 5.120 : bug fix: when PRINTINV is called with a language (e.g. hat-HT) that has only a proxy language in MAUS (here: fra-FR), maus issues an error; we add a fake dir + link 'PARAM.hat-HT/KANINVENTAR.inv' that points to the proxy list 'PARAM.fra-FR/KANINVENTAR.inv' and add a note at the top of the README of fra-FR that this set also serves another language. Done for proxy languages: ht-HT,
30.06.22 : maus.pipe 10.5 : bug fix : in SUBTITLE not the original input text was segmented but rather the text processed by TEXTENHANCE, thus in subtitles the changes by BRACKETS etc.
           showed up, which they shouldn't; in maus.pipe.SUBTITLE the original transcript is now passed to SUBTITLE, and the TEXTENHANCE options of subtitle are all deactivated, so that the subtitles look like the original transcript.
17.07.22 : maus 5.121 : increased word threshold from 3000 to 10000 (this has to be done in Chunker/Config/master too!)
23.07.22 : maus.pipe 10.6 : bug fix : pipes ending with SUBTITLE and beginning with ASR did not work because the original transcript could not be found in sub-module maus.pipe.SUBTITLE
03.08.22 : maus 5.122, maus.trn 3.11 : improved WARNING with regard to pipelines with SUBTITLE
14.12.22 : maus 5.123 : added phoneme /pp\/ to the jpn-JP set to allow a buggy G2P output to pass; /pp\/ points to /p\p\/.
25.01.23 : maus 5.124, maus.trn 3.12 : Arabic script containing '\n' (e.g. /G\na/) caused an infinite recursive call of maus.trn (because the '\n' split the TRN entry line into two) -> fixed; built in some extra checks to avoid line splitting in the future; in maus, single well-formed TRN entries are no longer passed back to maus.trn but processed within maus (to avoid possible future infinite recursive loops).
15.05.23 : maus 5.125 : added 4 phone symbols to language independent set: e:~ ks o~_?\ e_?\
27.05.23 : maus.trn 3.13 : no change
           maus.pipe 10.7 : pipes of type 'ASR_G2P_MAUS_...' will set USETRN=true automatically to allow MAUS to use the TRN tiers produced by the ASR module; add ASR option USEWORDASTURN=false; set forkDelayBetweenForks=100msec (was: 1000msec); check pipes 'ASR_G2P_CHUNKER_MAUS_...'
           for an ASR produced TRN tier; if present, issue a WARNING and skip the superfluous CHUNKER module
04.06.23 : maus 5.126 : completely reworked language independent phoneme set: integrated 96 new HMMs from the DoReCo project; re-evaluated all links from X-SAMPA symbols to existing HMMs
14.06.23 : maus 5.127 : users reported problems with the HMM ts.gsw in the deu-DE PARAM set: the model seems to be 'greedy' in that it occupies more signal than usual (several words); replacing ts.gsw and tS.gsw by ts.hun and tS.hun seems to improve results; the reason for this strange behavior is a mapping error in the gsw training materials: recording-initial segments include the - often very long - sil interval from 0sec; this must be fixed.
17.06.23 : maus 5.128 : rebuilt language gsw-CH to remove corrupt models trained on recording-initial segments that included the - often very long - sil interval from 0sec
19.06.23 : maus 5.129 : rebuilt cloned HMM of language mlt-MT
27.06.23 : maus 5.130 : replaced deu-DE /m/ HMM by /m.gsw-CH/ because the original German /m/ HMM had the tendency to spread out into long adjacent silence intervals (that were zero signals!); changed the check of the RULESET file for 'ASCII' to 'ASCII|CSV'; cat-ES : changed default RULESET=CATALAN.nosmooth.prune5.rul to CATALAN.nosmooth.prune50.rul because of unclear errors in the creation of the pronunciation model (word_var)
11.07.23 : maus 5.131, maus.trn 3.14 : reduced size of processible MAUS signal from 10000 to 5000 words (also in Chunker/Config/master.config!)
           because we had problems with a large number of very long processing times (~20h) blocking the server
18.09.23 : maus.pipe 10.8 : adapted to speaker_diarization 3.X : changed options; added language mapping deu-DE-OH -> deu-OH to pipe mapping
16.12.23 : maus 5.132 : added new symbol l\:` to language independent set
28.02.24 : maus 5.133 : bug fix in PARAM.fin-FI : DICT contained references to outdated models T.use and D.use which caused Finnish processing to crash when a T or D was in the input
02.05.24 : maus.pipe 10.9 : bug fix : pipes SD_ASR_... running with an ASR service that does not deliver a WOR tier threw a WARNING: cat ... : cannot find file; the warning did not change the result of the pipe
21.06.24 : maus.pipe 10.10 : bug fix : pipes CHUNKPREP_..._SUBTITLE read the input TextGrid/EAF/CSV as transcript and matched it in SUBTITLE; fixed: now the transcript is taken from either TRO/TRN/ORT
21.01.25 : maus 5.134 : fixed server error: PLOSIVE file was partially linked to Flo's HOME
24.02.25 : maus.pipe 10.11 : removed relative path to 'runASR'; from now on the runASR installed in PATH is used
17.03.25 : maus 6.0 : re-organised development process of BAS Web Services
           maus.pipe 10.12 : new installation method

DEVELOPMENT STATUS TABLE

The following table gives an overview of the current status of the individual languages supported by maus. Support for a language can be of ascending modelling complexity. The gold standard is a full acoustical and pronunciation modelling for MAUS and a bigram model for MINNI; the minimum support is a mapping of a phoneme set to existing phonemes of other languages (SUPERHMM, see HMM/README for details).
lng     trained HMM on:   trained pronunciation on:      trained MINNI bigram on:
afr-ZA  cloned from nld   -                              -
arb     Jalal Tamimi      -                              Jalal Tamimi (509804 phones)
aus-AU  Pan_AUS           -                              Pan_AUS
cat-ES  GLISSANDO News    GLISSANDO News                 GLISSANDO News (292000 phones)
deu-DE  KielCorpus 1      KielCorpus 1                   BASStat (2269063 phones)
ekk-EE  BABEL+PhED        PhED (147925 words)            PhED+BABEL (684952 phones)
eng-GB  AIX-MARSEC        AIX-MARSEC (53682 words)       AIX-MARSEC (206280 phones) -
eng-US  TIMIT             TIMIT (54384 words)            TIMIT (213704 phones)
eng-AU  AUSTALK           AUSTALK (71332 words)          AUSTALK (248790 phones)
eng-NZ  WatsonCorpus      (AUSTALK)                      -
eng-SC  ICE (Scottish)    ICE (Scottish)                 -
eus-ES  SUPERHMM          -                              -
eus-FR  SUPERHMM          -                              -
far-IR  ETH-Zurich        -                              ETH-Zurich (83216 phones)
fin-FI  SUPERHMM          -                              -
fra-FR  Rhapsodie         Rhapsodie                      Rhapsodie (100658 phones)
gsw-CH  TEVIOS etc.       -                              TEVIOS etc. (243795 phones)
hun-HU  BEACorpus         BEACorpus (41986 words)        BEACorpus (211326 phones)
ita-IT  CLIPS-MT-MANUAL   CLIPS-MT-MANUAL (47341 words)  CLIPS-MT-MANUAL (90704 phones)
jpn-JP  CSJ               -                              CSJ (1550398 phones)
kat-GE  ZakhariaCorpus    -                              -
ltz-LU  Kiel Corpus 1     Kiel Corpus 1                  -
mlt-MT  SUPERHMM          -                              -
nan-TW  NYJU corpus       NYJU corpus                    NYJU corpus (454554 phones)
nld-BE  CGN (*.fon)       (German)                       CGN *.pho (274483 phones)
nld-NL  CGN (*.fon)       CGN *.awd - *.pho (71738 w.)   CGN *.pho (293545 phones)
nor-NO  NB Tale Corpus    -                              NB Tale (298372 phones)
pol-PL  CLARIN-PL-STUDIO  CLARIN-PL-STUDIO (305000 w.)   CLARIN-PL-STUDIO (2mio phones)
por-PT  SUPERHMM          -                              -
ron-RO  SUPERHMM          -                              -
rus-RU  INTAS Corpus      -                              INTAS Corpus (53856 phones)
spa-ES  GLISSANDO News    GLISSANDO News                 G2P/GLISSANDO (480000 phones)
sqi-AL  SUPERHMM (hun!)   -                              -
swe-SE  NB Tale Corpus    -                              -
tha-TH  LOTUS             -                              LOTUS

KNOWN BUGS / PROBLEMS / INSTALLATION ISSUES

- On some LINUX systems 'mawk' might be installed instead of GNU awk. Then you will most likely get an error message like:
    awk: run time error: regular expression compile failed (missing operand)
    awk: run time error: regular expression compile failed (missing operand)
  Try installing GNU awk instead.
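  To find out which awk flavour a system actually provides, one can look at the version banner. The following Python sketch is not part of maus; the function names are hypothetical, and it assumes that both GNU awk and mawk answer '-W version' with a banner containing 'GNU Awk' or 'mawk' respectively. Other awk implementations are simply reported as 'unknown'.

```python
import shutil
import subprocess


def classify_awk(banner: str) -> str:
    """Guess the awk flavour from a version banner (heuristic, assumption)."""
    text = banner.lower()
    if "gnu awk" in text:
        return "gawk"
    if "mawk" in text:
        return "mawk"
    return "unknown"


def installed_awk_flavour(command: str = "awk") -> str:
    """Return 'gawk', 'mawk', 'unknown', or 'missing' for the given command."""
    path = shutil.which(command)
    if path is None:
        return "missing"
    # Assumption: '-W version' prints a banner on gawk and mawk.
    result = subprocess.run(
        [path, "-W", "version"],
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,
        timeout=10,
    )
    return classify_awk(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_awk_flavour())
```

  If the result is 'mawk', installing GNU awk as suggested above is the likely remedy.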
- The usage of the SAM-PA symbol '?' for glottal stop will not work in the option KANSTR="? i: g @ n" because the shell can't handle the '?'. However, it works if the input is read from a BPF file with the option BPF=file.par.
- The table of Extended German SAM-PA lists some non-standard symbols %< and %> for uncertain word boundaries which are not recognized by MAUS. The symbols '#' and '' are recognized and modelled as optional silence intervals; the symbols '<' and '>' are recognized and modelled as non-optional silence intervals. Please note that the symbol '#' is also used in the option KANSTR="..." to mark word boundaries.
- The tier KAN in the BPF input file must not contain any 'silence words', that is, words that are entirely encoded as a single optional silence model, e.g.
    KAN: 0
  If you must model such silence words, use the non-optional silence model '' instead.
- If running in 'chunk segmentation' mode (= USETRN=1 and more than one TRN entry in the input BPF), overlapping chunks cannot be processed for TextGrid|emu|EMU output because these formats do not allow segments with negative time.
- If running in 'chunk segmentation' mode (= USETRN=1 and more than one TRN entry in the input BPF) and the TRN tier in the input BPF does not cover the entire KAN tier (= they are not synchronous, e.g. the TRN describes only a subset of words), this will not work for OUTFORMAT=emu|EMU, and the script will terminate with an ERROR; other formats tolerate a partial TRN (covering only a subset of the KAN tier).
- If a phoneme set requires coding of quantities with an intermediate ':', such as Estonian Q III 'C:C', there is a rare bug in the pronunciation module that issues an error, if
  + the rule set contains two rules of the form L,C:C,R>L,X,R and L,C:C,R>L,Y,Z,R where L,R,X,Y,Z are arbitrary symbols, and
  + the input BPF contains the sequence L,C:C,R
  We take care that the rule sets delivered in the MAUS package do not contain any such rules.
- maus reports:
    .../word_var-2.0: Befehl nicht gefunden.
  or
    .../word_var-2.0: command not found
  word_var and also graphvis are very old binaries compiled with ELF 5. Most likely your Linux system is missing one of the following libraries:
    libXm.so.2.0 libm.so.5 libg++.so.27
  Since version 1.20 we replaced the word_var-2.0 binary by a new binary compiled under SuSE 9.0. We also added the sources for word_var in the subdir ./word_var
- You run into problems when creating a TextGrid output. Some 'awk' installations produce float numbers with a comma instead of a dot (e.g. 1,2435 instead of 1.2435). This can be caused by a wrong LC_* or LANG environment variable. Since maus expects awk to print floats with dots, this may cause your problems.
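  As an illustration of the locale issue, the following Python sketch (not part of maus; function names are hypothetical) shows which variable decides the numeric locale under the usual POSIX precedence (LC_ALL, then LC_NUMERIC, then LANG) and builds an environment in which numbers are formatted with a dot. Removing LC_ALL lets the other locale categories fall back to their own LC_* or LANG settings, which is an assumption that may not suit every setup.

```python
import os


def effective_numeric_locale(env):
    """Return the locale name that governs number formatting in env."""
    # POSIX precedence: LC_ALL overrides LC_NUMERIC, which overrides LANG.
    for name in ("LC_ALL", "LC_NUMERIC", "LANG"):
        value = env.get(name)
        if value:  # empty values count as unset
            return value
    return "C"


def with_dot_decimals(env):
    """Return a copy of env in which LC_NUMERIC=C is actually effective."""
    fixed = dict(env)
    fixed.pop("LC_ALL", None)  # otherwise LC_ALL would override LC_NUMERIC
    fixed["LC_NUMERIC"] = "C"
    return fixed


if __name__ == "__main__":
    print("current numeric locale:", effective_numeric_locale(os.environ))
    print("after fix:", effective_numeric_locale(with_dot_decimals(os.environ)))
```

  The returned dictionary can be passed as the env argument of subprocess.run when maus is started from Python; in a shell, setting LC_NUMERIC=C (and unsetting LC_ALL) before the call has the same intent.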
- Rule sets (option RULESET=*.nrul|rul) distinguish between three 'boundary' symbols:
    /</ : utterance start
    />/ : utterance end
    /#/ : utterance medial word boundary (= not the utterance start or end!)
  So, the logical structure of an utterance looks like:
    < wordfirst # ... # wordlast >
  /</ and /#/ may be used in rules; />/ may not (since it is part of the rule syntax).
  E.g. if we want to model the possible insertion of a /?/ before a word-initial /a:/, the phonological rules would look like:
    <-a:-><-?,a:-
    #-a:->#,?,a:-
  In case we want the rule applied only to utterance initial /a:/ we simply use
    <-a:-><-?,a:-
  If we use the utterance final symbol />/ in the same fashion, such as
    -a:->>-a:,n->
  (after an utterance final /a:/ an /n/ may be inserted), this leads to an error (because the first '>' is interpreted as the rule '>' and not the utterance end symbol!).
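  The ambiguity can be made visible with a few lines of Python. This is only a sketch under the assumption stated above, namely that a rule is split at its first '>'; the real rule reader of maus is not shown in this document, and the function name split_rule is hypothetical.

```python
def split_rule(rule: str):
    """Split a rule at the FIRST '>' into (left side, right side)."""
    left, arrow, right = rule.partition(">")
    if not arrow:
        raise ValueError("rule contains no '>' separator: " + rule)
    return left, right


if __name__ == "__main__":
    # Utterance-initial rule: the split is the intended one.
    print(split_rule("<-a:-><-?,a:-"))
    # Utterance-final attempt: the '>' meant as utterance end is taken
    # as the rule separator, so the left side loses its right context.
    print(split_rule("-a:->>-a:,n->"))
```

  For the second rule the left side comes out as '-a:-' without the intended utterance end context, and the right side starts with a stray '>', which matches the error described above.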