End User License Agreements (EULAs): Bavarian Archive for Speech Signals Webservices: Terms of Use for Academic Institutions Date 2018-12-26 The Bavarian Archive for Speech Signals at the Ludwig-Maximilians-Universitaet Munich (BAS) provides free webservices (termed 'webservices' in the following) for members of academic institutions (termed 'user' in the following) subject to the following Terms of Use. The BAS may amend these Terms of Use at any time by posting amended versions on this website. 1. In order to access the webservices, the user must be a member of an academic institution (i.e. can be authenticated via a Shibboleth authentication service provided by the academic home institution of the user; in case your institution does not provide such a service, please obtain a CLARIN IDP account at https://user.clarin.eu/user/register). The BAS reserves the right to grant or deny access to the webservices or to terminate running processes at any time. 2. The results of the webservices may be used for non-profit research purposes only. If you intend to utilize webservice results for commercial purposes or to access webservices from a commercial host, please contact the BAS prior to any usage to obtain a BAS user license. The BAS will apply reverse IP mapping to determine the IP address of hosts calling/accessing the webservices to verify non-profit usage. 3. Uploaded Data The user must be able to present proof that they have the rights to use all data that they upload. The user entitles the BAS to store the uploaded data, to process them, to store all intermediate and final results of the process and to remove data that have been stored during processing. All uploaded material will be deleted automatically after 24 hours.
Uploaded data will not be forwarded to third parties, except in the case of the service 'ASR', which forwards user data to a third-party, commercial webservice provider (see details and EULAS of these third-party providers on the 'ASR' webservice page). The Terms of Use of these third-party providers differ from the Terms of Use of the BAS. The user indemnifies and will not hold the BAS responsible for any claim arising from use of these third party webservices. 4. Results The user agrees to avoid any non-ethical usage of the webservices or of results of webservices. The copyright of the results of the webservices belongs to the user of the webservices. The BAS retains the right to store the results only for the technical purpose of providing the service. Intermediate or end results will not be exploited or reviewed in any way by BAS, and are deleted automatically after 24 hours. 5. Monitoring For monitoring purposes each transaction on the BAS server will be logged internally and by an external non-public monitoring service of the CLARIN consortium. Internal logging information is confidential and will not be released to third parties. External logging information will be anonymized (stripped from personal information) and made accessible to partners of the CLARIN consortium for the purpose of deriving usage statistics regarding the webservices. The user agrees to these monitoring policies. 6. The user will indemnify and will not hold the BAS responsible for any claim arising out of the use of the webservices. 7. Disclaimer: the use of the webservices is at the user's own risk. The webservices are provided on an "as is" basis. The BAS does not provide a warranty of any kind for the webservices. 8. 
Limitation of liability: the BAS will not be liable for any damages resulting from the use of the webservices or the use of results of the webservices; the BAS aims to provide the webservices on a 24/7 basis, but will not be liable for any damages that are caused by non-availability of the webservices for any reason. ------------------ Bavarian Archive for Speech Signals Webservices: Terms of Use for Commercial Institutions Date 2018-12-26 The Bavarian Archive for Speech Signals at the Ludwig-Maximilians-Universitaet Munich (BAS) provides licensed webservices (termed 'webservices' in the following) to commercial institutions (termed 'user' in the following) subject to the following Terms of Use. The BAS may amend these Terms of Use at any time by posting amended versions on this website. 1. In order to access the webservices, the user must obtain a BAS user license for the respective service, except for the service 'ASR' which may not be used by commercial institutions. BAS user licenses are time limited and amount limited (e.g. a maximum of calls per day); the user is obliged to restrict his/her usage within these limits; otherwise the BAS reserves the right to deny access to the webservices or to terminate running processes. The BAS will apply amount monitoring and reverse IP mapping to determine the IP address of hosts calling/accessing the webservices to verify usage within the contracted limits. 2. The results of the licensed webservice may be used for profit purposes except where the results of the service are traded directly to a third party (i.e. the user acts as a retailer or broker). 3. Uploaded Data The user must be able to present proof that he has the rights to use all data that are uploaded. The user entitles the BAS to store the uploaded data, to process them, to store all intermediate and final results of the process and to remove data that have been stored during processing. All uploaded materials will be deleted automatically after 24 hours. 
Uploaded data will not be forwarded to third parties. 4. Results The user agrees to avoid any non-ethical usage of the webservices or of results of webservices. The copyright of the results of the webservices belongs to the user of the webservices. The BAS retains the right to store the results only for the technical purpose of providing the service. Intermediate or end results will not be exploited or reviewed in any way by BAS, and are deleted automatically after 24 hours. 5. Monitoring For monitoring purposes each transaction on the BAS server will be logged internally and by an external non-public monitoring service of the CLARIN consortium. Internal logging information is confidential and will not be released to third parties. External logging information will be anonymized (stripped from personal information) and made accessible to partners of the CLARIN consortium for the purpose of deriving usage statistics regarding the webservices. The user agrees to these monitoring policies. 6. The user will indemnify and will not hold the BAS responsible for any claim arising out of the use of the licensed webservice(s). 7. Disclaimer: the use of the licensed webservice(s) is at the user's own risk. The webservices are provided on an "as is" basis. The BAS does not provide a warranty of any kind for the webservice(s). 8. Limitation of liability: the BAS will not be liable for any damages resulting from the use of the licensed webservice(s) or the use of results of the licensed webservice(s); the BAS aims to provide the webservice(s) on a 24/7 basis, but will not be liable for any damages that are caused by non-availability of the webservice(s) for any reason; the BAS will not be liable for any unauthorized use of the webservice 'ASR'. 
------------------

API of BAS WebService REST Calls
================================

Note about Server Load: to avoid overloading our servers, please consider using the following GET to check the current server load:
https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/getLoadIndicator
This call returns a number (as a string): 0 : low load, 1 : medium load, 2 : full load.
Please do not issue further calls while this call returns 2.

Note about availability: the server is available 24/7 except on Saturdays, when we schedule maintenance cycles; these service cycles are announced three days in advance on the web interface.

help
------------------
Example curl call for this document is:
curl -X GET http://clarin.phonetik.uni-muenchen.de/BASWebServices/services/help

----------------------------------------------------------------
----------------------------------------------------------------
runPipelineWithASR
------------------
Description: This is a service that combines two or more BAS webservices into a processing chain (pipeline) including Automatic Speech Recognition (ASR). Since not every BAS webservice can be combined with another, the service only offers pipelines that make sense for the user. Most pipelines executed by this service can also be executed by calling two or more BAS webservices one after another and passing the output of one service to the next (exceptions are pipelines dealing with speaker diarization, SD). The benefit, however, is that the user data (which can be substantial) will be up- and downloaded only once, and of course that the user does not have to formulate several BAS webservice calls (with matching parameters). The parameter PIPE defines which processing pipeline will be executed; depending on the value of PIPE the service accepts parameters for the BAS webservices which are involved in the pipeline, and which make sense in the context of the pipeline.
Other parameters will be set automatically depending on the value of PIPE (e.g. the MAUS parameter USETRN will be set to 'true' in the case of a pipeline where the runChunkPreparation service passes a BPF file to the runMAUS service containing a chunk segmentation in the TRN tier).

Since this service basically comprises all BAS web services, the number of possible parameters is necessarily huge. To make the selection easier we group the parameters into MANDATORY parameters (that have to be set for every pipeline), optional parameters that are shared by more than one service, and then parameters by PIPELINE ELEMENT (e.g. ASR, MAUS, in alphabetical order). In most cases it is sufficient to set the MANDATORY parameters, and the PipelineWithASR service will then set the element specific parameters automatically. The service performs a pre-check on all set parameters to detect conflicts and, if it finds any, terminates with an informative message; but there are still many cases where the pipeline will start working and then terminate with an error caused by a service later down the pipe.

Starting with version 6.0 the service will deliver a ZIP archive instead of the output of the last service in PIPE, if the option 'KEEP' ('Keep everything') is enabled; this ZIP will contain the input(s), all intermediary results, the end result and a protocol of the pipeline process.

This service is experimental and can be terminated at any time without warning. It is restricted to academic use only; therefore this service cannot be called as a RESTful service like other BAS services, and the Web API to this service is protected by AAI Shibboleth authentication.
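The note about server load at the top of this API section asks clients to check getLoadIndicator before issuing calls. A minimal Python sketch of that check follows. The function names, the timeout, and the assumption that the response body is just the digit (possibly surrounded by whitespace) are illustrative and not taken from the BAS documentation.

```python
import urllib.request

# URL as given in the note about server load.
LOAD_URL = ("https://clarin.phonetik.uni-muenchen.de/"
            "BASWebServices/services/getLoadIndicator")

# Meaning of the returned number, as documented: 0 low, 1 medium, 2 full.
LOAD_LEVELS = {"0": "low", "1": "medium", "2": "full"}


def interpret_load(body: str) -> str:
    """Map the returned string to 'low', 'medium' or 'full'."""
    key = body.strip()  # assumption: the body may carry a trailing newline
    if key not in LOAD_LEVELS:
        raise ValueError(f"unexpected load indicator: {body!r}")
    return LOAD_LEVELS[key]


def may_submit(body: str) -> bool:
    """Return False when the server reports full load (2)."""
    return interpret_load(body) != "full"


def fetch_load(url: str = LOAD_URL, timeout: float = 10.0) -> str:
    """Hypothetical helper: GET the load indicator and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A client would call may_submit(fetch_load()) and postpone its request when the result is False.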
Example curl call is:
curl -v -X POST -H 'content-type: multipart/form-data' -F com=yes -F INSKANTEXTGRID=true -F selectSpeaker= -F USETEXTENHANCE=true -F TARGETRATE=100000 -F TEXT=@ -F NOISE=0 -F PIPE= -F aligner=hirschberg -F ACCESSCODE= -F NOISEPROFILE=0 -F neg=@ -F speakMatch= -F speakNumber=0 -F ASIGNAL=brownNoise -F NORM=true -F mauschunking=false -F minSpeakNumber=0 -F INSORTTEXTGRID=true -F WEIGHT=default -F minanchorlength=3 -F TROSpeakerID=true -F LANGUAGE=deu-DE -F NHANS=none -F USEAUDIOENHANCE=true -F speakMatchASR= -F maxlength=0 -F KEEP=false -F 'LEFT_BRACKET=#' -F nrm=no -F LOWF=0 -F WHITESPACE_REPLACEMENT=_ -F CHANNELSELECT= -F marker=punct -F USEREMAIL= -F boost=true -F except=@ -F MINPAUSLEN=5 -F forcechunking=false -F NOINITIALFINALSILENCE=false -F InputTierName=unknown --form-string 'BRACKETS=<>' -F OUTFORMAT=TextGrid -F syl=no -F ENDWORD=999999 -F TROSpeakerIDASR=false -F wsync=yes -F UTTERANCELEVEL=false -F featset=standard -F pos=@ -F APHONE= -F INSPROB=0.0 -F OUTSYMBOL=x-sampa -F RULESET=@ -F maxSpeakNumber=0 -F USEWORDASTURN=false -F allowOverlaps=false -F minchunkduration=15 -F SIGNAL=@ -F stress=no -F imap=@ -F MODUS=default -F RELAXMINDUR=false -F ATERMS=@ -F numberSpeakDiar=0 -F RELAXMINDURTHREE=false -F STARTWORD=0 -F INSYMBOL=sampa -F PRESEG=false -F AWORD=ANONYMIZED -F USETRN=false -F ASRType=autoSelect -F MAUSSHIFT=default -F diarization=false -F HIGHF=0 -F silenceonly=0 -F boost_minanchorlength=4 -F ADDSEGPROB=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPipelineWithASR'

Note on quoting: the value '<>' is passed with --form-string and quoted, because an unquoted <> is a shell redirection and curl -F treats a leading < in a value as 'read content from file'.

Parameters: [com] [INSKANTEXTGRID] [selectSpeaker] [USETEXTENHANCE] [TARGETRATE] [TEXT] [NOISE] [PIPE] [aligner] [ACCESSCODE] [NOISEPROFILE] [neg] [speakMatch] [speakNumber] [ASIGNAL] [NORM] [mauschunking] [minSpeakNumber] [INSORTTEXTGRID] [WEIGHT] [minanchorlength] [TROSpeakerID] [LANGUAGE] [NHANS] [USEAUDIOENHANCE] [speakMatchASR] [maxlength] [KEEP] [LEFT_BRACKET] [nrm] [LOWF] [WHITESPACE_REPLACEMENT] [CHANNELSELECT] [marker] [USEREMAIL] [boost]
[except] [MINPAUSLEN] [forcechunking] [NOINITIALFINALSILENCE] [InputTierName] [BRACKETS] [OUTFORMAT] [syl] [ENDWORD] [TROSpeakerIDASR] [wsync] [UTTERANCELEVEL] [featset] [pos] [APHONE] [INSPROB] [OUTSYMBOL] [RULESET] [maxSpeakNumber] [USEWORDASTURN] [allowOverlaps] [minchunkduration] SIGNAL [stress] [imap] [MODUS] [RELAXMINDUR] [ATERMS] [numberSpeakDiar] [RELAXMINDURTHREE] [STARTWORD] [INSYMBOL] [PRESEG] [AWORD] [USETRN] [ASRType] [MAUSSHIFT] [diarization] [HIGHF] [silenceonly] [boost_minanchorlength] [ADDSEGPROB]

Parameter description:

com: [yes, no] Option com (Keep Annotation): yes/no decision whether <*> strings in text inputs should be treated as annotation markers (yes) or as spoken words (no). If set to 'yes', strings of this type are considered annotation markers that are not processed as spoken words but passed on to the output. The <*> markers will appear in the ORT and KAN tiers with a word index of their own. WebMAUS makes use of two special markers, <usb> (e.g. non-understandable word or other human noises) and <nib> (non-human noise). All other <*> markers are modelled as silence. Markers must be separated from word tokens by blanks; they do not need to be blank-separated from non-word tokens such as punctuation. Note that the default service 'TEXTENHANCE', which is called by any pipeline that reads input text, will replace white space characters (such as blanks) within the <*> by the character given in option 'White space replacement'.

INSKANTEXTGRID: [true, false] Option INSKANTEXTGRID: Switch to create an additional tier in the TextGrid output file with a word segmentation labelled with the canonic phonemic transcript (taken from the input KAN tier).

selectSpeaker: Option selectSpeaker ('Speaker processed by pipeline'): the rest of the pipeline processes only the speech segments labelled with the speaker name given in this option.
Note that the name must match the standard speaker labels 'S1', 'S2' etc., or - if the option 'speakMatch' ('Speaker label mapping') is used - it must match the assigned speaker name instead. Example: if 'speakMatch' is set to 'Ann,Tom' and you want to process only the speech of the second appearing speaker (Tom) in the pipeline, set 'selectSpeaker=Tom'; if 'speakMatch' is not set, set 'selectSpeaker=S2'. If the option is not set or set to the empty string, all speakers are processed by the pipe.

USETEXTENHANCE: [true, false] Switch on the input text pre-processing 'textEnhance' (true). If the PIPE starts with G2P, the input text is first normalized by 'textEnhance'. Different TXT formats are mapped to a simple UTF-8 Unix style TXT format, and text markers are normalized to conform to BAS WebServices.

TARGETRATE: [100000, 20000, 10000] Option TARGETRATE: the resolution of segment boundaries in the output, measured in 100nsec units (default 100000 = 10msec). Decreasing this value (min is 10000) increases computation time and does not increase segmental accuracy on average, but allows output segment boundaries to assume more possible values (by default segment boundaries are quantized in 10msec steps). This is useful if MAUS results are analysed for the duration of phones or syllables.

TEXT: Optional parameter TEXT: The textual input to the pipeline, usually some form of text or transcript. Depending on parameter PIPE this can be a text document (all formats that service runTextEnhance supports), a comma separated spreadsheet (csv), a praat TextGrid (TextGrid), an ELAN EAF (eaf), or a BAS Partitur Format (par) file. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF. Note that PIPEs starting with service ASR or MINNI do not require this parameter.
Special languages for text input: Thai, Russian and Georgian expect their respective standard alphabets; Japanese allows Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; Swiss German expects input to be transcribed in 'Dieth' (https://en.wikipedia.org/wiki/Swiss_German); Australian Aboriginal languages (including Kunwinjku, Yolnu Matha) expect so-called 'Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); Persian accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details).

NOISE: [0.0, 100.0] Option NOISE: if set to a value between 1...100, a noise profile is calculated from the leading and/or trailing parts of the input signal, and then the signal is noise reduced with a strength proportional to the NOISE value (using the SoX spectral noise reduction effect 'noisered'). The noise reduction is applied before any other processing/merging in all input channels. If NOISE=0, no noise reduction takes place.
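The TEXT description above states that PIPEs starting with service ASR or MINNI do not require a TEXT input. The following Python sketch shows one way a client could split a PIPE value into its services and decide whether to upload TEXT. The function names are hypothetical, the service names are those listed in the PIPE description, and the handling of 'SD_ASR_...' pipes (skipping the leading SD) is an assumption based on the special-pipelines description, not a documented rule. The check only validates the general form of the value, not membership in the list of allowed PIPE values.

```python
from typing import List

# Service names as listed in the PIPE description.
KNOWN_SERVICES = {"ASR", "G2P", "MAUS", "CHUNKER", "CHUNKPREP",
                  "PHO2SYL", "MINNI", "SUBTITLE", "ANONYMIZER", "SD"}

# First services for which the TEXT description says no TEXT is required.
NO_TEXT_FIRST = {"ASR", "MINNI"}


def split_pipe(pipe: str) -> List[str]:
    """Split a PIPE value of the form SERVICE_SERVICE[_SERVICE ...]."""
    services = pipe.split("_")
    unknown = [s for s in services if s not in KNOWN_SERVICES]
    if len(services) < 2 or unknown:
        raise ValueError(f"not a SERVICE_SERVICE[_SERVICE ...] value: {pipe!r}")
    return services


def text_required(pipe: str) -> bool:
    """Return True if the pipe is expected to need a TEXT input."""
    services = split_pipe(pipe)
    first = services[0]
    if first == "SD":
        # Assumption: in 'SD_ASR_...' pipes the diarization runs on the
        # signal only, so the following service decides about TEXT.
        first = services[1]
    return first not in NO_TEXT_FIRST
```

For example, text_required("G2P_MAUS") is True, while text_required("ASR_G2P_MAUS") is False.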
PIPE: [ASR_G2P_CHUNKER, ASR_SUBTITLE, G2P_CHUNKER, MINNI_PHO2SYL, ASR_G2P_CHUNKER_MAUS, ASR_G2P_CHUNKER_MAUS_SD, ASR_G2P_CHUNKER_MAUS_PHO2SYL, ASR_G2P_CHUNKER_MAUS_PHO2SYL_SD, ASR_G2P_CHUNKER_MAUS_SUBTITLE, ASR_G2P_CHUNKER_MAUS_SUBTITLE_SD, ASR_G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL, ASR_G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_SD, ASR_G2P_MAUS, ASR_G2P_MAUS_SD, ASR_G2P_MAUS_PHO2SYL, ASR_G2P_MAUS_PHO2SYL_SD, ASR_G2P_MAUS_SUBTITLE, ASR_G2P_MAUS_SUBTITLE_SD, ASR_G2P_MAUS_SUBTITLE_PHO2SYL, ASR_G2P_MAUS_SUBTITLE_PHO2SYL_SD, CHUNKER_MAUS, CHUNKER_MAUS_SD, CHUNKER_MAUS_PHO2SYL, CHUNKER_MAUS_PHO2SYL_SD, CHUNKER_MAUS_SUBTITLE, CHUNKER_MAUS_SUBTITLE_SD, CHUNKER_MAUS_SUBTITLE_PHO2SYL, CHUNKER_MAUS_SUBTITLE_PHO2SYL_SD, CHUNKPREP_G2P_MAUS, CHUNKPREP_G2P_MAUS_SD, CHUNKPREP_G2P_MAUS_PHO2SYL, CHUNKPREP_G2P_MAUS_PHO2SYL_SD, CHUNKPREP_G2P_MAUS_SUBTITLE, CHUNKPREP_G2P_MAUS_SUBTITLE_SD, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_SD, SD_ASR_G2P_MAUS, SD_ASR_G2P_MAUS_PHO2SYL, SD_ASR_G2P_MAUS_SUBTITLE, SD_ASR_G2P_MAUS_SUBTITLE_PHO2SYL, G2P_CHUNKER_MAUS, G2P_CHUNKER_MAUS_SD, G2P_CHUNKER_MAUS_PHO2SYL, G2P_CHUNKER_MAUS_PHO2SYL_SD, G2P_CHUNKER_MAUS_SUBTITLE, G2P_CHUNKER_MAUS_SUBTITLE_SD, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_SD, G2P_MAUS, G2P_MAUS_SD, G2P_MAUS_PHO2SYL, G2P_MAUS_PHO2SYL_SD, G2P_MAUS_SUBTITLE, G2P_MAUS_SUBTITLE_SD, G2P_MAUS_SUBTITLE_PHO2SYL, G2P_MAUS_SUBTITLE_PHO2SYL_SD, MAUS_PHO2SYL, MAUS_PHO2SYL_SD, MAUS_SUBTITLE, MAUS_SUBTITLE_SD, MAUS_SUBTITLE_PHO2SYL, MAUS_SUBTITLE_PHO2SYL_SD, ASR_G2P_CHUNKER_MAUS_ANONYMIZER, ASR_G2P_CHUNKER_MAUS_ANONYMIZER_SD, ASR_G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER, ASR_G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER_SD, ASR_G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE, ASR_G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE_SD, ASR_G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, ASR_G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, ASR_G2P_MAUS_ANONYMIZER, ASR_G2P_MAUS_ANONYMIZER_SD, ASR_G2P_MAUS_PHO2SYL_ANONYMIZER, 
ASR_G2P_MAUS_PHO2SYL_ANONYMIZER_SD, ASR_G2P_MAUS_ANONYMIZER_SUBTITLE, ASR_G2P_MAUS_ANONYMIZER_SUBTITLE_SD, ASR_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, ASR_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, CHUNKER_MAUS_ANONYMIZER, CHUNKER_MAUS_ANONYMIZER_SD, CHUNKER_MAUS_PHO2SYL_ANONYMIZER, CHUNKER_MAUS_PHO2SYL_ANONYMIZER_SD, CHUNKER_MAUS_ANONYMIZER_SUBTITLE, CHUNKER_MAUS_ANONYMIZER_SUBTITLE_SD, CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_ANONYMIZER, CHUNKPREP_G2P_MAUS_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_PHO2SYL_ANONYMIZER, CHUNKPREP_G2P_MAUS_PHO2SYL_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_ANONYMIZER_SUBTITLE, CHUNKPREP_G2P_MAUS_ANONYMIZER_SUBTITLE_SD, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, SD_ASR_G2P_MAUS_ANONYMIZER, SD_ASR_G2P_MAUS_PHO2SYL_ANONYMIZER, SD_ASR_G2P_MAUS_ANONYMIZER_SUBTITLE, SD_ASR_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, G2P_CHUNKER_MAUS_ANONYMIZER, G2P_CHUNKER_MAUS_ANONYMIZER_SD, G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER, G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER_SD, G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE, G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE_SD, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, G2P_MAUS_ANONYMIZER, G2P_MAUS_ANONYMIZER_SD, G2P_MAUS_PHO2SYL_ANONYMIZER, G2P_MAUS_PHO2SYL_ANONYMIZER_SD, G2P_MAUS_ANONYMIZER_SUBTITLE, G2P_MAUS_ANONYMIZER_SUBTITLE_SD, G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, MAUS_ANONYMIZER, MAUS_ANONYMIZER_SD, MAUS_PHO2SYL_ANONYMIZER, MAUS_PHO2SYL_ANONYMIZER_SD, MAUS_ANONYMIZER_SUBTITLE, MAUS_ANONYMIZER_SUBTITLE_SD, MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD] Parameter PIPE: The type of pipeline to process. Values of parameter PIPE have the general form SERVICE_SERVICE[_SERVICE ...], where SERVICE is one of ASR, G2P, MAUS, CHUNKER, CHUNKPREP, PHO2SYL, MINNI, SUBTITLE, ANONYMIZER, SD. 
For example PIPE=G2P_CHUNKER_MAUS_PHO2SYL_SD denotes a pipe that runs over these 5 services. The first SERVICE in the PIPE value determines whether both inputs, SIGNAL and TEXT, are necessary or only a SIGNAL; the last SERVICE in PIPE determines which output the pipeline can produce. Therefore it is quite possible to call a pipe with an impossible input/output configuration, which will cause an ERROR. Every media file uploaded will first be passed through the service 'AudioEnhance' to normalize the media file to a RIFF WAVE format file; every text input is first run through the service 'TextEnhance' to normalize the text format; both of these obligatory services have options, just like the other pipeline SERVICES.

Special pipelines: There are some pipes that do more than simply chaining the services and piping the output of a module as input into the next module:
1. Pipes that end on "..._SD": The final speaker diarization module (SD) does not actually read any annotations from the previous services; it rather runs the speaker diarization in parallel on the signal input and then merges the speaker segmentation and labelling with whatever the rest of the pipe has produced, e.g. it merges speaker segments and word segments to produce a (symbolic) speaker labelling of the word segments.
2. Pipes that start with "SD_ASR_...": First a speaker diarization is run on the input signal; then only the speaker segments (optionally filtered by option 'selectSpeaker') are passed to the ASR module; all results (one per speaker segment) of ASR are summarized into a single BPF file with tiers ORT,TRO (from ASR) and TRN,SPD (from SD) and then passed on through the rest of the pipe, which treats this exactly like a chunk segmentation as produced by module CHUNKPREP.

aligner: [hirschberg, fast] Symbolic aligner to be used. The "fast" aligner performs approximate alignment by splitting the alignment matrix into "windows" of size 5000*5000. The "hirschberg" aligner performs optimal matching.
On recordings below the 1 hour mark, the choice of aligner does not make a big difference in runtime. On longer recordings, you can improve runtime by selecting the "fast" aligner. Note however that this choice increases the probability of errors on recordings with untranscribed stretches (such as long pauses, musical interludes, untranscribed speech). Therefore, the "hirschberg" aligner should be used on this kind of material.

ACCESSCODE: Exceed quota code (ACCESSCODE): special code a user has acquired to override default quotas. Not needed for normal operation.

NOISEPROFILE: [-1000000.0, 1000000.0] Option NOISEPROFILE: if set to 0 (default), the noise profile is calculated from the leading and trailing portion of the recording (estimated by a silence detector); if set to a positive value, the noise profile is calculated from the leading NOISEPROFILE samples; if set to a negative value, the noise profile is calculated from the trailing NOISEPROFILE samples. This is useful if the recording contains loud noise at the beginning/end that would not be selected by the silence detector (because of too much energy).

neg: Option neg: N-HANS sample recording (RIFF WAVE *.wav) of the noise to be removed from the signal (mode 'denoiser') or of the speaker/speaker group to be removed from the signal (mode 'separator'). The 'neg' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the 'pos' noise sample. The upload of the 'neg' sample is mandatory for both N-HANS modes (see option 'NHANS').

speakMatch: Option speakMatch ('Speaker label mapping'): if set to a list of comma separated names (e.g. speakMatch='Anton,Berta,Charlie'), the corresponding speaker labels found by the speaker diarization in the order of appearance are replaced by these names (e.g.
'S1' to 'Anton', 'S2' to 'Berta' etc.). This allows the user to create SD annotation using self-defined speaker labels, if the user knows the order of appearance; this feature only makes sense in single file processing, since the speaker labels and the order of appearance differ from one recording to the next; the suggested mode of operation is to run the service in batch mode over all recordings with speakMatch="", then manually inspect the resulting annotation and define speaker labels in the order of appearance for each recording, and then run the service in single file mode for each recording again with the corresponding speakMatch list. If the speakMatch option contains a comma separated list of value pairs like 'S1:Anton', only the speaker labels listed on the left-hand side of each pair are patched, e.g. for speakMatch='S3:Charlie,S6:Florian' only the third and sixth appearing speakers are renamed to Charlie and Florian respectively.

speakNumber: [0.0, 999999.0] Option speakNumber restricts the number of speakers detected by the speaker diarization to the given number. If set to 0 (default), the SD method determines the number automatically.

ASIGNAL: [brownNoise, beep, silence] Option ASIGNAL: the type of signal used to mask anonymized terms in the signal. 'brownNoise' is brown noise; 'beep' is a 500Hz sine tone; 'silence' is total silence (zero signal); masking signals have an amplitude of -10dB of the maximum amplitude and are faded in and out with a very short sinusoidal function.

NORM: [true, false] Option NORM: if true (selected), each input channel is amplitude normalised to -3dB before any merge.

mauschunking: [true, false] If this parameter is set to true, the recognition module will model words as MAUS graphs as opposed to canonical chains of phonemes. This will slow down the recognition engine, but it may help with non-canonical speech (e.g., accents or dialects).
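The two speakMatch forms described above (a plain list of names in order of appearance, or 'label:name' pairs) can be illustrated with a small Python sketch. This is only a client-side illustration of the documented renaming rule; the function names are hypothetical, and rejecting a mix of both forms is an assumption, since the text does not say how the service treats such input.

```python
from typing import Dict


def parse_speak_match(spec: str) -> Dict[str, str]:
    """Turn a speakMatch value into a mapping from 'S1', 'S2', ... to names."""
    spec = spec.strip()
    if not spec:
        return {}  # speakMatch="" leaves the standard labels untouched
    items = [item.strip() for item in spec.split(",")]
    has_pair = [":" in item for item in items]
    mapping: Dict[str, str] = {}
    if all(has_pair):
        # Pair form, e.g. 'S3:Charlie,S6:Florian': only listed labels change.
        for item in items:
            label, name = item.split(":", 1)
            mapping[label.strip()] = name.strip()
    elif any(has_pair):
        # Assumption: mixed input is not handled in this sketch.
        raise ValueError("mixing plain names and label:name pairs")
    else:
        # Plain form, e.g. 'Anton,Berta,Charlie': order of appearance.
        for index, name in enumerate(items, start=1):
            mapping[f"S{index}"] = name
    return mapping


def rename(label: str, mapping: Dict[str, str]) -> str:
    """Return the mapped name for a speaker label, or the label itself."""
    return mapping.get(label, label)
```

With speakMatch='S3:Charlie,S6:Florian', rename("S3", ...) gives 'Charlie' while rename("S2", ...) stays 'S2'.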
minSpeakNumber: [0.0, 999999.0] Option minSpeakNumber defines a hard lower bound on the number of detected speakers. If set to 0 (default), there is no lower bound.

INSORTTEXTGRID: [true, false] Option INSORTTEXTGRID: Switch to create an additional tier ORT in the TextGrid output file with a word segmentation labelled with the orthographic transcript (taken from the input ORT tier); this option is only effective if the input BPF contains an additional ORT tier.

WEIGHT: MAUS pipeline: The option WEIGHT weights the influence of the statistical pronunciation model against the acoustical scores. More precisely, WEIGHT is multiplied with the pronunciation model score (log likelihood) before the score is added to the acoustical score within the search. Since the pronunciation model in most cases favors the canonical pronunciation, increasing WEIGHT will at some point cause MAUS to always choose the canonical pronunciation; lower values of WEIGHT will favor the selection of less probable paths according to acoustic evidence. If the acoustic quality of the signal is very good and the HMMs of the language are well trained, it makes sense to lower WEIGHT. For most languages this option defaults to 1.0. In an evaluation on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set), WEIGHT was optimized to 7.0. Note that this might NOT be the optimal value for other languages. For instance, Italian shows best results with WEIGHT=1.0, Estonian with WEIGHT=2.5. If set to default, a language specific optimal value is chosen automatically.
MINNI pipeline: The option WEIGHT weights the influence of the statistical phonotactic bigram model (the a-priori probability of pronunciation) against the acoustical scores. More precisely, WEIGHT is multiplied with the phonotactic model score (log likelihood) before the score is added to the acoustical score within the Viterbi search.
Since MINNI uses a phonotactic bigram model, increasing WEIGHT will at some point cause MINNI to always choose the same most likely sequence of phones according to the bigram model (disregarding the acoustics) with equally long segments, i.e. no meaningful segmentation at all; lower values of WEIGHT will cause phoneme sequences to be detected according to acoustic evidence, even if the resulting pronunciation is less likely according to the phonotactic bigram model; if WEIGHT is set to 0.0, the bigram is completely ignored and MINNI performs a phone recognition based only on acoustic likelihood (and any sequence of phones is a-priori equally probable). If the acoustic quality of the signal is very good and the HMMs of the language are well trained, it makes sense to lower WEIGHT to achieve more precise results given the acoustics. For most languages this option defaults to 1.0 (which means that acoustic evidence and a-priori pronunciation probability are treated equally).

minanchorlength: [2.0, 8.0] The chunker performs speech recognition and symbolic alignment to find regions of correctly aligned words (so-called 'anchors'). Setting this parameter to a high value (e.g. 4-5) means that the chunker finds chunk boundaries with higher certainty. However, the total number of discovered chunk boundaries may be reduced as a consequence. A low value (e.g. 2) is likely to lead to a more fine-grained chunking result, but with lower confidence for individual chunk boundaries.

TROSpeakerID: [true, false] If set to true (default: false), in pipes 'SD_ASR_...' speaker ID labels of the form ' ' will be inserted in the TRO tier before words that start a new speaker turn of the speaker labelled 'XXX'. The inserted speaker label 'XXX' is either one of the standardized labels 'S1', 'S2', ... or a mapped speaker label taken from the option 'speakMatch'.
The service also checks each word preceding a speaker turn change (the last word of the previous turn) and adds a trailing '.' if the word does not already have a trailing final punctuation sign (one of '!?.:...'). This option enables pipelines that start with 'ASR' and end with 'SUBTITLE' to create subtitle tracks (e.g. WebVTT) that show the speaker ID and start a new subtitle at each speaker turn change.

LANGUAGE: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, eng-CA, eng-GH, eng-IN, eng-IE, eng-KE, eng-NG, eng-PH, eng-ZA, eng-TZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-AT, deu-CH, deu-DE, deu-DE-OH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, spa-AR, spa-BO, spa-CL, spa-CO, spa-CR, spa-DO, spa-EC, spa-SV, spa-GT, spa-HN, spa-MX, spa-NI, spa-PA, spa-PY, spa-PE, spa-PR, spa-US, spa-UY, spa-VE, swe-SE, tha-TH, guf-AU] Language: RFC5646 locale code of the processed speech; defines the phoneme set of the input and the orthographic system of the input text (if any); we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; the code 'sampa' ('Language independent') allows the user to upload a customized mapping from orthographic to phonologic form (see option 'imap').
Special languages: 'gsw-CH' denotes text written in Swiss German 'Dieth' transcription (https://en.wikipedia.org/wiki/Swiss_German); 'gsw-CH-*' are localized varieties in larger Swiss cities; 'jpn-JP' (Japanese) accepts Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; 'aus-AU' (Australian Aboriginal languages, including Kunwinjku, Yolnu Matha) accepts the so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); 'fas-IR' (Persian) accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details); 'arb' is a macro language covering all Arabic varieties; the input must be encoded in a broad phonetic romanization developed by Jalal Tamimi and colleagues (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/TamimiRomanization.pdf for details). The language code is passed to all services of the pipeline, thus influencing the way these services will process the speech. If one member of the PIPE does not support the language, the service will try to determine another suitable language (a WARNING is issued) or, if that is not possible, an ERROR is returned. Note that some services will support more languages than offered in the pipeline service, but we restrict the pipeline languages to a reasonable core set that is supported by most services. NHANS: [none, denoiser, separator] Option NHANS: the N-HANS audio enhancement mode (default: 'none') applied to the result of the SoX pipeline. 'denoiser' : the noise as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal; if another voice or noise sample is uploaded in option file 'pos' (optional), this noise/voice is preserved in the signal together with the main voice. 
'separator' : an interference speaker or speaker group as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal while the voice of a target speaker as uploaded in the mandatory option file 'pos' is preserved in the signal. Both sample signals, 'neg' and 'pos', are applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the other noise sample. USEAUDIOENHANCE: [true, false] Switch on the signal normalization 'AudioEnhance' (true). speakMatchASR: Option speakMatchASR: if set to a list of comma separated names (e.g. speakMatch='Anton,Berta,Charlie'), the corresponding speaker labels found by the speaker diarization in the order of appearance are replaced by these names (e.g. 'S1' to 'Anton', 'S2' to 'Berta' etc.). This allows the user to create SD annotation using self-defined speaker labels, if the user knows the order of appearance; this feature only makes sense in single file processing, since the speaker labels and the order of appearance differ from one recording to the next; the suggested mode of operation is to run the service in batch mode over all recordings with speakMatch="", then inspect manually the resulting annotation and define speaker labels in the order of appearance for each recording, and then run the service in single file mode for each recording again with the corresponding speakMatch list. If the speakMatch option contains a comma separated list of value pairs like 'S1:Anton', only the speaker labels listed on the left-hand side of each pair are patched, e.g. for speakMatch='S3:Charlie,S6:Florian' only the third and sixth appearing speaker are renamed to Charlie and Florian respectively. maxlength: [0.0, 999.0] Maximum subtitle length. 
If set to 0, subtitles of indefinite length are created, based only on the distance of the split markers. If set to a value greater than 0, subtitles are split whenever a stretch between two neighbouring split markers is longer than that value (in words). Caution: This may lead to subtitle splits in suboptimal locations (e.g. inside syntactic phrases). KEEP: [true, false] Keep everything (KEEP): If set to true (default: false), the service will return a ZIP archive instead of the output of the last service in PIPE. The ZIP is named as the output file name (as defined in OUT) with extension zip and contains the following files: input(s) including optional files (e.g. RULESET), all intermediary results of the PIPE, the result of the pipeline, and a protocol listing all options; all stored files in the ZIP start with the file name body of the SIGNAL input followed by the marker '_LABEL', which indicates from which part of the pipe the file is produced, and the appropriate file type extension; 'LABEL' is one of INPUT, AUDIOENHANCE (which marks the pre-processed media file), ASR, CHUNKER, CHUNKPREP, G2P, MAUS, PHO2SYL, ANONYMIZER, SUBTITLE and README (which marks the protocol file). The protocol file contains a simple list of 'option = value' pairs. The result file(s) of the pipeline have no '_LABEL' marker. The KEEP option is useful for documenting scientific pipeline runs, and for retrieving results that are produced by the PIPE but are overwritten/not passed on by later services (e.g. an anonymized video or CHUNKER output). LEFT_BRACKET: One or more characters which mark comments reaching until the end of the line (default: #). E.g. if your input text contains comment lines that begin with ';', set this option to ';' to avoid that these comments are treated as spoken text. If you want to suppress the default '#' comment character, set this option to 'NONE'. 
If you are using comment lines in your input text, you must be absolutely sure that the comment character appears nowhere in the text except in comment lines! Note 1: the characters '&', '|' and '=' do not work as comment characters. Note 2: for technical reasons the value for this option cannot be empty. Note 3: the default character '#' cannot be combined with other characters, e.g. if you define this option as ';#', the '#' will be ignored. Note 4 (sorry): for the service 'Subtitle' comment lines must be terminated with a so called 'final punctuation sign', i.e. one of '.!?:…'; otherwise, an immediately following speaker marker will not be recognized. nrm: [yes, no] Text normalization. Currently available for German and English only. Detects and expands 22 non-standard word types. All output file types are supported, but normalization is not available for the following tokenized input types: bpf, TextGrid, and tcf. If switched off, only number expansion is carried out. LOWF: [0.0, 30000.0] Option LOWF: lower filter edge in Hz. If set >0Hz and HIGHF is 0Hz, a high pass filter with LOWF Hz is applied; if set >0Hz and HIGHF is set higher than LOWF, a band pass between LOWF and HIGHF is applied; if set >0Hz and HIGHF is set higher than 0Hz but lower than LOWF, a reject band pass between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum. WHITESPACE_REPLACEMENT: The character that whitespace in comments should be substituted by (default: '_'). The BAS WebServices require that annotation markers or comment lines in input texts do not contain white spaces. This option lets you decide which character should be used to replace the white spaces. If set to the string 'NONE' no replacement takes place. CAUTION: the characters '&' and '=' do not work as replacements. CHANNELSELECT: Option CHANNELSELECT: list of comma-separated channel numbers that are selected for further processing from the input media file. 
Examples: MONO=true,CHANNELSELECT="" : merge multi-channel files into one channel, MONO=true,CHANNELSELECT="2,3,4" : merge only selected channels into one channel, MONO=false, CHANNELSELECT="3,4,1,2" : select and re-arrange channels, MONO=false, CHANNELSELECT="" : do nothing. Note that channels are numbered starting with 1 = left channel in stereo, 2 = right channel, ... By reversing the order of channel numbers in CHANNELSELECT you can swap channels, e.g. CHANNELSELECT="2,1" MONO=false will swap left and right channel of a stereo signal. marker: [punct, newline, tag] Marker used to split transcription into subtitles. If set to 'punct' (default), the transcription is split after 'terminal' punctuation marks (currently [.!?:…]). If set to 'newline', the transcription is split at newlines (\n or \r\n). If set to 'tag', the program expects a special < BREAK > tag inside the transcription (without the blanks between the brackets and BREAK!). USEREMAIL: Option USEREMAIL: if a valid email address is provided through this option, the service will send the XML file containing the results of the service run to this address after completion. It is recommended to set this option for long recordings (batch size <6, length >1h) since it is often problematic to wait for service completion over an unstable internet connection or from a laptop that might go into hibernation. The email address provided is not stored on the server. It is sometimes even advisable to kill the browser tab after starting the call and wait for the result emails (only for batch size <6!). Beware: the download link to your result(s) will be valid for 24h after you receive the email; after that all your data will be purged from the server. 
Disclaimer: the usage of this option is at your own risk; the key URL to download your result file will be sent without encryption in this email; be aware that anybody who can intercept this email will be able to access your result files using this key; the BAS at LMU Munich will not be held responsible for any security breach caused by using this email notification option. boost: [true, false] If set to true (the default), the chunker will start by running a so-called boost phase over the recording. This boost phase uses a phoneme-based decoder instead of speech recognition. Usually, the boost option reduces processing time. On noisy input or faulty transcriptions, the boost option can lead to an increase in errors. In this case (or if a previous run with boost set to 'true' has led to chunking errors), set this option to 'false'. except: Exception dictionary file overwriting the standard G2P output. Format: 2 semicolon-separated columns: word;transcript. Phonemes in transcript must be blank-separated. Example: sagt;z ' a x t. Note that the transcript must not contain phonemic symbols that are unknown to other services in the pipeline for the selected language; the service 'WebMAUS General' provides a list of all known symbols of a language. MINPAUSLEN: [0.0, 999.0] Option MINPAUSLEN: Controls the behaviour of optional inter-word silence. If set to 1, maus will detect all inter-word silence intervals that can be found (minimum length for a silence interval is then 10 msec = 1 frame). If set to values n>1, the minimum length for an inter-word silence interval to be detected is set to n*10 msec. For example MINPAUSLEN of 5 will cause MAUS to suppress inter-word silence intervals up to a length of 40msec. Since 40 msec seems to be the border of perceivable silence, we set this option default to 5. In other words: inter-word silences shorter than 50msec are not segmented but rather distributed equally to the adjacent segments. 
If one of the adjacent segments happens to be a plosive then the deleted silence interval is added totally to the plosive; if both adjacent segments are plosives, the interval is equally spread as with non-plosive adjacent segments. forcechunking: [true, false, rescue] If this parameter is set to true, the chunker will run in the experimental 'forced chunking' mode (chunker option 'force'). While forced chunking is much more likely to return a fine-grained chunk segmentation, it is also more prone to chunking errors. As a compromise, you can also set this parameter to 'rescue'. In this case, the forced chunking algorithm is only invoked when the original algorithm has returned chunks that are too long for MAUS. NOINITIALFINALSILENCE: [true, false] Option NOINITIALFINALSILENCE: Switch to suppress the automatic modeling of an optional leading/trailing silence interval. This is useful if, for instance, the signal is known to start with a stop and no leading silence, and the silence model would 'capture' the silence interval from the plosive. InputTierName: Option InputTierName: Only needed if TEXT is in TextGrid/ELAN format. Name of the annotation tier that contains the input words/chunks. BRACKETS: One or more pairs of characters which bracket annotation markers in the input. E.g. if your input text contains markers '{Lachen}' and '[noise]' that should be passed as markers and not as spoken text, set this option to '{}[]'. Note that blank replacement within such markers (see option 'WHITESPACE_REPLACEMENT') only takes place in markers/comments that are defined here. OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, srt, sub, vtt, par] Option OUTFORMAT: the output format of the pipe. Note that this depends on the selected PIPE, more precisely, whether the last service in the pipeline supports the format; if not, an ERROR is returned. 
Possible (selectable) formats are: 'TextGrid' - a praat compatible TextGrid file; 'par|bpf' - a BPF file (if the input (TEXT) is also a BPF file, the input is usually copied to the output with new (or replaced) tiers); 'csv' - a spread sheet (CSV table) containing the most prominent tiers of the annotation; 'emuDB' - an Emu compatible *_annot.json file; 'eaf' - an ELAN compatible annotation file; 'exb' - an EXMARaLDA compatible annotation file; 'tei' - an ISO TEI document; 'srt' - a SubRip subtitle format file; 'sub' - a SubViewer subtitle format file; 'vtt' - a 'WebVTT' subtitle format file. If output format is 'vtt' and a subtitle starts with a speaker marker of the form '<...>', a 'v ' is inserted before the '...'. For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not loss-less; select 'bpf' to be sure that no information is lost. syl: [yes, no] Switches syllabification of the pronunciation in the KAN tier produced by module G2P on; the syllable boundary marker is '.'. This option only makes sense in languages in which the module G2P produces a different syllabification than the module PHO2SYL (e.g. tha-TH). Otherwise use a pipe that ends with the module PHO2SYL which will create tiers MAS (phonetic syllable) and KAS (phonologic syllable). 
WARNING: syl=yes causes G2P to switch off MAUS embedded mode; this might change the output for some languages because the output phoneme inventory is then SAMPA and not the SAMPA variant used by MAUS. Subsequent modules like MAUS might report an ERROR then. ENDWORD: [0.0, 999999.0] Option ENDWORD: If set to a value n<999999, this option causes maus to end the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option STARTWORD. TROSpeakerIDASR: [true, false] If set to true (default: false), and if the selected ASR service delivers a valid speaker diarization (tier SPK) and a TRO tier, the service will insert speaker ID labels of the form '<XXX>' before each word in the TRO tier that starts a new speaker turn of the speaker labelled 'XXX'. The inserted speaker label 'XXX' is either one of the standardized labels 'S1', 'S2', ... or mapped speaker labels taken from the option 'speakMatch'. The service also checks the word preceding each speaker turn change (the last word of the previous turn) and adds a trailing '.' if the word does not already end in a final punctuation sign (one of '!?.:…'). This option enables pipelines that start with 'ASR' and end with 'SUBTITLE' to create subtitle tracks (e.g. WebVTT) that show the speaker ID and start a new subtitle at each speaker turn change. wsync: [yes, no] Yes/no decision, whether each word boundary is considered as syllable boundary. Only relevant for phonetic transcription input from MAU, PHO, or SAP tiers (for input from the KAN tier this option is always set to 'yes'). If set to 'yes', each syllable is assigned to exactly one word index. If set to 'no', syllables can be part of more than one word. UTTERANCELEVEL: [true, false] Switch on utterance level modelling (true); only for PIPEs with text input. 
Every TEXT input line is modelled as an utterance in an additional annotation layer ('TRL') between recording (bundle) and words (ORT). This is useful if the recording contains several sentences/utterances and you need hierarchical access to these in the resulting annotation structure. For example, in EMU-SDMS output the default hierarchy bundle->ORT->MAU is then changed to bundle->TRL->ORT->MAU. Note 1 : does not have any effect in CSV output. Note 2 : the use of this option causes the ORT tier to contain the raw word tokens instead of the (default) word-normalized word tokens (e.g. '5,' (raw token) vs. 'five' (word-normalized)). featset: [standard, extended] Feature set used for grapheme-phoneme conversion. The standard set is the default and comprises a letter window centered on the grapheme to be converted. The extended set additionally includes part of speech and morphological analyses. The extended set is currently available for German and British English only. For connected text the extended feature set generally yields better performance. However, if the input comprises a large number of proper names provoking erroneous part of speech tagging and morphologic analyses, then the standard feature set is more robust. pos: Option pos : N-HANS sample recording (RIFF WAVE *.wav) of the noise to be preserved in the signal (mode 'denoiser') or the target speaker to be preserved in the signal (mode 'separator'). The 'pos' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice (mode 'denoiser') nor of the 'pos' noise sample (modes 'denoiser' and 'separator'). The upload of the 'pos' sample is mandatory for N-HANS mode 'separator' and optional for mode 'denoiser' (see option 'NHANS'). 
APHONE: Option APHONE: the string used to mask phonetic/phonologic labels for anonymized terms. If not set, the service will use the label 'nib' for masking encodings in SAMPA, and the label '(.)' for encodings in IPA. If set to another label, this label is used to mask in all encodings. INSPROB: Option INSPROB: The option INSPROB influences the probability of deletion of segments. It is a constant factor (a constant value added to the log likelihood score) after each segment. Therefore, a higher value of INSPROB will cause the probability of segmentations with more segments go up, thus decreasing the probability of deletions (and increasing the probability of insertions, which are rarely modelled in the rule sets). This parameter has been evaluated on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) and found to have its optimum at 0.0 (which is nice). Therefore we set the default value of INSPROB to 0.0. INSPROB was also tested against the MAUS TEST set to confirm the value of 0.0. It had an optimum at 0.0 as well. Note that this might NOT be the optimal value for other MAUS tasks. OUTSYMBOL: [x-sampa, ipa, manner, place] Option Output Encoding (OUTSYMBOL): Defines the encoding of phonetic symbols in output. If set to 'x-sampa' (default), phonetic symbols in output are encoded in X-SAMPA (with some minor differences in languages Norwegian/Icelandic in which the retroflex consonants are encoded as 'rX' instead of X-SAMPA 'X_r'); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service produces UTF-8 IPA output in annotation tiers MAU (MAUS last module in PIPE) or in KAS/MAS (PHO2SYL last module in PIPE). 
Just for pipes with MAUS as the last module: if set to 'manner', the service produces Manner of articulation for each segment; possible values are: silence, vowel, diphthong, plosive, nasal, fricative, affricate, approximant, lateral-approximant, ejective; if set to 'place', the service produces Place of articulation for each segment; possible values are: silence, labial, dental, alveolar, post-alveolar, palatal, velar, uvular, glottal, front, central, back. RULESET: MAUS rule set file; UTF-8 encoded; one rule per line; there are two different file types defined by the extension: 1. Phonological rule set without statistical information '*.nrul', synopsis is: 'leftContext-match-rightContext>leftContext-replacement-rightContext', e.g. 't,s-e:-n>t,s-@-n'. 2. Rule set with statistical information '*.rul', synopsis is: 'leftContext,match,rightContext>leftContext,replacement,rightContext ln(P(replacement|match)) 0.0000', e.g. 'P9,n,@,n,#>P9,# -3.761200 0.000000'; 'P(replacement|match)' is the conditional probability that 'match' is being replaced by 'replacement'; the sum over all conditional probabilities with the same condition 'match' must be less than 1; the difference between the sum and 1 is the conditional probability 'P(match|match)', i.e. the probability of no change. 'leftContext/rightContext/match/replacement' = comma separated lists of SAMPA symbols or empty lists (for *.rul the leftContext/rightContext must be exactly one symbol!); special SAMPA symbols in contexts are: '#' = word boundary between words, and '<' = utterance begin (may be used instead of a phonemic symbol); digits in SAMPA symbols must be preceded by 'P' (e.g. '2:' -> 'P2:'); all used SAMPA symbols must be defined in the language specific SAMPA set (see service runMAUSGetInventar). Examples for '*.rul' : 'P9,n,@,n,#>P9,#' = 'the word final syllable /n@n/ is deleted, if preceded by /9/', '#,k,u:>#,g,u:' = 'word initial /k/ is replaced by /g/ if followed by the vowel /u:/'. 
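The '*.rul' synopsis above can be illustrated with a small parser. This is only a sketch of one reading of the format as described here, not code from MAUS: the function name parse_rul_line and the returned dictionary keys are hypothetical, and the meaning of the trailing third column ('0.0000') is not stated in this text, so it is read and ignored.

```python
import math

def parse_rul_line(line):
    """Split one '*.rul' line into its parts (hypothetical helper)."""
    # Columns: rule, ln(P(replacement|match)), and a third value whose
    # meaning is not documented here (assumption: it can be ignored).
    rule, log_prob, _undocumented = line.split()
    lhs, rhs = rule.split(">")
    lhs_symbols = lhs.split(",")
    rhs_symbols = rhs.split(",")
    return {
        # For '*.rul' both contexts are exactly one symbol.
        "left": lhs_symbols[0],
        "match": lhs_symbols[1:-1],
        "right": lhs_symbols[-1],
        "replacement": rhs_symbols[1:-1],  # empty list = deletion
        "probability": math.exp(float(log_prob)),
    }

rule = parse_rul_line("P9,n,@,n,#>P9,# -3.761200 0.000000")
print(rule["left"], rule["match"], rule["right"], rule["replacement"])
```

For the example rule, the match is the symbol sequence n, @, n between the contexts 'P9' and '#', and the empty replacement list corresponds to a deletion.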
Examples for '*.nrul' : '-->-N,k-' = 'insert /Nk/ at arbitrary positions', '#-?,E,s-#>#-s-#' = 'delete /?E/ in word /?Es/', 'aI-C-s,t,#>aI-k-s,t,#' = 'replace /C/ in word final syllable /aICst/ by /k/'. maxSpeakNumber: [0.0, 999999.0] Option maxSpeakNumber defines a hard upper bound of the number of detected speakers. If set to 0 (default), no upper bound. USEWORDASTURN: [true, false] If set to true (default: false), and if the selected ASR service delivers a valid word segmentation (tier WOR), this word segmentation is encoded as a chunk segmentation in the output (tier TRN) instead of the (possible) result of a speaker diarization (default). Both, the speaker diarization (which is basically a turn segmentation) and the word segmentation, when used as a chunk segmentation input to MAUS, might improve the phonetic alignment of MAUS, since they act as fix time anchors for the MAUS segmentation process. In some cases the word segmentation as time anchors yields better results (simply because there are more of them and a gross misalignment of MAUS is less likely); sometimes the chosen ASR service does not deliver a speaker diarization, then this option allows to switch to the word segmentation (which is delivered by all ASR services). allowOverlaps: [true, false] Option allowOverlaps: If set to true, the un-altered output of PyAnnote is returned in the SPD tier (note that overlaps cannot be handled by most annotation formats; only use if you really need to detect overlaps!); if set to false (default), overlaps, missing silence intervals etc. are resolved in the output tier SPD, making this output compatible with all annotation formats. The postprocessing works as follows: 1. all silence intervals are removed. 2. all speaker segments that are 100% within another (larger) speaker segment are removed. 3. If an overlap occurs the earlier segment(s) are truncated to the start of the new segment. 4. all remaining gaps in the segmentation are filled with silence intervals. 
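The four postprocessing steps for allowOverlaps=false can be sketched as follows. This is an illustration of the steps as listed above, not the service's actual code: the representation of segments as (start, end, label) tuples in seconds, the function name resolve_overlaps, the silence label and the total_duration argument are all assumptions.

```python
SILENCE = "<p:>"  # assumed silence label, chosen for illustration only

def resolve_overlaps(segments, total_duration, silence=SILENCE):
    """Resolve overlaps in a list of (start, end, label) segments."""
    # 1. Remove all silence intervals; sort by start, longer segments first.
    speech = sorted(
        (s for s in segments if s[2] != silence),
        key=lambda s: (s[0], -s[1]),
    )
    # 2. Remove segments lying completely within another, larger segment.
    kept = [
        seg for seg in speech
        if not any(
            o[0] <= seg[0] and seg[1] <= o[1]
            and (o[1] - o[0]) > (seg[1] - seg[0])
            for o in speech
        )
    ]
    # 3. Truncate an earlier segment to the start of the following one.
    truncated = []
    for i, (start, end, label) in enumerate(kept):
        if i + 1 < len(kept):
            end = min(end, kept[i + 1][0])
        if end > start:
            truncated.append((start, end, label))
    # 4. Fill all remaining gaps with silence intervals.
    result, cursor = [], 0.0
    for start, end, label in truncated:
        if start > cursor:
            result.append((cursor, start, silence))
        result.append((start, end, label))
        cursor = end
    if cursor < total_duration:
        result.append((cursor, total_duration, silence))
    return result

raw = [(0.0, 1.0, SILENCE), (1.0, 5.0, "S1"), (2.0, 3.0, "S2"),
       (4.0, 7.0, "S2"), (8.0, 9.0, "S1")]
for segment in resolve_overlaps(raw, total_duration=10.0):
    print(segment)
```

In this example the short 'S2' segment inside the first 'S1' segment is dropped, the first 'S1' segment is cut at 4.0 where the next 'S2' segment starts, and the gaps at 0.0-1.0, 7.0-8.0 and 9.0-10.0 are filled with silence, so the result is gap-free and non-overlapping.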
minchunkduration: [0.0, 999999.0] Lower bound for output chunk duration in seconds. Note that the chunker does not guarantee an upper bound on chunk duration. SIGNAL: Mandatory parameter SIGNAL: mono sound file or video file containing the speech signal to be processed; PCM 16 bit resolution; any sampling rate. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), most pipes will also process NIST/SPHERE (nis|sph) and video (mp4|mpeg|mpg|avi|flv). stress: [yes, no] yes/no decision whether or not word stress is to be added to the canonical transcription (KAN tier). Stress is marked by a single apostrophe (') that is inserted before the syllable nucleus into the transcription. imap: Customized mapping table from orthography to phonology. If pointing to a valid mapping table, the pipeline service will automatically set the LANGUAGE option for service G2P to 'und' (undefined) while leaving the commandline option LANGUAGE for the remaining services unchanged (most likely 'sampa'). This mapping table is then used to translate the input text into phonological symbols. See https://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/readme_g2p_mappingTable.txt for details about the format of the mapping table. MODUS: [default, standard, align] Option MODUS: Operation modus of MAUS: default is to use the language dependent default modus; the two possible modi are: 'standard' which is the segmentation and labelling using the MAUS technique as described in Schiel ICPhS 1999, and 'align', in which a forced alignment is performed on the input SAM-PA string defined in the KAN tier of the BPF (the same effect as the deprecated former option CANONLY=true). RELAXMINDUR: [true, false] Option Relax Min Duration (RELAXMINDUR) changes the default minimum duration of 30msec for consonants and short/lax vowels and of 40msec for tense/long vowels and diphthongs to 10 and 20msec respectively. 
This is not optimal for general segmentation because MAUS will start to insert many very short vowels/glottal stops where they are not appropriate. But for some special investigations (e.g. the duration of /t/) it alleviates the ceiling problem at 30msec duration. ATERMS: Option ATERMS: file encoded in UTF-8 containing the terms that are to be anonymized by the service. One term per line; terms may contain blanks, in which case only consecutive occurrences of the words within the term are anonymized. numberSpeakDiar: [0.0, 999999.0] Option numberSpeakDiar restricts the number of detected speakers by the speaker diarization to the given number. If set to 0 (default), the ASR service determines the number automatically. RELAXMINDURTHREE: [true, false] Alternative option to Relax Min Duration (RELAXMINDUR): changes the minimum duration for all models to 3 states (= 30msec with standard frame rate). This can be useful when comparing the duration of different phone groups. STARTWORD: [0.0, 999999.0] Option STARTWORD: If set to a value n>0, this option causes maus to start the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option ENDWORD. INSYMBOL: [sampa, ipa] Option INSYMBOL: Defines the encoding of phonetic symbols in input. If set to 'sampa' (default), phonetic symbols are encoded in X-SAMPA (with some coding differences in Norwegian/Icelandic); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service expects blank-separated UTF-8 IPA. PRESEG: [true, false] Option PRESEG: If set to true, a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input signal (or processed chunks within the signal) has leading and/or trailing silence. AWORD: Option AWORD: the string used to mask word labels for anonymized terms. 
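How ATERMS and AWORD interact can be illustrated with a short sketch. It is a simplified reading of the descriptions above, not the anonymizer's implementation: the function name mask_terms, the exact-match comparison of word tokens and the mask label 'ANON' are assumptions made for this example.

```python
def mask_terms(words, terms, aword="ANON"):
    """Mask every consecutive occurrence of a term in a word list.

    words: list of word labels (e.g. from the ORT tier)
    terms: lines of an ATERMS file; a term may contain blanks
    aword: mask label (hypothetical default, see option AWORD)
    """
    masked = list(words)
    for term in terms:
        term_words = term.split()
        n = len(term_words)
        if n == 0:
            continue  # skip empty lines
        # Only words that appear consecutively, in the order given
        # by the term, are masked.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == term_words:
                masked[i:i + n] = [aword] * n
    return masked

words = "I met John Smith and Smith John".split()
print(mask_terms(words, ["John Smith"]))
```

In the example only the consecutive sequence 'John Smith' is masked; the reversed sequence 'Smith John' is left untouched, which matches the statement that only consecutive occurrences of the words within a term are anonymized.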
USETRN: [true, false, force] Option USETRN: If the pipe produces/processes a chunk segmentation (CHUNKER/CHUNKPREP), this option is set automatically. If set to true, MAUS searches the input BPF for a TRN tier (turn/chunk segmentation, see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatsdeu.html#TRN). The synopsis for a TRN entry is: 'TRN: (start-sample) (duration-sample) (word-link-list) (label)', e.g. 'TRN: 23654 56432 0,1,2,3,4,5,6 sentence1' (the speech within the recording 'sentence1' starts with sample 23654, lasts for 56432 samples and covers the words 0-6). If only one TRN entry is found, the segmentation is restricted to the time range given by this TRN tier entry; this is useful if there exists a reliable pre-segmentation of the recorded utterance, i.e. the start and end of speech within the recording is known. If more than one TRN entry is found, the webservice performs a segmentation for each 'chunk' defined by a TRN entry and aggregates all individual results into a single results file; this is useful if the input consists of long recordings, for which a manual chunk segmentation is available. If USETRN is set to 'force' (deprecated since maus 4.11; use PRESEG=true instead!), a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input BPF does not contain a TRN entry and the input signal has leading and/or trailing silence. ASRType: [autoSelect, callAmberscriptASR, callEMLASR, callFraunhoferASR, callGoogleASR, callLSTDutchASR, callLSTEnglishASR, callWatsonASR, callUWEBASR, callWhisperXASR] Name of the ASR service applied. If set to 'autoSelect', the service will select the next available ASR service that supports the LANGUAGE; if set to 'allServices', the service will send the input signal to all ASR services that support LANGUAGE and output the ASR results in simple txt format. Please note that your input signal is sent to a third party ASR service which is not a part of BAS. 
By selecting a third party service you accept the end user license agreement of this service (as posted on the Web API of BAS services) and agree that your signals are sent to the selected service. Be advised that most of these services store input signals to improve their ASR performance, and that several restrictions (service dependent quotas) apply to the number and amount of input signals (see the 'Show Help' text of the service 'ASR' on the BAS Web API for details). Some ASR services only allow asynchronous processing, which means that the response time can be up to several minutes. If you need service capacity exceeding the standard quotas for a specific ASR service, please contact the BAS for special arrangements. MAUSSHIFT: Option MAUSSHIFT: If set to n, this option causes the calculated MAUS segment boundaries to be shifted by n msec (default: 0) into the future. Most likely this systematic shift is caused by a boundary bias in the training material's segmentation. The default should work for most cases. diarization: [true, false] If set to true (default: false), the ASR service will label each word in the result with a speaker label (BPF tier SPK, labels 'S1', 'S2', ... in order of appearance). If the selected ASR service does not support speaker diarization, a WARNING is issued. HIGHF: [0.0, 30000.0] Option HIGHF: upper filter edge in Hz. If set >0Hz and LOWF is 0Hz, a low pass filter with HIGHF Hz is applied; if set >0Hz and LOWF is set lower than HIGHF, a band pass between LOWF and HIGHF is applied; if set >0Hz and LOWF is set higher than HIGHF, a reject band pass between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum. silenceonly: [0.0, 999999.0] If set to a value greater than 0, the chunker will only place chunk boundaries in regions where it has detected a silent interval of at least that duration (in ms). 
Else, silent intervals are prioritized, but not to the exclusion of word boundaries without silence. On speech that has few silent pauses (spontaneous speech or speech with background noise), setting this parameter to a number greater than 0 is likely to hinder the discovery of chunk boundaries. On careful and noise-free speech (e.g. audio books) on the other hand, setting this parameter to a sensible value (e.g. 200) may reduce chunking errors. boost_minanchorlength: [2.0, 8.0] If you are using the boost phase, you can set its minimum anchor length independently of the general minimum anchor length. Setting this parameter to a low value (e.g. 2-3) means that the boost phase has a greater chance of finding preliminary chunk boundaries, which is essential for speeding up the chunking process. On the other hand, high values (e.g. 5-6) lead to more conservative and more reliable chunking decisions. If boost is set to false, this option is ignored. ADDSEGPROB: [true, false] Option Add Viterbi likelihoods (ADDSEGPROB) causes the frame-normalized natural-log total Viterbi likelihood of an aligned segment to be appended to the segment label in the output annotation (the MAU tier). This might be used as a 'quasi quality measure' of how well the acoustic signal in the aligned segment has been modelled by the combined acoustical and pronunciation model of MAUS. Note that the values are not probabilities but likelihood densities, and therefore are not comparable for different signal segments; they are, however, comparable for the same signal segment. Warning: this option breaks the BPF standard for the MAU tier and must not be used if the resulting MAU tier is to be processed further (e.g. in a pipe). Implemented only for output phoneme symbol set SAMPA (default). Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". 
"success" states if the processing was successful or not, "downloadLink" specifies the location where the output file of the pipeline can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during the processing. Depending on input parameter OUTFORMAT the output file in "downloadLink" can be of several different file formats; see mandatory parameter OUTFORMAT for details. ---------------------------------------------------------------- ---------------------------------------------------------------- runCOALAGetTemplates ------------------ Description: Returns a zip file with the template table files and instructions on how to fill them in. The tables are necessary to create CMDI metadata files with runCOALA. Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runCOALAGetTemplates' Parameter description: Output: A zip file containing the necessary template files and instructions for running COALA. ---------------------------------------------------------------- ---------------------------------------------------------------- runDoReCo ------------------ Description: A service specifically designed for the DoReCo project. 
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F mappingFile=@ -F relaxMinDur=false -F exceptionList=@ -F ruleSet=@ -F fileIn=@ -F fileMapping=@ -F relaxMinDurThree=false -F signal=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runDoReCo' Parameters: [mappingFile] [relaxMinDur] [exceptionList] [ruleSet] fileIn [fileMapping] [relaxMinDurThree] signal Parameter description: mappingFile: Customized mapping table from orthography to phonology (SAM-PA). This mapping table is then used to translate the input text into phonological symbols. 
See http://www.bas.uni-muenchen.de/Bas/readme_g2p_mappingTable.txt for details about the format of the mapping table. relaxMinDur: [true, false] Optional MAUS option RELAXMINDUR (see description in runMAUS service) exceptionList: Optional list containing symbols and strings that should be deleted before translating the orthography to SAM-PA symbols (e.g., [SONG], (0.6), etc.). ruleSet: Optional MAUS RULESET (see description in runMAUS service) fileIn: File that contains the tiers to process (either EAF or TextGrid). fileMapping: File that contains a mapping from the base filename (e.g., file001.wav) to the tier (e.g., "utterance1 utterance2") that should be used as input to the SAM-PA transliteration based on mappingFile (orthography to SAM-PA mapping). The file must have two columns separated by a semicolon, where the first column contains the filename and the second one the tier names to process. The second column can consist of multiple tiers separated by a blank, which will then all be processed. Format example: "file0001.wav;utterance1 utterance2 utterance3" relaxMinDurThree: [true, false] Optional MAUS option RELAXMINDURTHREE (see description in runMAUS service) signal: Mandatory parameter SIGNAL: mono sound file or video file containing the speech signal to be processed; PCM 16 bit resolution; any sampling rate. The mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), NIST/SPHERE (nis|nist|sph) and video (mp4|mpeg|mpg). Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the output file of the pipeline can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during the processing. 
Depending on input parameter OUTFORMAT the output file in "downloadLink" can be of several different file formats; see mandatory parameter OUTFORMAT for details. ---------------------------------------------------------------- ---------------------------------------------------------------- runChunkPreparation ------------------ Description: This pre-processor to MAUS transforms a chunk segmentation (CSV, EAF or TextGrid) into a BAS Partitur Format (BPF) file containing the tiers tokenized words (ORT) and chunk segmentation (TRN). For details about the BAS Partitur Format (BPF) see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. A 'chunk segmentation' is a rough segmentation of a long speech signal into longer stretches of spoken text ('chunks'), e.g. sentences, speaker turns; each chunk consists of timing information (begin/end) and a label that contains the spoken orthographic text (UTF-8 encoded); chunks can be encoded as tiers of an annotation format (e.g. praat TextGrid or ELAN EAF) or in form of a table (CSV). The BPF TRN and ORT tiers in the output BPF file contain the (tokenized) word chunks as given in the specified input file tier; the presence of the TRN tier improves the performance of a subsequent automatic phonetic segmentation by WebMAUS. 'Tokenization' here means not only the break-up of the transcript at white spaces but also the replacement of digits by number names and the deletion of punctuation and some special characters (see service description runG2P for details). If you want to avoid these normalisations, select the language code (option 'Language'/'lng') 'und'. 
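A chunk table for iform=csv can be generated from chunk times in seconds. The following is a minimal sketch, not part of the service: the function name chunks_to_csv and the sample chunks are hypothetical, and it reads the three semicolon-separated columns described under 'iform' as start sample, end sample and transcript, which is an assumption about the meaning of "onset, offset (in samples)".

```python
def chunks_to_csv(chunks, sample_rate):
    """Build a semicolon-separated chunk table (hypothetical helper).

    chunks: iterable of (begin_sec, end_sec, text) tuples.
    Assumption: columns are onset sample; offset sample; transcript.
    """
    lines = []
    for begin_sec, end_sec, text in chunks:
        # A semicolon or line break inside the transcript would corrupt the table.
        if ";" in text or "\n" in text:
            raise ValueError("transcript must not contain ';' or line breaks")
        onset = round(begin_sec * sample_rate)   # seconds -> samples
        offset = round(end_sec * sample_rate)
        lines.append(f"{onset};{offset};{text}")
    return "\n".join(lines) + "\n"


table = chunks_to_csv(
    [(0.5, 2.25, "guten Morgen"), (2.25, 4.0, "wie geht es dir")],
    sample_rate=16000,
)
print(table, end="")
```

The resulting text, written to a UTF-8 file, is the kind of table that would be passed as parameter 'i' together with iform=csv.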
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F com=no -F lng=deu-DE -F tier=ORT -F rate=-1 -F i=@ -F iform=tg 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runChunkPreparation' Parameters: [com] [lng] [tier] [rate] i [iform] Parameter description: com: [yes, no] Option com: yes/no decision whether <*> strings in the annotation input should be treated as annotation markers. If set to 'yes', then strings of this type are considered as annotation markers that are not processed but passed on to the output. The string * within the <*> must not contain any white space characters. This means that the markers appear in the ORT and KAN tier of the output BPF file with a word index of their own. WebMAUS makes use of the markers < usb > (e.g. non-understandable word or other human noises) and < nib > (non-human noise) without the blanks between "usb", "nib" and the brackets "<" and ">" (which are needed for formatting reasons). All other markers <*> are modelled as silence, if you use this service as a pre-processing for WebMAUS. Markers must not contain white spaces, and must be separated from word tokens by blanks. They do not need to be blank-separated from non-word tokens such as punctuation. lng: [aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, cze-CZ, nld-NL, eng-AU, eng-GB, eng-NZ, eng-US, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, swe-SE, tha-TH, guf-AU, und] RFC5646 locale language code of the speech to be processed; this is necessary since the tokenization and the replacement of numerals in the input text is language-dependent; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 
'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; alternatively and where possible, the ISO 639-3 language code is supported; non-standard codes: 'nze' stands for New Zealand English, 'arb' is for variety-independent Arabic romanization ('Tamimi Romanization'), 'use' for American English. 'und' (undefined) can be used to pass the tokens unchanged, i.e. the tokens found in the chunk label are passed unchanged into the 'ORT' tier of the output BPF. tier: Name of the item in the TextGrid or EAF input, which is to be transformed into TRN and ORT tier of the BPF format. Case-sensitive. Only ELAN annotation tiers that contain timing information are processed. It is possible in an EAF file to have a 'referenced' annotation tier (element REF_ANNOTATION) that only refers to another tier with timing information, but our service cannot process this. A work-around is to go back into ELAN and copy the contents of the reference tier onto the tier with timing information and then use this tier as input. rate: [0.0, 999999.0] Sample rate of signal file, from which the TextGrid or EAF file has been derived. Needed for the conversion of absolute times into samples. i: Input file containing the chunk segmentation. MIMEType depends on input parameter iform. iform: [tg, eaf, csv] Format of the input file. Currently 'tg' (standard or short TextGrid), 'eaf' (ELAN annotation format) and 'csv' are supported. Only one tier in the TextGrid/EAF input file is processed (see also option 'tier' for details about ELAN tiers). The csv table file should contain three columns separated by a semicolon containing time onset, offset (in samples), and the transcript (UTF-8 encoded), respectively. Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". 
"success" states if the processing was successful or not, "downloadLink" specifies the location where the output BPF file can be found, "output" contains the console output of the service that is mostly useful during debugging errors, and "warning" contains any warnings that occurred during the processing. The format of the output file is BAS Partitur Format (BPF) containing the tiers ORT, KAN, TRN. ---------------------------------------------------------------- ---------------------------------------------------------------- runMINNI ------------------ Description: This service segments and labels a speech audio file into SAM-PA (or IPA) phonetic segments without any text/phonological input; it uses HMM technology combined with a language-specific phonotactic bigram model. This is a general service to process a single file which enables the usage of all possible options of MINNI. See the section Input for a detailed description of these options or use the operation 'runMAUSGetHelp' to download a current version of the MAUS/MINNI documentation. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F OUTFORMAT=TextGrid -F PRESEG=false -F MAUSSHIFT=default -F INSPROB=0.0 -F MINPAUSLEN=5 -F OUTSYMBOL=sampa -F WEIGHT=default -F ADDSEGPROB=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMINNI' Parameters: SIGNAL [LANGUAGE] [OUTFORMAT] [PRESEG] [MAUSSHIFT] [INSPROB] [MINPAUSLEN] [OUTSYMBOL] [WEIGHT] [ADDSEGPROB] Parameter description: SIGNAL: mono sound file containing the speech signal to be segmented; PCM 16 bit resolution; any sampling rate; optimal results if leading and trailing silence intervals are truncated before processing. Although the mimetype of this input file is restricted to audio/x-wav (wav|WAV), the service will also process NIST/SPHERE (nis|NIS) and ALAW (al|AL|dea|DEA). 
LANGUAGE: [afr-ZA, aus-AU, cat-ES, nld-BE, nld-NL, eng-AU, eng-GB, eng-US, ekk-EE, fra-FR, deu-DE, gsw-CH, hun-HU, ita-IT, nan-TW, nor-NO, fas-IR, pol-PL, rus-RU, spa-ES, tha-TH] Language of the speech to be processed; defines the possible phoneme symbol set in MAUS input; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [- iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'. The non-standard language code 'sampa' denotes a language independent SAM-PA variant of MAUS for which the SAM-PA symbols in the input BPF must be blank separated (e.g. /h OY t @/). OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, par] Option 'Output format' (OUTFORMAT): Defines the possible output formats: TextGrid - a praat compatible TextGrid file; bpf - a BPF file with tier MAU (phonetic segmentation); csv - a spreadsheet (CSV table) that contains the phonetic segmentation; emuDB - an Emu compatible *_annot.json file; eaf - an ELAN compatible annotation file; exb - an EXMARaLDA compatible annotation file; tei - ISO TEI document (XML). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not loss-less; select 'bpf' to be sure that no information is lost. 
PRESEG: [true, false] Option PRESEG: If set to true, a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input signal has leading and/or trailing silence. MAUSSHIFT: If set to n, this option causes the calculated MAUS segment boundaries to be shifted by n msec (default: 0) into the future. Most likely this systematic shift is caused by a boundary bias in the training material's segmentation. The default should work for most cases. INSPROB: The option INSPROB influences the probability of detecting two segments instead of one. It is a constant factor (a constant value added to the log likelihood score) after each segment. Therefore, a higher (positive) value of INSPROB will cause the probability of segmentations with more segments to go up; conversely, negative values will cause the number of detected segments to go down. This parameter has only been evaluated using MAUS (not MINNI) on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) and found to have its optimum at 0.0 (which is nice). Therefore we set the default value of INSPROB to 0.0. INSPROB was also tested against the MAUS TEST set to confirm the value of 0.0. It had an optimum at 0.0 as well. Note that this might NOT be the optimal value for MINNI processing. MINPAUSLEN: [0.0, 999.0] Option MINPAUSLEN: Controls the behaviour of optional inter-phone silence. If set to 1, MAUS will detect all inter-phone silence intervals that can be found (minimum length for a silence interval is then 10 msec = 1 frame). If set to values n>1, the minimum length for an inter-phone silence interval to be detected is set to n*10 msec. For example, a MINPAUSLEN of 5 will cause MAUS to suppress inter-phone silence intervals up to a length of 40 msec. Since 40 msec seems to be the border of perceivable silence, the default of this option is 5. 
In other words: inter-phone silences shorter than 50 msec are not segmented but rather distributed equally to the adjacent segments. If one of the adjacent segments happens to be a plosive then the deleted silence interval is added totally to the plosive; if both adjacent segments are plosives, the interval is equally spread as with non-plosive adjacent segments. OUTSYMBOL: [sampa, ipa, manner, place] Option Output Encoding (OUTSYMBOL): Defines the encoding of phonetic symbols in output. If set to 'sampa' (default), phonetic symbols in output are encoded in X-SAMPA (with some minor differences in languages Norwegian/Icelandic in which the retroflex consonants are encoded as 'rX' instead of X-SAMPA 'X_r'); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service produces UTF-8 IPA output. If set to 'manner', the service produces IPA manner of articulation for each segment; possible values are: silence, vowel, diphthong, plosive, nasal, fricative, affricate, approximant, lateral-approximant, ejective. If set to 'place', the service produces IPA place of articulation for each segment; possible values are: silence, labial, dental, alveolar, post-alveolar, palatal, velar, uvular, glottal, front, central, back. WEIGHT: The option WEIGHT weights the influence of the statistical phonotactic bigram model (the a-priori probability of pronunciation) against the acoustical scores. More precisely, WEIGHT is multiplied with the phonotactic model score (log likelihood) before adding the score to the acoustical score within the Viterbi search. Since MINNI uses a phonotactic bigram model, increasing WEIGHT will at some point cause MINNI to always choose the same most likely sequence of phones according to the bigram model (disregarding the acoustics) with equally long segments, i.e. 
no meaningful segmentation at all; lower values of WEIGHT will cause phoneme sequences to be detected according to acoustic evidence, even if the resulting pronunciation is less likely according to the phonotactic bigram model; if WEIGHT is set to 0.0 the bigram is completely ignored and MINNI performs a phone recognition based only on acoustic likelihood (and any sequence of phones is a-priori equally probable). If the acoustic quality of the signal is very good and the HMMs of the language are well trained, it makes sense to lower WEIGHT to achieve more precise results given the acoustics. For most languages the default of this option is 1.0 (which means that acoustic evidence and a-priori pronunciation probability are treated equally). ADDSEGPROB: [true, false] Option Add Viterbi likelihoods (ADDSEGPROB) causes the frame-normalized natural-log total Viterbi likelihood of an aligned segment to be appended to the segment label in the output annotation (the MAU tier). This might be used as a 'quasi quality measure' of how well the acoustic signal in the aligned segment has been modelled by the combined acoustical and pronunciation model of MAUS. Note that the values are not probabilities but likelihood densities, and therefore are not comparable for different signal segments; they are, however, comparable for the same signal segment. Warning: this option breaks the BPF standard for the MAU tier and must not be used if the resulting MAU tier is to be processed further (e.g. in a pipe). Implemented only for output phoneme symbol set SAMPA (default). Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the segmentation file can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains the output that is mostly useful during debugging errors, and "warning" lists warnings, if any occurred during the processing. 
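The XML response described under Output can be read with a few lines of client code. The following is a minimal sketch using only the Python standard library; the function name parse_bas_response, the root element name in the sample and the example download URL are assumptions for illustration, as is the encoding of "success" as the text 'true' or 'false'. Only the four tag names are taken from the description above.

```python
import xml.etree.ElementTree as ET


def parse_bas_response(xml_text):
    """Extract the four documented tags from a service response (sketch)."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        # Missing or empty elements are returned as an empty string.
        node = root.find(tag)
        return (node.text or "").strip() if node is not None else ""

    return {
        # Assumption: success is transmitted as the text 'true' or 'false'.
        "success": text_of("success").lower() == "true",
        "downloadLink": text_of("downloadLink"),
        "output": text_of("output"),
        "warning": text_of("warning"),
    }


# Hypothetical response; the root element name is not specified in this help text.
sample = (
    "<WebServiceResponseLink>"
    "<success>true</success>"
    "<downloadLink>https://example.org/result.TextGrid</downloadLink>"
    "<output></output><warning></warning>"
    "</WebServiceResponseLink>"
)
result = parse_bas_response(sample)
print(result["success"], result["downloadLink"])
```

A client would typically check "success" first, print "output" and "warning" when it is false, and fetch "downloadLink" otherwise.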
---------------------------------------------------------------- ---------------------------------------------------------------- runTextAlign ------------------ Description: Optimal alignment of text sequence pairs, for example the optimal alignment of an orthographic string to its corresponding phonological transcript. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F cost=intrinsic -F atype=dir -F i=@ -F displc=no -F costfile=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runTextAlign' Parameters: [cost] [atype] i [displc] [costfile] Parameter description: cost: [naive, intrinsic, import, g2p_aus, g2p_deu, g2p_ekk, g2p_eng, g2p_fin, g2p_fra, g2p_gsw, g2p_hat, g2p_hun, g2p_ita, g2p_kat, g2p_nld, g2p_nze, g2p_pol, g2p_ron, g2p_rus, g2p_slk, g2p_sqi, g2p_use] Cost function for the edit operations substitution, deletion, and insertion to be used for the alignment. 'naive' assigns cost 1 to all operations except null-substitution, i.e. the substitution of a symbol by itself, which receives cost 0. This 'naive' cost function should be used only if the pairs to be aligned share the same vocabulary, which is NOT the case e.g. in grapheme-phoneme alignment (grapheme 'x' is not the same as phoneme 'x'). 'g2p_deu', 'g2p_eng' etc. are predefined cost functions for grapheme-phoneme alignment for the respective language expressed as iso639-3. By selecting 'intrinsic' a cost function is trained on the input data and returned in the output zip. Costs are derived from co-occurrence probabilities, thus the bigger the input file, the more reliable the emerging cost function. By 'import' the user can provide his/her own cost function file, which must be a semicolon-separated 3-column csv text file. Examples: v;w;0.7 - the substitution of 'v' by 'w' costs 0.7; v;_;0.8 - the deletion of 'v' costs 0.8; _;w;0.9 - the insertion of 'w' costs 0.9. 
A typical use case is to train a cost function on a big data set with cost='intrinsic', and to subsequently apply this cost function on smaller data sets with cost='import'. atype: [dir, sym] Alignment type: 'dir' - align the second column to the first. 'sym' - symmetric alignment. i: csv text file with two semicolon-separated columns. Each row contains a sequence pair to be aligned. The sequence elements must be separated by a blank. Example: a word and its canonical transcription like S c h e r z;S E6 t s displc: [yes, no] Yes/no decision whether alignment costs should be displayed in a third column in the output file. costfile: csv text file with three semicolon-separated columns. Each row contains three columns of the form a;b;c, where c denotes the cost for substituting a by b. Insertion and deletion are marked by an underscore. Examples (from German grapheme-phoneme conversion): e;E;0.96 - replacing grapheme e by phoneme E costs 0.96. e;_;0.89: e-deletion costs 0.89. _;E;0.99: E-insertion costs 0.99. Output: Output zip file that contains a semicolon-separated 2- or 3-column csv text file with the aligned output. The third column comprises the alignment costs, if the parameter 'displc' is set to 'yes'. If 'cost' is set to 'intrinsic' the zip file additionally contains a cost function file in the format as described for the parameter 'cost'. ---------------------------------------------------------------- ---------------------------------------------------------------- getLoadIndicator ------------------ Description: Returns an indicator of how high the server load is: 0 (for low load, i.e., less than 50 percent load), 1 (for middle load, i.e., between 50 and 100 percent load), and 2 (for high load, i.e., more than 100 percent load). 
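A client may want to check this indicator before submitting a larger batch. The following is a minimal sketch of how the three values could be interpreted; the function names interpret_load and should_submit, the threshold policy, and the assumption that the response body is the bare number as plain text are all illustrative and not part of the service.

```python
# Meaning of the three indicator values as given in the description above.
LOAD_LEVELS = {
    0: "low load (less than 50 percent)",
    1: "middle load (between 50 and 100 percent)",
    2: "high load (more than 100 percent)",
}


def interpret_load(body):
    """Map the response body to (level, description).

    Assumption: the body is the bare number, possibly with surrounding whitespace.
    """
    level = int(body.strip())
    if level not in LOAD_LEVELS:
        raise ValueError(f"unexpected load indicator: {level}")
    return level, LOAD_LEVELS[level]


def should_submit(body, max_level=1):
    """Hypothetical client policy: submit only while load is at or below max_level."""
    level, _ = interpret_load(body)
    return level <= max_level


print(interpret_load("1\n"))
print(should_submit("2"))
```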
Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/getLoadIndicator' Parameter description: Output: Number that indicates the load. ---------------------------------------------------------------- ---------------------------------------------------------------- runChunker ------------------ Description: The chunker is a preprocessing tool for the MAUS segmentation service that splits very long recordings into smaller 'chunks'. Since MAUS's runtime grows quadratically with input duration, it cannot be used on recordings that are longer than approx. 3000 words. In this case, the chunker can presegment the recording into shorter "chunks". This chunk segmentation, which is NOT a semantically meaningful sentence or turn segmentation, can then be used to speed up the MAUS segmentation process. Like MAUS, the chunker accepts a media file containing the speech signal and a BAS Partitur Format (BPF) file containing a canonical transcription of the recording (KAN tier). This canonical transcription can be derived from an orthographic text using the G2P tool (runG2P). The chunker outputs a new BAS Partitur File with a TRN tier that can be used as chunk segmentation input to MAUS. 
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F maus=false -F language=deu-DE -F aligner=hirschberg -F USEREMAIL= -F bpf=@ -F boost=true -F force=false -F audio=@ -F silenceonly=0 -F minanchorlength=3 -F boost_minanchorlength=4 -F insymbols=sampa -F minchunkduration=15 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runChunker' Parameters: [maus] [language] [aligner] [USEREMAIL] bpf [boost] [force] audio [silenceonly] [minanchorlength] [boost_minanchorlength] [insymbols] [minchunkduration] Parameter description: maus: [true, false] If this parameter is set to true, the recognition module will model words as MAUS graphs as opposed to canonical chains of phonemes. This will slow down the recognition engine, but it may help with non-canonical speech (e.g., accents or dialects). language: [aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-BE, nld-NL, eng-US, eng-AU, eng-GB, eng-NZ, eng-SC, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hun-HU, isl-IS, ita-IT, jpn-JP, sampa, ltz-LU, mlt-MT, nan-TW, nor-NO, fas-IR, pol-PL, por-PT, ron-RO, rus-RU, spa-ES, tha-TH] Language of the speech to be processed. This parameter defines the set of possible input phonemes and their acoustic models. RFC5646 sub-structure 'iso639-3 - iso3166-1 [- iso3166-2]', e.g. 'eng-US' for American English. The language code 'sampa' (not RFC5646) denotes a language independent SAM-PA variant of MAUS for which the SAM-PA symbols in the input BPF must be blank separated (e.g. /h OY t @/). aligner: [hirschberg, fast] Symbolic aligner to be used. The "fast" aligner performs approximate alignment by splitting the alignment matrix into "windows" of size 5000*5000. The "hirschberg" aligner performs optimal matching. On recordings below the 1 hour mark, the choice of aligner does not make a big difference in runtime. On longer recordings, you can improve runtime by selecting the "fast" aligner. 
Note however that this choice increases the probability of errors on recordings with untranscribed stretches (such as long pauses, musical interludes, untranscribed speech). Therefore, the "hirschberg" aligner should be used on this kind of material. USEREMAIL: Option USEREMAIL: if a valid email address is provided through this option, the service will send the XML file containing the results of the service run to this address after completion. It is recommended to set this option for long recordings (batch size <6, length >1h) since it is often problematic to wait for service completion over an unstable internet connection or from a laptop that might go into hibernation. The email address provided is not stored on the server. It is sometimes even advisable to kill the browser tab after starting the call and wait for the result emails (only for batch size <6!). Beware: the download link to your result(s) will be valid for 24h after you receive the email; after that all your data will be purged from the server. Disclaimer: the usage of this option is at your own risk; the key URL to download your result file will be sent without encryption in this email; be aware that anybody who can intercept this email will be able to access your result files using this key; the BAS at LMU Munich will not be held responsible for any security breach caused by using this email notification option. bpf: Phonemic transcription of the utterance to be segmented. Format is a BAS Partitur Format (BPF) file with a KAN tier. The KAN tier contains a table with 3 columns and one line per word in the input. Column 1 is always 'KAN:'; column 2 is an integer starting with 0 denoting the word position within the input; column 3 contains the canonical pronunciation of the word coded in SAM-PA. The canonical pronunciation string may contain phoneme-separating blanks. For supported languages, the BPF can be derived using the G2P service (runG2P). 
See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF. boost: [true, false] If set to true (the default), the chunker will start by running a so-called boost phase over the recording. This boost phase uses a phoneme-based decoder instead of speech recognition. Usually, the boost option reduces processing time. On noisy input or faulty transcriptions, the boost option can lead to an increase in errors. In this case (or if a previous run with boost set to 'true' has led to chunking errors), set this option to 'false'. force: [true, false, rescue] If this parameter is set to true, the chunker will run in the experimental 'forced chunking' mode. While forced chunking is much more likely to return a fine-grained chunk segmentation, it is also more prone to chunking errors. As a compromise, you can also set this parameter to 'rescue'. In this case, the forced chunking algorithm is only invoked when the original algorithm has returned chunks that are too long for MAUS. audio: Mono WAVE or NIST/SPHERE sound file or video file (MP4,MPEG) containing the speech signal to be segmented. PCM 16 bit resolution, any sampling rate. silenceonly: [0.0, 999999.0] If set to a value greater than 0, the chunker will only place chunk boundaries in regions where it has detected a silent interval of at least that duration (in ms). Else, silent intervals are prioritized, but not to the exclusion of word boundaries without silence. On speech that has few silent pauses (spontaneous speech or speech with background noise), setting this parameter to a number greater than 0 is likely to hinder the discovery of chunk boundaries. On careful and noise-free speech (e.g. audio books) on the other hand, setting this parameter to a sensible value (e.g. 200) may reduce chunking errors. minanchorlength: [2.0, 8.0] The chunker performs speech recognition and symbolic alignment to find regions of correctly aligned words (so-called 'anchors'). 
Setting this parameter to a high value (e.g. 4-5) means that the chunker finds chunk boundaries with higher certainty. However, the total number of discovered chunk boundaries may be reduced as a consequence. A low value (e.g. 2) is likely to lead to a more fine-grained chunking result, but with lower confidence for individual chunk boundaries. boost_minanchorlength: [2.0, 8.0] If you are using the boost phase, you can set its minimum anchor length independently of the general minimum anchor length. Setting this parameter to a low value (e.g. 2-3) means that the boost phase has a greater chance of finding preliminary chunk boundaries, which is essential for speeding up the chunking process. On the other hand, high values (e.g. 5-6) lead to more conservative and more reliable chunking decisions. If boost is set to false, this option is ignored. insymbols: [sampa, ipa] Defines the encoding of phonetic symbols in the input KAN tier. If set to 'sampa' (default), phonetic symbols are encoded in language specific SAM-PA (with some coding differences to official SAM-PA; use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA). If set to 'ipa', the service expects blank-separated UTF-8 IPA. minchunkduration: [0.0, 999999.0] Lower bound for output chunk duration in seconds. Note that the chunker does not guarantee an upper bound on chunk duration. Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful or not, "downloadLink" specifies the location where the output BPF file is provided. The BPF contains the content of the input BPF (option "bpf") with an appended TRN tier. The TRN tier contains the discovered chunking of the signal. The output BPF can be used as an input BPF to runMAUS together with the option USETRN=true. 
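The TRN entries of the output BPF can be inspected before passing the file on to runMAUS. The following is a minimal sketch that parses one TRN line according to the synopsis given in the USETRN option ('TRN: (start-sample) (duration-sample) (word-link-list) (label)') and converts it to seconds; the names TrnEntry, parse_trn_line and chunk_seconds are hypothetical, and the sample rate of 16000 Hz is only an example value.

```python
from typing import List, NamedTuple, Tuple


class TrnEntry(NamedTuple):
    start: int             # first sample of the chunk
    duration: int          # chunk length in samples
    word_links: List[int]  # indices of the words covered by the chunk
    label: str


def parse_trn_line(line: str) -> TrnEntry:
    """Parse 'TRN: start duration link,link,... label' (label may contain blanks)."""
    parts = line.split(None, 4)
    if len(parts) < 4 or parts[0] != "TRN:":
        raise ValueError(f"not a TRN entry: {line!r}")
    links = [int(i) for i in parts[3].split(",")]
    label = parts[4].strip() if len(parts) > 4 else ""
    return TrnEntry(int(parts[1]), int(parts[2]), links, label)


def chunk_seconds(entry: TrnEntry, sample_rate: int) -> Tuple[float, float]:
    """Return (begin, end) of the chunk in seconds."""
    begin = entry.start / sample_rate
    end = (entry.start + entry.duration) / sample_rate
    return begin, end


entry = parse_trn_line("TRN: 23654 56432 0,1,2,3,4,5,6 sentence1")
print(entry)
print(chunk_seconds(entry, 16000))
```

Such a check makes it easy to spot chunks that are still too long for MAUS, since the chunker does not guarantee an upper bound on chunk duration.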
---------------------------------------------------------------- ---------------------------------------------------------------- runG2P ------------------ Description: This web service converts an orthographic text input into a canonical phonological transcript (standard pronunciation). G2P (short for 'grapheme to phoneme conversion') reads a continuous text or word list, and estimates the most likely string of phonemes that a standard speaker of that language is expected to articulate. G2P uses statistically trained decision trees and additional techniques such as part-of-speech tagging and morphological segmentation to improve the decision process. Each language version of G2P is trained on a large set of pronunciations from this language (a pronunciation dictionary) or is based on a letter-sound mapping table in case of simple unique correspondences. The way G2P operates depends on numerous options and the chosen input and output format. For instance, some input formats contain non-tokenized text (e.g. txt) that will be subject to tokenisation and normalisation, while others contain already tokenized text (list, bpf) that will be processed as is. Most output formats also come in an 'extended' version (indicated by an 'ext' in the format name, e.g. 'exttab') that lists more information than the phonemic transcript; extended output is so far available only for a small subset of languages. For more detailed information about the methods G2P applies please refer to: Reichel, U.D. (2012). PermA and Balloon: Tools for string alignment and text processing, Proc. of the Interspeech. Portland, Oregon, paper no. 346.
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F com=no -F tgrate=16000 -F stress=no -F imap=@ -F lng=deu-DE -F lowercase=yes -F syl=no -F outsym=sampa -F nrm=no -F i=@ -F tgitem=ort -F align=no -F featset=standard -F iform=txt -F except=@ -F embed=no -F oform=bpf 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runG2P' Parameters: [com] [tgrate] [stress] [imap] [lng] [lowercase] [syl] [outsym] [nrm] i [tgitem] [align] [featset] [iform] [except] [embed] [oform] Parameter description: com: [yes, no] yes/no decision whether <*> strings should be treated as annotation markers. If set to 'yes', then strings of this type are considered as annotation markers that are not processed but passed on to the output. The string * within the <*> must not contain any white space characters. For oform='bpf' this means, that the markers appear in the ORT and KAN tier with a word index on their own. WebMAUS makes use of the markers < usb > (e.g. non-understandable word or other human noises) and < nib > (non-human noise) without the blanks between "usb", "nib" and the brackets "<" and ">" (which are needed for formatting reasons). All other markers <*> are modelled as silence, if you use runG2P for WebMAUS. Markers must not contain white spaces, and must be separated from word tokens by blanks. They do not need to be blank-separated from non-word tokens as punctuation. tgrate: [0.0, 999999.0] Signal sampling rate: only needed, if 'iform' ('Input format') is 'tg' and 'oform' ('Output format') is 'bpf(s)'. Sample rate of the corresponding speech signal; needed to convert time values from TextGrid to sample values in BAS Partitur Format (BPF) file. If you don't know the sample rate, look in the Properties/Get Info list of the sound file. stress: [yes, no] yes/no decision whether or not word stress is to be added to the canonical transcription (KAN tier). 
Stress is marked by a single apostrophe (') that is inserted before the syllable nucleus into the transcription. imap: Customized mapping table from orthography to phonology. If the option 'lng' ('Language') is set to 'und' ('User defined'), a G2P mapping table must be provided via this option. This mapping table is then used to translate the input text into phonological symbols. See https://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/readme_g2p_mappingTable.txt for details about the format of the mapping table. lng: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, cze-CZ, nld-NL, eng-US, eng-AU, eng-GB, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, ltz-LU, mlt-MT, nan-TW, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, swe-SE, tha-TH, guf-AU, und] Language: RFC5646 locale code of the processed text; defines the phoneme set of input and output; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; the code 'und' ('User defined') allows the user to upload a customized mapping from orthographic to phonologic form (see option 'imap'); for backwards compatibility some older non-standard codes are still supported: 'nze' stands for New Zealand English, 'use' for American English.
Special languages: 'gsw-CH' denotes text written in Swiss German 'Dieth' transcription (https://en.wikipedia.org/wiki/Swiss_German); 'gsw-CH-*' are localized varieties in larger Swiss cities; 'jpn-JP' (Japanese) accepts Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; 'aus-AU' (Australian Aboriginal languages, including Kunwinjku, Yolnu Matha) accepts so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); 'fas-IR' (Persian) accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details); 'arb' is a macro language covering all Arabic varieties; the input must be encoded in a broad phonetic romanization developed by Jalal Tamimi and colleagues (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/TamimiRomanization.pdf for details). lowercase: [yes, no] yes/no decision whether orthographic input is treated as case-sensitive (no) or not (yes). Applies only if the option 'lng' is set to 'und' and a customized mapping table is loaded via option 'imap'. syl: [yes, no] yes/no decision whether or not the output transcription is to be syllabified. Syllable boundaries '.' are inserted into the transcription with separating blanks. outsym: [sampa, x-sampa, maus-sampa, ipa, arpabet] Output phoneme symbol inventory. The language-specific SAMPA variant is the default. Alternatives are: language-independent X-SAMPA, MAUS-SAMPA, IPA and ARPABET. MAUS-SAMPA maps the output to a language-specific phoneme subset that WEBMAUS can process. ARPABET is supported for eng-US only. nrm: [yes, no] Text normalization. Currently available for German and English variants only. Detects and expands 22 non-standard word types. All output file types are supported, but normalization is not available for the following tokenized input types: bpf, TextGrid, and tcf.
If switched off, only number expansion is carried out. i: Orthographic text or annotation of the utterance to be converted; encoding must be UTF-8; formats are defined in option 'iform'. Continuous text input undergoes several text normalization stages resulting in a tokenized word chain that represents the most likely spoken utterance (e.g. numbers are converted into their full word forms). See the webservice help page of the Web interface for details: https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/Grapheme2Phoneme. Special languages for text input: Thai, Russian and Georgian expect their respective standard alphabets; Japanese allows Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; Swiss German expects input to be transcribed in 'Dieth' (https://en.wikipedia.org/wiki/Swiss_German); Australian Aboriginal languages (including Kunwinjku, Yolnu Matha) expect so-called 'Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); Persian accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details). tgitem: TextGrid tier name: only needed if 'iform' ('Input format') is 'tg'. Name of the TextGrid tier (item) that contains the words to be transcribed. In case of TextGrid output, this tier is the reference tier for the added tiers. align: [yes, no, maus] yes/no/sym decision whether or not the transcription is to be letter-aligned. Examples: if align is set to 'yes' the transcription for 'archaeopteryx' is 'A: _ k _ _ I Q p t @ r I k+s', i.e. 'ar' is mapped to 'A: _', and 'x' to 'k+s'. If contained in the output, syllable boundaries and word stress are '+'-concatenated with the preceding, resp. following symbol. 'sym' causes a special symmetric alignment which is needed e.g. for MAUS rule training, i.e.
word: a r c h a e o p t e r y x _; transcription: A: _ k _ _ I Q p t @ r I k s. Syllable boundaries and word stress are not part of the output of this 'sym' alignment. For the output formats 'tab', 'exttab', 'lex', and 'extlex' the aligned orthography is also split into letters to account for multi-character letters in languages such as Hungarian. featset: [standard, extended] Feature set used for grapheme-phoneme conversion. The standard set is the default and comprises a letter window centered on the grapheme to be converted. The extended set additionally includes part of speech and morphological analyses. The extended set is currently available for German and British English only. For connected text the extended feature set generally yields better performance. However, if the input comprises a high amount of proper names provoking erroneous part of speech tagging and morphological analyses, then the standard feature set is more robust. iform: [txt, bpf, list, tcf, tg] Accepted input formats for grapheme phoneme conversion: 'txt' indicates normal text input, which will be tokenized before the conversion. 'list' indicates a sequence of unconnected words that does not need to be tokenized. Furthermore, 'list' requires a different part-of-speech tagging strategy than 'txt' for the extraction of the 'extended' feature set (see parameter 'featset'). 'tcf' indicates that the input format is TCF containing at least a tokenization dominated by the element 'tokens'. 'tg' indicates TextGrid input. Long and short formats are supported. For TextGrid input, the name of the item containing the words to be transcribed must additionally be specified by the parameter 'tgitem'. In combination with 'bpf' output format, 'tg' input additionally requires the specification of the sample rate by the parameter 'tgrate'. Input format 'bpf' indicates BAS Partitur file input containing an ORT tier to be transcribed.
------------------------- Connected input text ('txt') will be (word-)tokenized and (partially) normalized before it is converted into phonemic symbols. In the following we list the most important conversions done on the text input: - all non-alphanumeric characters (including '$' and '€') are deleted, except '-', '.' and ',' in connection with digits. - all forms of single apostrophes are deleted, except for the languages ita, fra and ltz, in which d' D' l' L' preceding a word (e.g. l'aqua) are split from the word and treated as extra tokens (e.g. l'aqua will be l' + aqua); note that there are many more cases of apostrophe usage where this split is not done. - other punctuation marks and brackets: are deleted. - if option 'Keep annotation = yes': expressions within '<>' brackets are protected and passed as is to the output, e.g. '' will appear as '' in the phonemic transcription. White space characters (blanks, tabs etc.) are not allowed within the '<>' brackets; if they are necessary, replace them with '_'. - numerals: are converted into number words, e.g. '5' --> 'five', '12' --> 'twelve', '23' --> 'twenty-three'. - single small and capital characters: are spelled out, e.g. 'b C g' --> /bi: zi: dZi:/. - strings of capital characters: are spelled out, e.g. 'USA' --> /ju:eseI/. If option 'Text normalization = yes' the following extra rules apply (only for languages deu-DE and eng-GB): - Many special characters such as '$' '€' '£' etc. are spelled out as 'Dollar' 'Euro' 'Pfund/Pound'. Often this depends on the context, e.g. a '.' can be translated as 'dot' within a URL but ignored otherwise. - special characters that can be expanded: % & $ § @ = € £ ₤ ¼ ½ ¾ © ° + < > ≤ ≥ - characters ² ³ , . / \ : _ ~ are sometimes expanded in special contexts of equations, units, URLs etc. - special numeric expressions such as date, time, amounts, ordinal numbers are translated correctly, e.g. '5.
January 1923' --> 'fifth January nineteen-twentythree', '23€' --> 'twentythree Euro', '$30' --> 'thirty dollars', 'Clemens X' --> 'Clemens tenth', '10:15' --> 'a-quarter-past-ten'. - strings of capital characters that can be pronounced as words ('acronyms') sometimes are not spelled but spoken as a word: 'ESSO' --> /?E:sO/. Since plain text files can have different encodings, BOMs, line terminators etc., it is highly recommended to run input text files through the service 'TextEnhance' before feeding them into G2P (the 'Pipeline' services do that automatically); this service also allows the correct bracketing of linguistic markers and comment lines so that they can be passed through the pipeline and are not interpreted as being spoken. Special languages: Thai, Russian and Georgian expect their respective standard alphabets; Japanese allows Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; Swiss German expects input to be transcribed in 'Dieth' (https://en.wikipedia.org/wiki/Swiss_German); Australian Aboriginal languages (including Kunwinjku, Yolnu Matha) expect so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); Persian accepts a romanized transcript developed by Elisa Pellegrino and Hama Asadi (see ... for details). except: name of an exception dictionary file overwriting the g2p output. Format: 2 semicolon-separated columns 'word;transcription' (transcription in X-SAMPA). Phonemes are blank-separated. Example: sagt;z ' a x t. embed: [no, maus] Macro option for embedding G2P into WEBMAUS. If set to 'maus', it overwrites several basic options as follows: 'stress', 'syl', and 'align' are set to 'no'. 'oform' is set to 'bpfs'. 'outsym' is set to 'maus-sampa'. Small single letters are transcribed as word fragments instead of being spelled out.
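The exception dictionary format described for the option 'except' above (two semicolon-separated columns, blank-separated phonemes) can be produced with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and writing one entry per line is an assumption, since the description does not state the line layout explicitly. The single entry is the example given above.

```python
def exception_line(word, phonemes):
    """Format one entry as 'word;p1 p2 p3' with blank-separated phonemes."""
    if ";" in word:
        # The semicolon is the column separator, so it cannot be part of a word.
        raise ValueError("word must not contain ';': %r" % word)
    return word + ";" + " ".join(phonemes)


def build_exception_dictionary(entries):
    """Join entries into file content, one entry per line (assumed layout)."""
    return "\n".join(exception_line(w, p) for w, p in entries) + "\n"


# The entry from the parameter description: sagt;z ' a x t
entries = [("sagt", ["z", "'", "a", "x", "t"])]
dictionary_text = build_exception_dictionary(entries)
print(dictionary_text, end="")
```

The resulting text would be saved as a UTF-8 file and passed to the service via the 'except' form field.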
oform: [txt, tab, exttab, lex, extlex, bpf, bpfs, extbpf, extbpfs, tcf, exttcf, tg, exttg] Output format: 'bpf' indicates the BAS Partitur Format (BPF) file with an ORT/KAN tier. The tiers contain a table with 3 columns and one line per word in the input. Column 1 is always 'ORT:/KAN:'; column 2 is an integer starting with 0 denoting the word position within the input; column 3 contains for ORT the (possibly normalized) orthographic word, for KAN the canonical pronunciation of the word coded in SAM-PA (or IPA); the latter does not contain blanks. 'bpfs' differs from 'bpf' only in that the phonemes in KAN are separated by blanks. In case of TextGrid input, both 'bpf' and 'bpfs' require the additional parameters 'tgrate' and 'tgitem'. Additionally, the content of the TextGrid tier 'tgitem' is stored as a word chunk segmentation in the BPF tier TRN. 'extbpf' or 'extbpfs' extend the BPF output file by the tiers POS (part of speech, STTS tagset), KSS (full phonemic transcript including e.g. lexical accent), TRL (orthographic transcript with punctuation), and MRP (morph segmentation and classes). 'txt' causes a replacement of the input words by their phonemic transcriptions; single line output without punctuation, where phonemes are separated by blanks and words by tabulators. 'tab' returns the grapheme phoneme conversion result in form of a table with two columns. The first column comprises the words, the second column their blank-separated transcriptions. 'exttab' results in a 5-column table. The columns contain from left to right: word, transcription, part of speech, morpheme segmentation, and morpheme class segmentation. 'lex' transforms the table to a lexicon, i.e. words are unique and sorted. 'extlex' provides the same information as 'exttab' in a unique and sorted manner. For all lex and tab outputs columns are separated by ';'. If option 'align' is switched on, the first (word) column is letter-segmented.
'tcf' creates either a tcf output file from scratch (in case iform is not 'tcf'), or a transcription tier is added to the input tcf file. If a tcf file is generated from scratch, it contains the elements 'text', 'tokens', and 'BAS_TRS' for the phonemic transcription. oform 'exttcf' additionally adds the elements 'BAS_POS' (part of speech, STTS tagset), 'BAS_MORPH' (morph segmentation), and 'BAS_MORPHCLASS' (morph classes). 'tg' and 'exttg' produce TextGrid output; for this a TextGrid input (iform 'tg') is required. With 'tg' the tier 'BAS_TRS' (phonemic transcript) is inserted into the TextGrid and runs parallel to the tier specified by the parameter 'tgitem'; words are separated by a '#' symbol. 'exttg' adds the tiers 'BAS_POS', 'BAS_MORPH', and 'BAS_MORPHCLASS' parallel to 'BAS_TRS'. Their content is the same as for oform 'exttcf'. The 'extended' oform versions 'exttab', 'extlex', 'exttcf', and 'exttg' are only available for languages deu|eng-*|aus|nze|use; for the other languages these formats are replaced by the corresponding non-extended format. The output contains punctuation for 'exttab', 'tcf', 'exttcf', and 'exttg'; for the other formats punctuation is ignored. Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful or not, "downloadLink" specifies the location where the file containing the phonemic transcription in SAM-PA (segmented in words) can be found (the format of the file depends on the option selected in oform), "output" contains the output that is mostly useful for debugging errors, and "warnings" contains any warnings that occurred during the processing. The format of the output file depends on the value of input parameter oform.
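The three-column ORT/KAN layout described for oform 'bpf' and 'bpfs' above can be read back with a short parser. This is a sketch, not a full BPF reader: the function name and the sample content are hypothetical, only ORT and KAN lines are considered, and all other lines of the file are skipped.

```python
def parse_ort_kan(bpf_text):
    """Collect ORT and KAN entries per word index from BPF text.

    Each relevant line has three columns: tier name ('ORT:' or 'KAN:'),
    integer word index starting at 0, and the word or its pronunciation.
    For 'bpfs' the third column contains blanks, so we split at most twice.
    """
    words = {}
    for line in bpf_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] not in ("ORT:", "KAN:"):
            continue  # not an ORT/KAN line, ignored in this sketch
        tier, index, value = parts
        words.setdefault(int(index), {})[tier.rstrip(":")] = value
    return [words[i] for i in sorted(words)]


# Hypothetical 'bpfs'-style content for a single input word.
SAMPLE_BPF = "ORT: 0 sagt\nKAN: 0 z a x t\n"
print(parse_ort_kan(SAMPLE_BPF))
```

Splitting at most twice keeps the blank-separated phonemes of 'bpfs' together in one field; for plain 'bpf' the third column simply contains no blanks.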
---------------------------------------------------------------- ---------------------------------------------------------------- runCOALA ------------------ Description: Generates corpus and session CMDIs according to the media-corpus-profile and the media-session-profile of the ComponentRegistry by converting five CSV tables to the CMDI format. Use the runCOALAGetTemplates WebService to get templates for these tables. The resulting session CMDIs can be used as they are, while the corpus CMDI needs to be edited by hand. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F writtenresources-table=@ -F mediafiles-table=@ -F corpus-title= -F bundles-table=@ -F corpus-name= -F sessions-table=@ -F actors-table=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runCOALA' Parameters: writtenresources-table mediafiles-table [corpus-title] bundles-table [corpus-name] sessions-table actors-table Parameter description: writtenresources-table: Assigns the file as the WrittenResources (i.e. Annotations) table. mediafiles-table: Assigns the file as the MediaFiles table. corpus-title: The corpus title or long name. Except for your institution you should not use any abbreviations. You may use white spaces here. bundles-table: Assigns the file as the Bundles table. corpus-name: The short code name of the corpus (no spaces allowed). sessions-table: Assigns the file as the Sessions table. actors-table: Assigns the file as the Actors (e.g. Speakers, Signers, ...) table. Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful or not, "downloadLink" specifies the location where the generated zip file (containing CMDI files) can be found, "output" contains the output that is mostly useful for debugging errors, and "warning" contains any warnings that occurred during the processing.
---------------------------------------------------------------- ---------------------------------------------------------------- runASR ------------------ Description: Automatic transcription of a speech signal using several third party ASR services (experimental). By using this service you indemnify and hold the BAS harmless from any claim arising out of the use of these third party webservices. Note that ASR services support different sets of languages; if you select an ASR service and a language code that are not compatible, an ERROR will be returned. Also note that ASR services have different quota limitations; if the service returns a quota violation ERROR, you might consider trying a different ASR service or contacting the BAS for an extended user account. This service is experimental and can be terminated at any time without warning. It is restricted to academic use; therefore this service cannot be called as a RESTful service like other BAS services, and the Web API to this service is protected by AAI Shibboleth authentication. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F speakMatch= -F OUTFORMAT=bpf -F USEREMAIL= -F ASRType=autoSelect -F diarization=true -F numberSpeakDiar=0 -F USEWORDASTURN=false -F TROSpeakerID=false -F ACCESSCODE= 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runASR' Parameters: SIGNAL [LANGUAGE] [speakMatch] [OUTFORMAT] [USEREMAIL] [ASRType] [diarization] [numberSpeakDiar] [USEWORDASTURN] [TROSpeakerID] [ACCESSCODE] Parameter description: SIGNAL: Input signal file that contains the spoken text to be transcribed. Accepted file formats are *.wav (WAVE RIFF), *.nis|nist|sph (NIST SPHERE), *.mpeg|mpg (Video, several codecs) and *.mp4 (MPEG4) and all formats supported by the service 'AnnotConv'. File format will be determined by extension only.
LANGUAGE: [afr-ZA, sqi-AL, amh-ET, ara-DZ, ara-BH, ara-EG, ara-IQ, ara-IL, ara-JO, ara-KW, ara-LB, arb, ara, ara-MA, ara-OM, ara-QA, ara-SA, ara-PS, ara-TN, ara-AE, hye-AM, aze-AZ, eus-ES, ben-BD, ben-IN, bul-BG, mya-MM, cat-ES, yue-HK, cmn-CN, cmn-TW, hrv-HR, ces-CZ, dan-DK, nld-BE, nld-NL-GN, nld-NL-OH, nld-NL-PR, nld-NL, eng-AU, eng-CA, eng-GH, eng-GB, eng-IN, eng-IE, eng-KE, eng-NG, eng-NZ, eng-PH, eng-SC, eng-SG, eng-ZA, eng-TZ, eng-US, est-EE, fil-PH, fin-FI, fra-CA, fra-FR, glg-ES, kat-GE, deu-AT, deu-DE, deu-DE-OH, deu-CH, ell-GR, guj-IN, heb-IL, hin-IN, hun-HU, isl-IS, ind-ID, ita-IT, jpn-JP, jav-ID, kan-IN, khm-KH, kor-KR, lao-LA, lav-LV, lit-LT, mkd-MK, mal-IN, msa-MY, mar-IN, mon-MN, nep-NP, nob-NO, fas-IR, pol-PL, por-BR, por-PT, pan-guru-IN, ron-RO, rus-RU, srp-RS, sin-LK, slk-SK, slv-SI, spa-AR, spa-BO, spa-CL, spa-CO, spa-CR, spa-DO, spa-EC, spa-SV, spa-GT, spa-HN, spa-MX, spa-NI, spa-PA, spa-PY, spa-PE, spa-PR, spa-ES, spa-US, spa-UY, spa-VE, sun-ID, swa-KE, swa-TZ, swe-SE, tam-IN, tam-MY, tam-SG, tam-LK, tel-IN, tha-TH, tur-TR, ukr-UA, urd-IN, urd-PK, uzb-UZ, vie-VN, zul-ZA] Language of the speech to be recognized; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [- iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'. Some ASR services distinguish not by region but by the language model applied, i.e. the third part of the RFC5646 sub-structure is not a (country-specific) region code but a proprietary code for a language domain, e.g. 'nld-NL-OH' is the Dutch language model optimized for Oral History speech data.
Special languages: Thai, Russian and Georgian expect their respective standard alphabets; Japanese allows Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; Swiss German expects input to be transcribed in 'Dieth' (https://en.wikipedia.org/wiki/Swiss_German); Australian Aboriginal languages (including Kunwinjku, Yolnu Matha) expect so-called 'Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); Persian accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details). speakMatch: Option speakMatch: if set to a list of comma-separated names (e.g. speakMatch='Anton,Berta,Charlie'), the corresponding speaker labels found by the service in the order of appearance are replaced by these names (e.g. 'S1' by 'Anton', 'S2' by 'Berta' etc.). This allows the user to create an SD annotation using self-defined speaker labels, provided that the user knows the order of appearance. This feature only makes sense in single file processing, since the speaker labels and the order of appearance differ from one recording to the next. The suggested mode of operation is to run the service in batch mode over all recordings with speakMatch="", then manually inspect the resulting annotation and define speaker labels in the order of appearance for each recording, and then run the service in single file mode for each recording again with the corresponding speakMatch list. If the speakMatch option contains a comma-separated list of value pairs like 'S1:Anton', only the speaker labels listed on the left-hand side of each pair are patched, e.g. for speakMatch='S3:Charlie,S6:Florian' only the third and sixth appearing speakers are renamed to Charlie and Florian respectively.
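The two speakMatch forms described above (a plain name list, or 'label:name' pairs) can be illustrated with a small local sketch. It only mimics the documented renaming on a list of labels and is not the service's implementation; the function names are hypothetical, and the handling of a list that mixes both forms is an assumption.

```python
def parse_speak_match(spec):
    """Turn a speakMatch string into a mapping from 'S1', 'S2', ... to names."""
    mapping = {}
    items = [item.strip() for item in spec.split(",") if item.strip()]
    for position, item in enumerate(items, start=1):
        if ":" in item:
            # Pair form, e.g. 'S3:Charlie': only the named label is patched.
            label, name = item.split(":", 1)
            mapping[label.strip()] = name.strip()
        else:
            # Plain form: the n-th name replaces the n-th appearing speaker.
            mapping["S%d" % position] = item
    return mapping


def rename_speakers(labels, spec):
    """Apply the mapping; labels without an entry stay unchanged."""
    mapping = parse_speak_match(spec)
    return [mapping.get(label, label) for label in labels]


print(rename_speakers(["S1", "S2", "S1", "S3"], "Anton,Berta,Charlie"))
print(rename_speakers(["S1", "S3", "S6"], "S3:Charlie,S6:Florian"))
```

With speakMatch="" the labels pass through unchanged, which corresponds to the suggested first batch run before the speaker order has been inspected.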
OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, txt, native] Format of result file: 'txt' : simple text file with one line of recognized text. 'bpf' : BAS Partitur Format with tiers ORT, TRO, SPK (if the service delivers) and WOR (if the service delivers). 'TextGrid' : praat compatible annotation file. 'emuDB' : EMU-SDMS annotation file *_annot.json with level ORT (ITEM), SPK and WOR (SEGMENT). 'native' : original JSON/XML response file of the service (if provided). 'csv' : CSV spreadsheet table with flattened hierarchy entries, columns BEGIN and DURATION in sample counts. 'eaf' : ELAN compatible XML annotation file. 'exb' : EXMARaLDA compatible XML annotation file. 'tei' : ISO TEI document. Note that some of these formats will cause an error if the selected ASR service does not provide any segmental information (such as a word segmentation). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not lossless; select 'bpf' to be sure that no information is lost. USEREMAIL: Option USEREMAIL: if a valid email address is provided through this option, the service will send the XML file containing the results of the service run to this address after completion.
It is recommended to set this option for long recordings (batch size <6, length >1h) since it is often problematic to wait for service completion over an unstable internet connection or from a laptop that might go into hibernation. The email address provided is not stored on the server. It is sometimes even advisable to close the browser tab after starting the call and wait for the result emails (only for batch size <6!). Beware: the download link to your result(s) will be valid for 24h after you receive the email; after that all your data will be purged from the server. Disclaimer: the usage of this option is at your own risk; the key URL to download your result file will be sent without encryption in this email; be aware that anybody who can intercept this email will be able to access your result files using this key; the BAS at LMU Munich will not be held responsible for any security breach caused by using this email notification option. ASRType: [autoSelect, callAmberscriptASR, callEMLASR, callFraunhoferASR, callGoogleASR, callLSTDutchASR, callLSTEnglishASR, callWatsonASR, callUWEBASR, callWhisperXASR, allServices] Name of the ASR service applied. If set to 'autoSelect', the service will select the next available ASR service that supports the LANGUAGE; if set to 'allServices', the service will send the input signal to all ASR services that support LANGUAGE and output the ASR results in simple txt format. Please note that in some cases (see details in service manual) your input signal is sent to a third party ASR service which is not a part of BAS. By selecting a third party service you accept the end user license agreement of this service (as posted on the Web API of BAS ASR service manual) and agree that your signals are sent to the selected service.
Be advised that some of these services store input signals to improve their ASR performance, and that several restrictions (service dependent quotas) apply to the number and amount of input signals (see the ASR service manual on the BAS Web API for details). Some ASR services only allow asynchronous processing, which means that the response time can be up to several minutes. If you need service capacity exceeding the standard quotas for a specific ASR service, please contact the BAS for special arrangements. diarization: [true, false] If set to true (default: false), the ASR service will label each word in the result with a speaker label (BPF tier SPK); speaker labels are 'S1', 'S2', etc. with ascending numbers in the order of appearance in the signal file. If the selected ASR service does not support speaker diarization, a WARNING is issued. Currently the service IBM Watson supports diarization for the languages eng-US, jpn-JP and spa-ES; the service EML supports diarization for all languages; the service Google Cloud supports diarization with presetting the number of speakers for about 20 out of 120 languages. See also option 'numberSpeakDiar'. numberSpeakDiar: [0.0, 100.0] If set to a value greater than 1, the speaker diarization is restricted to results with this number of speakers; this significantly improves results. Note that not all ASR services offer this option; see service manual for details. If set to 0 or if the service does not offer this option, the service determines the number of speakers automatically. USEWORDASTURN: [true, false] If set to true (default: false), and if the selected ASR service delivers a valid word segmentation (tier WOR), this word segmentation is encoded as a chunk segmentation in the output (tier TRN) instead of the (possible) result of a speaker diarization (default).
Both the speaker diarization (which is basically a turn segmentation) and the word segmentation, when used as a chunk segmentation input to MAUS, might improve the phonetic alignment of MAUS, since they act as fixed time anchors for the MAUS segmentation process. In some cases the word segmentation as time anchors yields better results (simply because there are more of them and a gross misalignment of MAUS is less likely); sometimes the chosen ASR service does not deliver a speaker diarization, in which case this option allows switching to the word segmentation (which is delivered by all ASR services). TROSpeakerID: [true, false] If set to true (default: false), and if the selected ASR service delivers a valid speaker diarization (tier SPK) and a TRO tier, the service will insert a speaker ID label 'XXX: ' in the TRO tier before each word that starts a new speaker turn of the speaker labelled 'XXX'. The inserted speaker label 'XXX' is either one of the standardized labels 'S1', 'S2', ... or a mapped speaker label taken from the option 'speakMatch'. The service also checks each word preceding a speaker turn change (the last word of the previous turn) and adds a trailing '.', if the word does not already have a trailing final punctuation sign (one of '!?.:...). This option enables pipelines that start with 'ASR' and end with 'SUBTITLE' to create subtitle tracks (e.g. WebVTT) that show the speaker ID at speaker changes, and that start a new subtitle at each speaker turn change. ACCESSCODE: Exceed quota code (ACCESSCODE): special code a user has acquired to override default quotas. Not needed for normal operation. Output: An XML response containing the elements "success", "downloadLink", "output" and "warning".
"success" states if the processing was successful or not, "downloadLink" specifies the location where the file containing the resulting transcription (segmented in words) can be found (the format of the file depends on the option selected in parameter OUTFORMAT), "output" contains the output that is mostly useful during debugging errors, and "warnings" contains any warnings that occured during the processing. ---------------------------------------------------------------- ---------------------------------------------------------------- runSubtitle ------------------ Description: This service maps the result of a MAUS process (a word/phone segmentation) or the result of an ASR (word segmentation) to the original transcript and groups the transcript into subtitles. The service can be used to automatically create a subtitle track from a signal (+ text); it is recommended to use the service Pipeline with parameter PIPE=G2P_(CHUNKER)_MAUS_SUBTITLE (with text input) or PIPE=ASR_SUBTITLE (without text input). Alternatively, this service can be used to map a transcript in arbitrary format (e.g. containing non-normalized words or punctuations) to a MAUS segmentation. The latter is useful, if the word normalisation/tokenisation changes the original transcript for the MAUS segmentation, but you need for instance punctuations for your analysis of the MAUS segmentation. If the service reads the original transcript, this transcript file is piped through the service runTextEnhance (webinterface: TextEnhance) before aligning it to the result of runMAUS; this ensures that results from a Pipeline run (in which runTextEnhance is always applied to the text input) can be processed using the same text input format (e.g. RTF). 
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F transcription=@ -F maxlength=0 -F marker=punct -F bpf=@ -F outformat=bpf+trn 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runSubtitle' Parameters: [transcription] [maxlength] [marker] bpf [outformat] Parameter description: transcription: (Non-normalized) transcription of the recording to be segmented into subtitles (usually this file is the input of the earlier Pipeline G2P_... process). All formats that can be converted by service TextEnhance are accepted; content is non-normalized text. For example, the transcript could contain numerals, abbreviations or punctuation that are all retained in the subtitles, while the output of a runMAUS process contains only the normalized text stripped of punctuation. Note that this input can be omitted; then the subtitles are derived from the TRO tier of the BPF input, or - if TRO does not exist - from the ORT tier. maxlength: [0.0, 999.0] Maximum subtitle length. If set to 0, subtitles of indefinite length are created, based only on the distance of the split markers. If set to a value greater than 0, subtitles are split whenever a stretch between two neighbouring split markers is longer than that value (in words). Caution: This may lead to subtitle splits in suboptimal locations (e.g. inside syntactic phrases). left-bracket: One or more characters which mark comments reaching until the end of the line (default: NONE). E.g. if your input transcript contains comment lines that begin with ';', set this option to ';' so that these comments are not treated as spoken text. If you want to suppress the masking of comment lines, set this option to 'NONE' (default). If you are using comment lines in your input text, you must be absolutely sure that the comment character appears nowhere in the text except in comment lines! Note 1: the characters '&', '|' and '=' do not work as comment characters.
Note 2: for technical reasons the value for this option cannot be empty. Note 3: the default character '#' cannot be combined with other characters, e.g. if you define this option as ';#', the '#' will be ignored. Note 4 (sorry): for the service 'Subtitle' comment lines must be terminated with a so-called 'final punctuation sign', i.e. one of '.!?:…'; otherwise, an immediately following speaker marker will not be recognized. marker: [punct, newline, tag] Marker used to split transcription into subtitles. If set to 'punct' (default), the transcription is split after 'terminal' punctuation marks (currently [.!?:…]). If set to 'newline', the transcription is split at newlines (\n or \r\n). If set to 'tag', the program expects a special <BREAK> tag inside the transcription. bpf: Phonemic transcription of the recording to be mapped to subtitles (usually this file is the output of a runMAUS or a runASR process). Format is a BAS Partitur Format (BPF) file with at least an ORT and a MAU or WOR tier. The ORT tier contains a table with 3 columns and one line per (tokenized) word in the video. Column 1 is always 'ORT:'; column 2 is an integer starting with 0 denoting the word position within the input; column 3 contains the (normalized) orthography of the word coded in UTF-8. The MAU/WOR tier is a table with 5 columns containing the segmentation and labeling of phones/words: column 1 is always 'MAU:' or 'WOR:'; column 2 is the begin of a segment in samples from the start of the recording (0); column 3 contains the duration of the segment in samples minus 1; column 4 contains the word number to which this segment belongs (see tier ORT); column 5 contains the SAMPA/IPA encoding of the phone or the word label. If the input BPF contains a TRO (tokenized original text) tier, the input of the original transcript can be omitted. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF.
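The column layout of the ORT and MAU/WOR tiers described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_bpf and the sample lines are invented for illustration, and header lines or other tiers of a real BPF file are simply skipped.

```python
def parse_bpf(text):
    """Collect ORT words and MAU/WOR segments from the text of a BPF file."""
    words = {}      # word index -> (normalized) orthography
    segments = []   # one dict per MAU/WOR line
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "ORT:" and len(parts) >= 3:
            words[int(parts[1])] = " ".join(parts[2:])
        elif parts[0] in ("MAU:", "WOR:") and len(parts) >= 5:
            start = int(parts[1])
            stored_duration = int(parts[2])   # duration in samples minus 1
            segments.append({
                "tier": parts[0][:-1],
                "start": start,
                "n_samples": stored_duration + 1,
                "word": int(parts[3]),
                "label": " ".join(parts[4:]),
            })
    return words, segments


# Invented sample content, not taken from a real recording.
sample = """ORT: 0 guten
ORT: 1 Tag
MAU: 0 799 0 g
MAU: 800 1599 0 u:
MAU: 2400 399 1 t
"""
words, segments = parse_bpf(sample)
```

Because column 3 stores the duration minus 1, the number of samples in a segment is the stored value plus 1, and the last sample of the segment is start plus the stored value.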
replace-whitespace-char: The character that whitespace in comments and annotation markers should be substituted by (default: NONE). The BAS WebServices require that annotation markers or comment lines in the input transcript do not contain white spaces. This option lets you decide which character should be used to replace the white spaces; the most common character used for this purpose is '_' (this is the default in Pipelines). If set to the string 'NONE' no replacement takes place (default). CAUTION: the characters '&' and '=' do not work as replacements. outformat: [srt, sub, vtt, bpf+trn] Output format. 'srt', 'sub' or 'vtt' denote 'SubRip', 'SubViewer' or 'WebVTT' subtitle format respectively. 'bpf+trn' (default) denotes a BAS Partitur Format file (BPF, *.par, copied from input) with an added TRO tier that maps the original transcript to the tokenized and word-normalized ORT tier, and an added TRN tier (existing TRN tiers are over-written!) that corresponds to subtitles. If output format is 'vtt' and a subtitle starts with a speaker marker of the form '<...>', a 'v ' is inserted before the '...'. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for details about the BPF format. brackets: One or more pairs of characters which bracket annotation markers in the input transcript. E.g. if your input transcript contains markers like '{Lachen}' and '[noise]' that should be passed as markers and not as spoken text, set this option to '{}[]'. Note that the replacement of blanks within such markers (see option 'replace-whitespace-char') only takes place in markers/comments that are defined here. Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful or not, "downloadLink" specifies the location where the output file is provided; depending on parameter 'outformat' this can be a BPF file (*.par), a SubRip subtitle format (*.srt), or a SubViewer subtitle format (*.sub).
The BPF contains the content of the input BPF (option "bpf") with appended TRO and TRN tiers (existing TRO/TRN tiers in the BPF input are over-written). The TRO tier contains the mapping from the ORT tier to the input transcription; the TRN tier contains the subtitle grouping. ---------------------------------------------------------------- ---------------------------------------------------------------- runSpeakDiar ------------------ Description: This service reads a media file (sound, video) and performs a speaker diarization (SD) based on the pyannote 2 python library. Website: https://github.com/pyannote; Paper: https://dx.doi.org/10.1109/ICASSP40776.2020.9052974. The service is a wrapper around the pyannote2.0 library [1][2], which has proven to be the most reliable open source diarization model at the time of testing (2023). The library applies pretrained models for voicing segmentation (trained on dihard3: https://dihardchallenge.github.io/dihard3/index) and overlap detection to pre-segment the data. The voiced segments are subsequently converted into speaker-identifying embeddings using the public ECAPA-TDNN architecture from speechbrain [3] (trained on voxceleb 1+2: https://www.robots.ox.ac.uk/~vgg/data/). These embeddings are then clustered using hierarchical agglomerative clustering to find speaker segments likely belonging to the same person. Finally the embeddings are mapped back into the time domain, which yields the final diarization output.
More details can be found in the pyannote pipeline’s technical report here: https://huggingface.co/pyannote/speaker-diarization Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F speakMatch= -F speakNumber=0 -F OUTFORMAT=bpf -F SAMPLERATE=1 -F TEXT=@ -F minSpeakNumber=0 -F maxSpeakNumber=0 -F allowOverlaps=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runSpeakDiar' Parameters: SIGNAL [speakMatch] [speakNumber] [OUTFORMAT] [SAMPLERATE] [TEXT] [minSpeakNumber] [maxSpeakNumber] [allowOverlaps] Parameter description: SIGNAL: Required input SIGNAL: sound or video file containing the speech signal to be speaker diarized. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), all media formats that are supported by BAS WebService AudioEnhance are accepted. speakMatch: Option speakMatch: if set to a list of comma separated names (e.g. speakMatch='Anton,Berta,Charlie'), the corresponding speaker labels found by the service in the order of appearance are replaced by these names (e.g. 'S1' to 'Anton', 'S2' to 'Berta' etc.). This allows the user to create an SD annotation using self-defined speaker labels, provided the user knows the order of appearance. This feature only makes sense in single file processing, since the speaker labels and the order of appearance differ from one recording to the next; the suggested mode of operation is to run the service in batch mode over all recordings with speakMatch="", then manually inspect the resulting annotation and define speaker labels in the order of appearance for each recording, and then run the service in single file mode for each recording again with the corresponding speakMatch list. If the speakMatch option contains a comma separated list of value pairs like 'S1:Anton', only the speaker labels listed on the lefthand side of each pair are patched, e.g.
for speakMatch='S3:Charlie,S6:Florian' only the third and sixth appearing speaker are renamed to Charlie and Florian respectively. speakNumber: [0.0, 999999.0] Option speakNumber restricts the number of detected speakers to the given number. If set to 0 (default), the SD method determines the number automatically. OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei] Option 'Output format' (OUTFORMAT): Defines the possible output formats: TextGrid - a praat compatible TextGrid file; bpf - a (input) BPF file with new (or replaced) tier(s) SPD (and SPK if BPF was input); csv - a spreadsheet (CSV table) that contains the most important information; emuDB - an Emu compatible *_annot.json file; eaf - an ELAN compatible annotation file; exb - an EXMARaLDA compatible annotation file; tei - Iso TEI document (XML). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not loss-less; select 'bpf' to be sure that no information is lost. SAMPLERATE: [0.0, 999999.0] Option SAMPLERATE of signal file: if the sample rate cannot be determined automatically from SIGNAL, you can provide the sampling rate via this option. Usually you can leave it at the default value of 1.
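The two forms of the speakMatch option described above (a plain list of names, or 'label:name' pairs) can be illustrated with a small Python sketch. The function apply_speak_match is hypothetical and only mimics the documented renaming on a list of labels; it is not the service's implementation, and how the service treats a mix of both forms is not documented here, so the sketch simply handles each entry on its own.

```python
def apply_speak_match(labels, speak_match):
    """Rename standard speaker labels ('S1', 'S2', ...) as speakMatch describes."""
    entries = [entry.strip() for entry in speak_match.split(",") if entry.strip()]
    mapping = {}
    for position, entry in enumerate(entries, start=1):
        if ":" in entry:
            # Pair form, e.g. 'S3:Charlie': only the named label is patched.
            old, new = entry.split(":", 1)
            mapping[old.strip()] = new.strip()
        else:
            # List form: the n-th name replaces the n-th appearing speaker 'Sn'.
            mapping[f"S{position}"] = entry
    return [mapping.get(label, label) for label in labels]


detected = ["S1", "S2", "S1", "S3"]
by_order = apply_speak_match(detected, "Anton,Berta,Charlie")
by_pairs = apply_speak_match(detected, "S3:Charlie,S6:Florian")
unchanged = apply_speak_match(detected, "")
```

With the pair form, labels that are not listed keep their standard names, which matches the 'S3:Charlie,S6:Florian' example in the option description.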
TEXT: Optional BPF input: BAS Partitur Format (BPF) file (*.par or *.bpf) to which the SD result is appended and which is then copied to output (possibly converted to another format). If the BPF contains a word segmentation (tier ORT/MAU), the service matches the SD result against the word segmentation and creates a word-wise SD labelling (SPK tier) based on maximum overlap. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for detailed description of the BPF. minSpeakNumber: [0.0, 999999.0] Option minSpeakNumber defines a hard lower bound of the number of detected speakers. If set to 0 (default), no lower bound. maxSpeakNumber: [0.0, 999999.0] Option maxSpeakNumber defines a hard upper bound of the number of detected speakers. If set to 0 (default), no upper bound. allowOverlaps: [true, false] Option allowOverlaps: If set to true, the un-altered output of PyAnnote is returned in the SPD tier (note that overlaps cannot be handled by most annotation formats; only use if you really need to detect overlaps!); if set to false (default), overlaps, missing silence intervals etc. are resolved in the output tier SPD, making this output compatible with all annotation formats. The postprocessing works as follows: 1. all silence intervals are removed. 2. all speaker segments that are 100% within another (larger) speaker segment are removed. 3. If an overlap occurs the earlier segment(s) are truncated to the start of the new segment. 4. all remaining gaps in the segmentation are filled with silence intervals. Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the result file can be found which contains the speaker diarization result. The format of the annotation file depends on the option selected in OUTFORMAT.
"output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during the processing. ---------------------------------------------------------------- ---------------------------------------------------------------- runMAUSGetInventar ------------------ Description: Returns the available phonemic input inventory (in SAMPA) for a given language. Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetInventar?LANGUAGE=deu-DE' Parameters: [LANGUAGE] Parameter description: LANGUAGE: [aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-BE, nld-NL, nor-NO, eng-US, eng-AU, eng-GB, eng-SC, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nan-TW, fas-IR, pol-PL, por-PT, ron-RO, rus-RU, spa-ES, swe-SE, tha-TH, guf-AU] Language of the phoneme symbol set; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'. Output: List of accepted input phonemic SAM-PA symbols for the selected language; one symbol per line; this can be used by calling applications to pre-test the transcription input to the runMAUS service for faulty symbols. ---------------------------------------------------------------- ---------------------------------------------------------------- runMAUS ------------------ Description: Segments a media file into phonetic and word segments given a tokenized phonemic transcription as input. This service allows the usage of all possible options of the MAUS program.
The service creates a stochastic, language specific pronunciation model derived from the canonical input transcript and then combines this model with a phonetic Hidden Markov Model trained on the language to decode the most likely segmentation and labelling. See the section Input for a detailed description of all options or use the operation 'runMAUSGetHelp' to download a current version of the MAUS documentation. Note that this service does not process text files (*.txt) as an input, but rather BAS Partitur Format (BPF, *.par, see https://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html#Partitur for details) or CSV tables (*.csv). To process text input use either the service runMAUSBasic, or - in case you require some options that are only available for runMAUS - use the operation 'Pipeline' with the PIPE=G2P_MAUS. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F MODUS=default -F INSKANTEXTGRID=true -F RELAXMINDUR=false -F OUTFORMAT=TextGrid -F TARGETRATE=100000 -F ENDWORD=999999 -F RELAXMINDURTHREE=false -F STARTWORD=0 -F INSYMBOL=sampa -F PRESEG=false -F USETRN=false -F BPF=@ -F MAUSSHIFT=default -F INSPROB=0.0 -F INSORTTEXTGRID=true -F OUTSYMBOL=sampa -F RULESET=@ -F MINPAUSLEN=5 -F WEIGHT=default -F NOINITIALFINALSILENCE=false -F ADDSEGPROB=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUS' Parameters: SIGNAL [LANGUAGE] [MODUS] [INSKANTEXTGRID] [RELAXMINDUR] [OUTFORMAT] [TARGETRATE] [ENDWORD] [RELAXMINDURTHREE] [STARTWORD] [INSYMBOL] [PRESEG] [USETRN] BPF [MAUSSHIFT] [INSPROB] [INSORTTEXTGRID] [OUTSYMBOL] [RULESET] [MINPAUSLEN] [WEIGHT] [NOINITIALFINALSILENCE] [ADDSEGPROB] Parameter description: SIGNAL: media file containing the speech signal to be segmented; any sampling rate; optimal results if leading and trailing silence intervals are truncated before processing. 
Although the mimetype of this input file is restricted to audio/x-wav (wav|WAV), the service will also process *.nis|nist|sph (NIST SPHERE), *.al|dea (ALAW), *.mpeg|mpg (Video, several codecs) and *.mp4 (MPEG4). File format will be determined by extension only. LANGUAGE: [aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-BE, nld-NL, eng-US, eng-AU, eng-GB, eng-SC, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hun-HU, isl-IS, ita-IT, jpn-JP, sampa, ltz-LU, mlt-MT, nan-TW, nor-NO, fas-IR, pol-PL, por-PT, ron-RO, rus-RU, spa-ES, swe-SE, tha-TH] Option Language (LANGUAGE): Language of the speech to be processed; defines the possible phoneme symbol set in MAUS input and the pronunciation modelling module. RFC5646 sub-structure 'iso639-3 - iso3166-1 [- iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'. The language code 'sampa' (not RFC5646) or 'und' denotes a language independent variant of MAUS for which the SAM-PA symbols in the input BPF must be blank separated (e.g. /h OY t @/). MODUS: [default, standard, align] Option MODUS: Operation modus of MAUS: default is to use the language dependent default modus; the two possible modi are: 'standard', which is the segmentation and labelling using the MAUS technique as described in Schiel ICPhS 1999, and 'align', in which a forced alignment is performed on the input SAM-PA string defined in the KAN tier of the BPF (the same effect as the deprecated former option CANONLY=true). INSKANTEXTGRID: [true, false] OPTION KAN tier in TextGrid (INSKANTEXTGRID): Switch to create an additional tier in the TextGrid output file with a word segmentation labelled with the canonic phonemic transcript (taken from the input KAN tier).
RELAXMINDUR: [true, false] Option Relax Min Duration (RELAXMINDUR) changes the default minimum duration of 3 states for consonants and short/lax vowels and of 4 states for tense/long vowels and diphthongs to 1 and 2 states respectively. This is not optimal for general segmentation because MAUS will start to insert many very short vowels/glottal stops where they are not appropriate. But for some special investigations (e.g. the duration of /t/) it alleviates the ceiling problem at 30msec duration (with standard frame rate of 10msec per state). OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, mau, par] Option 'Output format' (OUTFORMAT): Defines the possible output formats: TextGrid - a praat compatible TextGrid file; bpf - the input BPF file with a new (or replaced) tier MAU; csv - a spreadsheet (CSV table) that contains word and phone segmentation; mau - just the BPF tier MAU (phonetic segmentation); emuDB - an Emu compatible *_annot.json file; eaf - an ELAN compatible annotation file; exb - an EXMARaLDA compatible annotation file; tei - Iso TEI document (XML). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not loss-less; select 'bpf' to be sure that no information is lost.
TARGETRATE: [100000, 20000, 10000] Option Output frame rate (TARGETRATE): the resolution of segment boundaries in output measured in 100nsec units (default 100000 = 10msec). Decreasing this value (min is 10000) increases computation time, does not increase segmental accuracy on average, but allows output segment boundaries to assume more possible values (default segment boundaries are quantized in 10msec steps). This is useful if MAUS results are analysed for duration of phones or syllables. ENDWORD: [0.0, 999999.0] Option End with word (ENDWORD): If set to a value n<999999, this option causes maus to end the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option STARTWORD. RELAXMINDURTHREE: [true, false] Alternative option to Relax Min Duration (RELAXMINDUR): changes the minimum duration for all models to 3 states (= 30msec with standard frame rate). This can be useful when comparing the duration of different phone groups. STARTWORD: [0.0, 999999.0] Option Start with word (STARTWORD): If set to a value n>0, this option causes maus to start the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option ENDWORD. INFORMAT: Deprecated option INFORMAT: Input format is now detected from input file extension. Defines the possible input formats: bpf - a BPF file with (minimum) tier KAN; bpf-sampa - BPF file with KAN tier with blank separated SAM-PA symbols, switches to language independent SAM-PA mode processing; for a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html INSYMBOL: [sampa, ipa] Option Input Encoding (INSYMBOL): Defines the encoding of phonetic symbols in input.
If set to 'sampa' (default), phonetic symbols are encoded in X-SAMPA (with some coding differences in Norwegian/Icelandic); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service expects blank-separated UTF-8 IPA. PRESEG: [true, false] Option Pre-segmentation (PRESEG): If set to true, a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input signal has leading and/or trailing silence. If this option is set in combination with USETRN=true and the input BPF contains a chunk segmentation (tier TRN), then the presegmentation is carried out for every single chunk. USETRN: [true, false, force] Option Chunk segmentation (USETRN): If set to true, the service searches the input BPF for a TRN tier (turn/chunk segmentation, see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatsdeu.html#TRN). The synopsis for a TRN entry is: 'TRN: (start-sample) (duration-sample) (word-link-list) (label)', e.g. 'TRN: 23654 56432 0,1,2,3,4,5,6 sentence1' (the speech within the recording 'sentence1' starts at sample 23654, lasts for 56432 samples and covers the words 0-6). If only one TRN entry is found, the segmentation is restricted to the time range given by this TRN tier entry; this is useful if there exists a reliable pre-segmentation of the recorded utterance, i.e. the start and end of speech within the recording is known. If more than one TRN entry is found, the webservice performs a segmentation for each 'chunk' defined by a TRN entry and aggregates all individual results into a single results file; this is useful if the input consists of long recordings, for which a manual chunk segmentation is available.
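The TRN synopsis given above can be split into its four fields as follows. This is a minimal Python sketch; the function name parse_trn and the 16000 Hz sample rate used for the conversion to seconds are assumptions made for illustration only.

```python
def parse_trn(line):
    """Split a BPF TRN entry into start sample, duration, word links and label."""
    key, start, duration, links, label = line.split(None, 4)
    if key != "TRN:":
        raise ValueError(f"not a TRN entry: {line!r}")
    return {
        "start": int(start),                            # first sample of the chunk
        "duration": int(duration),                      # chunk length in samples
        "words": [int(w) for w in links.split(",")],    # linked word numbers
        "label": label,
    }


entry = parse_trn("TRN: 23654 56432 0,1,2,3,4,5,6 sentence1")

# Convert sample positions to seconds; the sample rate is an assumed value
# and must be taken from the actual signal file.
SAMPLE_RATE = 16000
start_seconds = entry["start"] / SAMPLE_RATE
end_seconds = (entry["start"] + entry["duration"]) / SAMPLE_RATE
```

Splitting with a maximum of four splits keeps labels that contain blanks in one piece.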
If USETRN is set to 'force' (deprecated since maus 4.11; use PRESEG=true instead!), a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input BPF does not contain a TRN entry and the input signal has leading and/or trailing silence. BPF: Phonemic transcription of the utterance to be segmented. Format is either a BAS Partitur Format (BPF, *.par) file with a KAN tier or a spreadsheet CSV file. The KAN tier contains a table with 3 columns and one line per word in the input. Column 1 is always 'KAN:'; column 2 is an integer starting with 0 denoting the word position (tokenization) within the input; column 3 contains the canonical pronunciation of the word coded in SAM-PA (or IPA). The *.csv file contains two columns separated by ';', one word in each line, the UTF-8 encoded orthography in the 1st, the canonical pronunciation in the 2nd column (SAMPA or IPA). Note that the pronunciation string must contain phoneme-separating blanks for the language independent mode (LANGUAGE = 'sampa' or 'und'), e.g. /h OY t @/; for languages that are official SAMPA these are optional (e.g. /hOYt@/ is possible). See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for detailed description of the BPF. MAUSSHIFT: Option Segment shift (MAUSSHIFT): If set to n, this option causes the calculated MAUS segment boundaries to be shifted by n msec (default: 0) into the future. Most likely this systematic shift is caused by a boundary bias in the training material's segmentation. The default should work for most cases. INSPROB: Option Phon insertion prob (INSPROB): The option INSPROB influences the probability of deletion of segments. It is a constant factor (a constant value added to the log likelihood score) after each segment.
Therefore, a higher value of INSPROB will cause the probability of segmentations with more segments go up, thus decreasing the probability of deletions (and increasing the probability of insertions, which are rarely modelled in the rule sets). This parameter has been evaluated on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) and found to have its optimum at 0.0 (which is nice). Therefore we set the default value of INSPROB to 0.0. INSPROB was also tested against the MAUS TEST set to confirm the value of 0.0. It had an optimum at 0.0 as well. Note that this might NOT be the optimal value for other MAUS tasks. INSORTTEXTGRID: [true, false] Option ORT tier in TextGrid (INSORTTEXTGRID): Switch to create an additional tier ORT in the TextGrid output file with a word segmentation labelled with the orthographic transcript (taken from the input ORT tier); this option is only effective, if the input BPF contains an additional ORT tier. OUTSYMBOL: [sampa, ipa, manner, place] Option Output Encoding (OUTSYMBOL): Defines the encoding of phonetic symbols in output. If set to 'sampa' (default), phonetic symbols in output are encoded in X-SAMPA (with some minor differences in languages Norwegian/Icelandic in which the retroflex consonants are encoded as 'rX' instead of X-SAMPA 'X_r'); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service produces UTF-8 IPA output. If set to 'manner', the service produces IPA manner of articulation for each segment; possible values are: silence, vowel, diphthong, plosive, nasal, fricative, affricate, approximant, lateral-approximant, ejective. If set to 'place', the service produces IPA place of articulation for each segment; possible values are: silence, labial, dental, alveolar, post-alveolar, palatal, velar, uvular, glottal, front, central, back. 
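The effect of INSPROB described above (a constant added to the log likelihood score after each segment) can be made concrete with a toy calculation in Python. All scores below are invented, and the scoring function is a deliberately simplified stand-in for illustration, not the MAUS search itself.

```python
def path_score(segment_log_likelihoods, insprob):
    """Sum of per-segment log likelihoods plus INSPROB once per segment."""
    return sum(segment_log_likelihoods) + insprob * len(segment_log_likelihoods)


# Two competing segmentations of the same stretch of signal (invented numbers):
full_path = [-10.0, -10.0, -10.0, -10.0, -10.0]   # 5 segments, nothing deleted
reduced_path = [-12.0, -12.0, -12.0, -13.0]       # 4 segments, one deletion


def preferred(insprob):
    """Return which of the two toy paths scores higher for a given INSPROB."""
    if path_score(full_path, insprob) > path_score(reduced_path, insprob):
        return "full"
    return "reduced"


choice_default = preferred(0.0)   # -50.0 vs -49.0
choice_raised = preferred(2.0)    # -40.0 vs -41.0
```

With INSPROB at 0.0 the path with the deletion wins in this toy case; raising INSPROB adds more to the path with more segments and tips the decision towards it, which is the tendency the option description states.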
RULESET: MAUS rule set file; UTF-8 encoded; one rule per line; there are two different file types defined by the extension: 1. Phonological rule set without statistical information '*.nrul', synopsis is: 'leftContext-match-rightContext>leftContext-replacement-rightContext', e.g. 't,s-e:-n>t,s-@-n'. 2. Rule set with statistical information '*.rul', synopsis is: 'leftContext,match,rightContext>leftContext,replacement,rightContext ln(P(replacement|match)) 0.0000', e.g. 'P9,n,@,n,#>P9,# -3.761200 0.000000'; 'P(replacement|match)' is the conditional probability that 'match' is being replaced by 'replacement'; the sum over all conditional probabilities with the same condition 'match' must be less than 1; the difference between the sum and 1 is the conditional probability 'P(match|match)', i.e. the probability of no change. 'leftContext/rightContext/match/replacement' = comma separated lists of SAMPA symbols or empty lists (for *.rul the leftContext/rightContext must be exactly one symbol!); special SAMPA symbols in contexts are: '#' = word boundary between words, and '<' = utterance begin (may be used instead of a phonemic symbol); digits in SAMPA symbols must be preceded by 'P' (e.g. '2:' -> 'P2:'); all used SAMPA symbols must be defined in the language specific SAMPA set (see service runMAUSGetInventar). Examples for '*.rul' : 'P9,n,@,n,#>P9,#' = 'the word final syllable /n@n/ is deleted, if preceded by /9/', '#,k,u:>#,g,u:' = 'word initial /k/ is replaced by /g/ if followed by the vowel /u:/'. Examples for '*.nrul' : '-->-N,k-' = 'insert /Nk/ at arbitrary positions', '#-?,E,s-#>#-s-#' = 'delete /?E/ in word /?Es/', 'aI-C-s,t,#>aI-k-s,t,#' = 'replace /C/ in word final syllable /aICst/ by /k/'. MINPAUSLEN: [1.0, 999.0] Option Inter-word silence (MINPAUSLEN): Controls the behaviour of optional inter-word silence. If set to 1, maus will detect all inter-word silence intervals that can be found (minimum length for a silence interval is then 10 msec = 1 frame).
If set to values n>1, the minimum length for an inter-word silence interval to be detected is set to n*10 msec. For example MINPAUSLEN of 5 will cause MAUS to suppress inter-word silence intervals up to a length of 40msec. Since 40 msec seems to be the border of perceivable silence, we set the default of this option to 5. In other words: inter-word silences shorter than 50msec are not segmented but rather distributed equally to the adjacent segments. If one of the adjacent segments happens to be a plosive then the deleted silence interval is added totally to the plosive; if both adjacent segments are plosives, the interval is equally spread as with non-plosive adjacent segments. WEIGHT: The option Pron model weight (WEIGHT) weights the influence of the statistical pronunciation model against the acoustical scores. More precisely, the pronunciation model score (log likelihood) is multiplied by WEIGHT before the score is added to the acoustical score within the search. Since the pronunciation model in most cases favors the canonical pronunciation, increasing WEIGHT will at some point cause MAUS to always choose the canonical pronunciation; lower values of WEIGHT will favor the selection of less probable paths according to acoustic evidence. If the acoustic quality of the signal is very good and the HMMs of the language are well trained, it makes sense to lower WEIGHT. For most languages this option defaults to 1.0. In an evaluation on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) WEIGHT was optimized to 7.0. Note that this might NOT be the optimal value for other languages. For instance Italian shows best results with WEIGHT=1.0, Estonian with WEIGHT=2.5. If set to default, a language specific optimal value is chosen automatically. NOINITIALFINALSILENCE: [true, false] Option No silence model (NOINITIALFINALSILENCE): Switch to suppress the automatic modeling of an optional leading/trailing silence interval.
This is useful if, for instance, the signal is known to start with a stop and no leading silence, and the silence model would 'capture' the silence interval from the plosive.
ADDSEGPROB: [true, false] Option Add Viterbi likelihoods (ADDSEGPROB) causes the frame-normalized natural-log total Viterbi likelihood of an aligned segment to be appended to the segment label in the output annotation (the MAU tier). This might be used as a 'quasi quality measure' of how well the acoustic signal in the aligned segment has been modelled by the combined acoustical and pronunciation model of MAUS. Note that the values are not probabilities but likelihood densities, and therefore are not comparable for different signal segments; they are, however, comparable for the same signal segment. Warning: this option breaks the BPF standard for the MAU tier and must not be used if the resulting MAU tier is to be processed further (e.g. in a pipe). Implemented only for output phoneme symbol set SAMPA (default).
Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful, "downloadLink" specifies the location where the result file can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains output that is mostly useful for debugging errors, and "warning" lists warnings, if any occurred during processing.
----------------------------------------------------------------
----------------------------------------------------------------
runMAUSBasic
------------------
Description: Segments a media file into phonetic and word segments given the orthographic transcription as input (text file). The result is stored in a three-layer annotation file (word segmentation with orthographic labels, word segmentation with canonical pronunciation labels in SAM-PA, phonemic segmentation with SAM-PA labels).
This is a simple version of a G2P_MAUS pipeline service which applies only default options; see operation 'runMAUS' ('WebMAUS General' service) for the full MAUS service with all options or the operation 'runPipeline' ('Pipeline' service).
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F OUTFORMAT=TextGrid -F TEXT=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic'
Parameters: SIGNAL [LANGUAGE] [OUTFORMAT] TEXT
Parameter description:
SIGNAL: file containing the speech signal to be segmented; PCM 16 bit resolution; mono; any sampling rate; optimal results if leading and trailing silence intervals are truncated before processing. Although the mimetype of this input file is restricted to audio/x-wav (wav|WAV), the service will also process *.nis|nist|sph (NIST SPHERE), *.al|dea (ALAW), *.mpeg|mpg (Video, several codecs) and *.mp4 (MPEG4). File format will be determined by extension only.
LANGUAGE: [aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-BE, nld-NL, eng-AU, eng-US, eng-GB, eng-SC, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-AT, deu-CH, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, spa-ES, swe-SE, tha-TH, guf-AU] Language of the speech to be processed; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; defines the possible orthographic text language in the input, the text-to-phoneme transformation and some language specific transformations within the MAUS process. The code 'gsw-CH' (= Swiss German) denotes orthographic text input in Swiss German 'Dieth' encoding.
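The structure of the LANGUAGE codes described above can be illustrated with a small Python sketch. The helper name split_language_code and the returned field names are hypothetical and not part of the BAS webservices; the sketch only splits a code at hyphens and does not check it against the list of supported languages.

```python
def split_language_code(code):
    """Split a code such as 'gsw-CH-BE' into its hyphen-separated parts."""
    parts = code.split("-")
    if len(parts) < 2 or not all(parts):
        raise ValueError(
            f"expected 'language-country[-subdivision]', got {code!r}"
        )
    return {
        "language": parts[0],  # ISO 639-3 language code, e.g. 'deu'
        "country": parts[1],   # ISO 3166-1 country code, e.g. 'AT'
        # optional third part (ISO 3166-2 style subdivision), else None
        "subdivision": "-".join(parts[2:]) or None,
    }


print(split_language_code("eng-US"))
print(split_language_code("gsw-CH-BE"))
```

A code without a hyphen (for example 'deu') raises a ValueError in this sketch, since the described structure always has at least a language and a country part.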
INSKANTEXTGRID: Switch to create an additional tier in the TextGrid output file with a word segmentation labelled with the canonical phonemic transcript (taken from the input KAN tier). This option cannot be set in this service.
RELAXMINDUR: Option Relax Min Duration (RELAXMINDUR) changes the default minimum duration of 3 states for consonants and short/lax vowels and of 4 states for tense/long vowels and diphthongs to 1 and 2 states respectively. This is not optimal for general segmentation because MAUS will start to insert many very short vowels/glottal stops where they are not appropriate. But for some special investigations (e.g. the duration of /t/) it alleviates the ceiling problem at 30msec duration (at a standard frame rate of 10msec per state). This option cannot be set in this service.
OUTFORMAT: [par, exb, csv, TextGrid, emuDB, eaf, tei, bpf, mau] Option 'Output format' (OUTFORMAT): Defines the possible output formats: TextGrid - a praat compatible TextGrid file; bpf - a BPF file with tiers ORT (words), KAN (pronunciation) and MAU (phonetic segments); csv - a spreadsheet (CSV table) with word and phone segmentation; emuDB - an Emu compatible *_annot.json file; eaf - an ELAN compatible annotation file; exb - an EXMARaLDA compatible annotation file; tei - ISO TEI document (XML). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB().
Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not lossless; select 'bpf' to be sure that no information is lost.
PRESEG: Option PRESEG: If set to true, a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input signal has leading and/or trailing silence. If this option is set in combination with USETRN=true and the input BPF contains a chunk segmentation (tier TRN), then the pre-segmentation is carried out for every single chunk. This option cannot be set in this service.
USETRN: If set to true, the service searches the input BPF for a TRN tier (turn/chunk segmentation, see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatsdeu.html#TRN). The synopsis for a TRN entry is: 'TRN: (start-sample) (duration-sample) (word-link-list) (label)', e.g. 'TRN: 23654 56432 0,1,2,3,4,5,6 sentence1' (the speech within the recording 'sentence1' starts with sample 23654, lasts for 56432 samples and covers the words 0-6). If only one TRN entry is found, the segmentation is restricted to the time range given by this TRN tier entry; this is useful if there exists a reliable pre-segmentation of the recorded utterance, i.e. the start and end of speech within the recording is known. If more than one TRN entry is found, the webservice performs a segmentation for each 'chunk' defined by a TRN entry and aggregates all individual results into a single result file; this is useful if the input consists of long recordings, for which a manual chunk segmentation is available. If USETRN is set to 'force', a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input BPF does not contain a TRN entry and the input signal has leading and/or trailing silence. This option cannot be set in this service.
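The TRN synopsis given above can be made concrete with a minimal Python sketch that parses one TRN entry. The names TrnEntry and parse_trn_line are hypothetical helpers for illustration, not part of any BAS tool, and the sketch assumes the fields are separated by white space exactly as in the example entry.

```python
from typing import List, NamedTuple


class TrnEntry(NamedTuple):
    start_sample: int      # first sample of the chunk
    duration_samples: int  # length of the chunk in samples
    word_links: List[int]  # indices of the words covered by the chunk
    label: str             # free-text label of the chunk


def parse_trn_line(line: str) -> TrnEntry:
    """Parse 'TRN: (start) (duration) (word-link-list) (label)'."""
    # The label is the remainder of the line, so split at most 4 times.
    key, start, duration, links, label = line.split(maxsplit=4)
    if key != "TRN:":
        raise ValueError(f"not a TRN entry: {line!r}")
    return TrnEntry(
        start_sample=int(start),
        duration_samples=int(duration),
        word_links=[int(i) for i in links.split(",")],
        label=label.strip(),
    )


entry = parse_trn_line("TRN: 23654 56432 0,1,2,3,4,5,6 sentence1")
# Sample position just after the last sample of the chunk.
end_sample = entry.start_sample + entry.duration_samples
print(entry, end_sample)
```

Dividing start_sample and duration_samples by the sampling rate of the recording gives the corresponding times in seconds.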
TARGETRATE: Option TARGETRATE: the resolution of segment boundaries in output measured in 100nsec units (default 100000 = 10msec). Decreasing this value (min is 10000) increases computation time, does not increase segmental accuracy on average, but allows output segment boundaries to assume more possible values (default segment boundaries are quantized in 10msec steps). This is useful if MAUS results are analysed for duration of phones or syllables. This option cannot be set in this service.
TEXT: orthographic text of the utterance to be segmented; words are white space separated; encoding is utf-8; punctuation is ignored.
INSORTTEXTGRID: Switch to create an additional tier ORT in the TextGrid output file with a word segmentation labelled with the orthographic transcript (taken from the input ORT tier); this option is only effective if the input BPF contains an additional ORT tier. This option cannot be set in this service.
NOINITIALFINALSILENCE: Switch to suppress the automatic modeling of an optional leading/trailing silence interval. This is useful if, for instance, the signal is known to start with a stop and no leading silence, and the silence model would 'capture' the silence interval from the plosive. This option cannot be set in this service.
Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful, "downloadLink" specifies the location where the Praat TextGrid file can be found, "output" contains output that is mostly useful for debugging errors, and "warning" lists warnings, if any occurred during processing.
The Praat TextGrid file contains three tiers: orthographic transcription (segmented in words), canonical phonemic transcription in SAM-PA (segmented in words), phonemic segmentation by MAUS in SAM-PA.
----------------------------------------------------------------
----------------------------------------------------------------
runAnnotConv
------------------
Description: This service is a general purpose annotation converter from BAS Partitur Format (BPF) to several standards. The service reads an annotation file of format INPFORMAT and converts it into the annotation format given in option OUTFORMAT. Most conversions require at least one annotation layer with timing information. Details about the BAS Partitur Format (BPF) can be found in https://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html#Partitur.
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F INPFORMAT=bpf -F outFormat=TextGrid -F INP=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runAnnotConv'
Parameters: [INPFORMAT] [outFormat] INP
Parameter description:
INPFORMAT: [bpf] Option INPFORMAT: the annotation format of the input file.
outFormat: [exb, csv, TextGrid, emuDB, eaf, tei] Option outFormat: the annotation format of the output file. Note that some annotation formats may contain only a subset of the information that is contained in the input. For example, if the input BPF contains the tiers ORT,KAN,MAU,GES and outFormat is set to 'TextGrid', only the tiers ORT,KAN and MAU are transformed into the output TextGrid without warning. This might be important if you use this converter within a pipeline that produces more than the basic time-alignment.
INP: The input annotation file to be converted; the format must match the option INPFORMAT.
Output: An XML response containing the elements "success", "downloadLink", "output" and "warning".
"success" states if the processing was successful or not, "downloadLink" specifies the location where the output annotation file can be found, "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occured during processing. ---------------------------------------------------------------- ---------------------------------------------------------------- runMAUSGetHelp ------------------ Description: Returns the help of the MAUS tool on the server which describes the available parameters in more detail. Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetHelp' Parameter description: Output: Help message of the actual MAUS tool. ---------------------------------------------------------------- ---------------------------------------------------------------- runAudioEnhance ------------------ Description: This services reads a media file and performs several signal processing operations mostly based on the SoX ('Sound Exchange') and N-HANS projects. Without any options set, the service produces a RIFF WAVE audio file optimized for processing in the BAS WebServices. Details about the 'Sound Exchange' project (SoX) see https://www.openhub.net/p/sox. Details about the 'N-HANS' project see https://github.com/N-HANS/N-HANS. Depending on input and given options the service extracts sound track from video input, converts non-RIFF sound formats into RIFF, merge/re-arrange multi-channel files, re-sample to given sampling rate, (spectral) filters signal for constant background noise, applies high-, low-, band pass- and band reject filter, applies speech rate manipulation without changing the length (speed), manipulate pitch without changing the speech rate, removes complex noise while preserving a target noise/voice (N-HANS), separates a target speaker from an interference speaker/speaker group (N-HANS). 
In the current version the input audio format is not retained; the output audio format is always RIFF WAVE.
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F NHANS=none -F MONO=true -F PITCH=0 -F NOISE=0 -F NOISEPROFILE=0 -F LOWF=0 -F neg=@ -F RESAMPLE=0 -F pos=@ -F CHANNELSELECT= -F NORM=true -F TEMPO=1.0 -F HIGHF=0 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runAudioEnhance'
Parameters: SIGNAL [NHANS] [MONO] [PITCH] [NOISE] [NOISEPROFILE] [LOWF] [neg] [RESAMPLE] [pos] [CHANNELSELECT] [NORM] [TEMPO] [HIGHF]
Parameter description:
SIGNAL: The input media file to be processed; the format is recognized by the file's extension; supported formats are: wav,nis,sph,mp3,mpeg,mp4,avi,flv (although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), most pipes will also process NIST/SPHERE (nis|sph) and several kinds of video formats).
NHANS: [none, denoiser, separator] Option NHANS: the N-HANS audio enhancement mode (default: 'none') applied to the result of the SoX pipeline. 'denoiser' : the noise as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal; if another voice or noise sample is uploaded in option file 'pos' (optional), this noise/voice is preserved in the signal together with the main voice. 'separator' : an interference speaker or speaker group as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal while the voice of a target speaker as uploaded in the mandatory option file 'pos' is preserved in the signal. Both sample signals, 'neg' and 'pos', are applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the other noise sample.
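Which sample files each NHANS mode requires can be summarized in a short Python sketch, for example as a client-side check before a request is sent. The function name missing_nhans_samples is hypothetical; only the mode names and the 'neg'/'pos' requirements are taken from the description above.

```python
def missing_nhans_samples(mode, has_neg, has_pos):
    """Return the mandatory sample files that are missing for an NHANS mode."""
    if mode == "none":
        return []  # no N-HANS processing, no samples needed
    if mode == "denoiser":
        required = ["neg"]         # 'pos' is optional in this mode
    elif mode == "separator":
        required = ["neg", "pos"]  # both samples are mandatory
    else:
        raise ValueError(f"unknown NHANS mode: {mode!r}")
    supplied = {"neg": has_neg, "pos": has_pos}
    return [name for name in required if not supplied[name]]


print(missing_nhans_samples("denoiser", has_neg=True, has_pos=False))
print(missing_nhans_samples("separator", has_neg=True, has_pos=False))
```

An empty list means that all mandatory samples for the chosen mode are present.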
MONO: [true, false] Option MONO: if true (selected) input channels are merged. Note that most operations (e.g. filtering, normalization) are performed on the individual channel before the merge.
PITCH: [-1000.0, 1000.0] Option PITCH: pitch shift in 100th of a semi-tone without changing speech rate. E.g. PITCH = -100 shifts the fundamental frequency down by one semi-tone.
NOISE: [0.0, 100.0] Option NOISE: if set to a value between 1...100, a noise profile is calculated from the leading and/or trailing parts of the input signal, and then the signal is noise reduced with a strength proportional to the NOISE value (using SoX spectral noise reduction effect 'noisered'). The noise reduction is applied before any other processing/merging in all input channels. If NOISE=0, no noise reduction takes place.
NOISEPROFILE: [-1000000.0, 1000000.0] Option NOISEPROFILE: if set to 0 (default), the noise profile is calculated from the leading and trailing portion of the recording (estimated by a silence detector); if set to a positive value, the noise profile is calculated from the leading NOISEPROFILE samples; if set to a negative value, the noise profile is calculated from the trailing NOISEPROFILE samples. This is useful if the recording contains loud noise at the begin/end of the recording that would not be selected by the silence detector (because of too much energy).
LOWF: [0.0, 30000.0] Option LOWF: lower filter edge in Hz. If set >0Hz and HIGHF is 0Hz, a high pass filter with LOWF Hz is applied; if set >0Hz and HIGHF is set higher than LOWF, a band pass between LOWF and HIGHF is applied; if set >0Hz and HIGHF is set higher than 0Hz but lower than LOWF, a band reject filter between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum.
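How LOWF and HIGHF together select the filter type can be written down as a small Python sketch. The function name filter_mode and the returned labels are hypothetical; the low pass case is taken from the description of the HIGHF option of this service, and the case LOWF = HIGHF > 0 is not covered by the documentation, so the sketch rejects it as an assumption.

```python
def filter_mode(lowf, highf):
    """Return (filter type, lower edge in Hz, upper edge in Hz)."""
    if lowf < 0 or highf < 0:
        raise ValueError("filter edges must not be negative")
    if lowf == 0 and highf == 0:
        return ("none", None, None)       # no filtering
    if highf == 0:
        return ("highpass", lowf, None)   # only LOWF set
    if lowf == 0:
        return ("lowpass", None, highf)   # only HIGHF set
    if highf > lowf:
        return ("bandpass", lowf, highf)  # pass LOWF..HIGHF
    if highf < lowf:
        return ("bandreject", highf, lowf)  # reject HIGHF..LOWF
    # Assumption: equal non-zero edges are treated as an error here.
    raise ValueError("LOWF equal to HIGHF is not described")


print(filter_mode(300, 3000))  # telephone band
print(filter_mode(55, 45))     # rejects the band around a 50Hz hum
```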
neg: Option neg : N-HANS sample recording (RIFF WAVE *.wav) of the noise to be removed from signal (mode 'denoiser') or the speaker/speaker group to be removed from signal (mode 'separator'). The 'neg' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the 'pos' noise sample. The upload of the 'neg' sample is mandatory for both N-HANS modes (see option 'NHANS').
RESAMPLE: [0.0, 96000.0] Option RESAMPLE: re-sample signal to this value in Hz; RESAMPLE=0 : no re-sampling.
pos: Option pos : N-HANS sample recording (RIFF WAVE *.wav) of the noise to be preserved in the signal (mode 'denoiser') or the target speaker to be preserved in the signal (mode 'separator'). The 'pos' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice (mode 'denoiser') nor of the 'neg' sample (modes 'denoiser' and 'separator'). The upload of the 'pos' sample is mandatory for N-HANS mode 'separator' and optional for mode 'denoiser' (see option 'NHANS').
CHANNELSELECT: Option CHANNELSELECT: list of comma-separated channel numbers that are selected for further processing from the input media file. Examples: MONO=true,CHANNELSELECT="" : merge multi-channel files into one channel, MONO=true,CHANNELSELECT="2,3,4" : merge only selected channels into one channel, MONO=false, CHANNELSELECT="3,4,1,2" : select and re-arrange channels, MONO=false, CHANNELSELECT="" : do nothing. Note that channels are numbered starting with 1 = left channel in stereo, 2 = right channel, ... By reversing the order of channel numbers in CHANNELSELECT you can swap channels, e.g.
CHANNELSELECT="2,1" MONO=false will swap left and right channel of a stereo signal.
NORM: [true, false] Option NORM: if true (selected) each input channel is amplitude normalised to -3dB before any merge.
TEMPO: [0.25, 4.0] Option TEMPO: factor of speech rate change; >1 speeds up, <1 slows down. E.g. TEMPO = 1.5 increases the speech rate by 50% (signal gets shorter).
HIGHF: [0.0, 30000.0] Option HIGHF: upper filter edge in Hz. If set >0Hz and LOWF is 0Hz, a low pass filter with HIGHF Hz is applied; if set >0Hz and LOWF is set lower than HIGHF, a band pass between LOWF and HIGHF is applied; if set >0Hz and LOWF is set higher than HIGHF, a band reject filter between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum.
Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful, "downloadLink" specifies the location where the output audio file can be found, "output" contains output that is mostly useful for debugging errors, and "warning" lists warnings, if any occurred during processing.
----------------------------------------------------------------
----------------------------------------------------------------
runGetVersion
------------------
Description: Returns the version number of the different underlying tools. If no option is specified it returns the version number of the services.
Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runGetVersion?service=services'
Parameters: [service]
Parameter description:
service: [runAnonymizer, runAnnotConv, runASR, runAudioEnhance, runChannelSeparator, runChunker, runChunkPreparation, runCOALA, runEMUMagic, runFormantAnalysis, runG2P, runMAUS, runMINNI, runPho2Syl, runPipeline, runPipelineWithASR, runSpeakDiar, runSubtitle, runTextEnhance, runVoiceActivityDetection, services] Name of the service to get the version.
Output: Version number of the tool requested.
----------------------------------------------------------------
----------------------------------------------------------------
runPipeline
------------------
Description: This is a service that combines two or more BAS webservices into a processing chain (pipeline) without Automatic Speech Recognition (ASR). To run a pipeline with ASR use the service 'Pipeline with ASR' (runPipelineWithASR). Since not every BAS webservice can be combined with another, the service only offers pipelines that make sense for the user. All pipelines executed by this service can also be executed by calling two or more BAS webservices one after another and passing the output of one service to the next. The benefit, however, is that the user data (which can be substantially large) will be up- and down-loaded only once, and of course that the user does not have to formulate several BAS webservice calls (with matching parameters). The parameter PIPE defines which processing pipeline will be executed; depending on the value of PIPE the service accepts parameters for the BAS webservices which are involved in the pipeline, and which make sense in the context of the pipeline. Other parameters will be set automatically depending on the value of PIPE (e.g.
the MAUS parameter USETRN will be set to 'true' in the case of a pipeline where the runChunkPreparation service passes a BPF file to the runMAUS service containing a chunk segmentation in the TRN tier). Since this service basically comprises all BAS web services, the number of possible parameters is necessarily huge. To make the selection easier we group the parameters into MANDATORY (that have to be set for every pipeline), optional parameters that are shared by more than one service, and then by PIPELINE ELEMENT (e.g. ASR, MAUS, in alphabetical order). In most cases it is sufficient to set the MANDATORY parameters, and the Pipeline service will then set the element specific parameters automatically. The service will perform a pre-check on all set parameters to detect conflicts and, if any are found, terminate with an informative message; but there are still many cases where the pipeline will start working and then terminate with an error caused by a service later down the pipe. Starting with version 6.0 the service will deliver a ZIP archive instead of the output of the last service in PIPE, if the option 'KEEP' ('Keep everything') is enabled; this ZIP will contain input(s), all intermediary results, end result and a protocol of the pipeline process.
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F com=yes -F INSKANTEXTGRID=true -F USETEXTENHANCE=true -F TARGETRATE=100000 -F TEXT=@ -F NOISE=0 -F PIPE= -F aligner=hirschberg -F NOISEPROFILE=0 -F neg=@ -F speakMatch= -F speakNumber=0 -F ASIGNAL=brownNoise -F NORM=true -F mauschunking=false -F minSpeakNumber=0 -F INSORTTEXTGRID=true -F WEIGHT=default -F minanchorlength=3 -F LANGUAGE=deu-DE -F NHANS=none -F USEAUDIOENHANCE=true -F maxlength=0 -F KEEP=false -F LEFT_BRACKET=# -F nrm=no -F LOWF=0 -F WHITESPACE_REPLACEMENT=_ -F CHANNELSELECT= -F marker=punct -F USEREMAIL= -F boost=true -F except=@ -F MINPAUSLEN=5 -F forcechunking=false -F NOINITIALFINALSILENCE=false -F InputTierName=unknown -F BRACKETS=<> -F OUTFORMAT=TextGrid -F syl=no -F ENDWORD=999999 -F wsync=yes -F UTTERANCELEVEL=false -F featset=standard -F pos=@ -F APHONE= -F INSPROB=0.0 -F OUTSYMBOL=x-sampa -F RULESET=@ -F maxSpeakNumber=0 -F allowOverlaps=false -F minchunkduration=15 -F SIGNAL=@ -F stress=no -F imap=@ -F MODUS=default -F RELAXMINDUR=false -F ATERMS=@ -F RELAXMINDURTHREE=false -F STARTWORD=0 -F INSYMBOL=sampa -F PRESEG=false -F AWORD=ANONYMIZED -F USETRN=false -F MAUSSHIFT=default -F HIGHF=0 -F silenceonly=0 -F boost_minanchorlength=4 -F ADDSEGPROB=false 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPipeline' Parameters: [com] [INSKANTEXTGRID] [USETEXTENHANCE] [TARGETRATE] TEXT [NOISE] [PIPE] [aligner] [NOISEPROFILE] [neg] [speakMatch] [speakNumber] [ASIGNAL] [NORM] [mauschunking] [minSpeakNumber] [INSORTTEXTGRID] [WEIGHT] [minanchorlength] [LANGUAGE] [NHANS] [USEAUDIOENHANCE] [maxlength] [KEEP] [LEFT_BRACKET] [nrm] [LOWF] [WHITESPACE_REPLACEMENT] [CHANNELSELECT] [marker] [USEREMAIL] [boost] [except] [MINPAUSLEN] [forcechunking] [NOINITIALFINALSILENCE] [InputTierName] [BRACKETS] [OUTFORMAT] [syl] [ENDWORD] [wsync] [UTTERANCELEVEL] [featset] [pos] [APHONE] [INSPROB] [OUTSYMBOL] [RULESET] [maxSpeakNumber] [allowOverlaps] [minchunkduration] 
SIGNAL [stress] [imap] [MODUS] [RELAXMINDUR] [ATERMS] [RELAXMINDURTHREE] [STARTWORD] [INSYMBOL] [PRESEG] [AWORD] [USETRN] [MAUSSHIFT] [HIGHF] [silenceonly] [boost_minanchorlength] [ADDSEGPROB]
Parameter description:
com: [yes, no] Option com (Keep Annotation): yes/no decision whether <*> strings in text inputs should be treated as annotation markers (yes) or as spoken words (no). If set to 'yes', then strings of this type are considered as annotation markers that are not processed as spoken words but passed on to the output. The <*> markers will appear in the ORT and KAN tier with a word index of their own. WebMAUS makes use of two special markers < usb > (e.g. non-understandable word or other human noises) and < nib > (non-human noise). All other markers <*> are modelled as silence. Markers must be separated from word tokens by blanks; they do not need to be blank-separated from non-word tokens such as punctuation. Note that the default service 'TEXTENHANCE' that is called by any pipeline that reads input text will replace white space characters (such as blanks) within the <*> by the character given in option 'White space replacement'.
INSKANTEXTGRID: [true, false] OPTION INSKANTEXTGRID: Switch to create an additional tier in the TextGrid output file with a word segmentation labelled with the canonical phonemic transcript (taken from the input KAN tier).
USETEXTENHANCE: [true, false] Switch on the input text pre-processing 'textEnhance' (true). If the PIPE starts with G2P, the input text is first normalized by 'textEnhance'. Different TXT formats are mapped to simple UTF-8 Unix style TXT format, and text markers are normalized to conform to BAS WebServices.
TARGETRATE: [100000, 20000, 10000] Option TARGETRATE: the resolution of segment boundaries in output measured in 100nsec units (default 100000 = 10msec).
Decreasing this value (min is 10000) increases computation time, does not increase segmental accuracy on average, but allows output segment boundaries to assume more possible values (default segment boundaries are quantized in 10msec steps). This is useful if MAUS results are analysed for duration of phones or syllables.
TEXT: Mandatory parameter TEXT: The textual input to the pipeline, usually some form of text or transcript. Depending on parameter PIPE this can be a text document (all formats supported by service runTextEnhance), a comma separated spreadsheet (csv), a praat TextGrid (TextGrid), an ELAN EAF (eaf), or a BAS Partitur Format (par) file. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF. Note that PIPEs starting with service ASR or MINNI do not require this parameter. Special languages for text input: Thai, Russian and Georgian expect their respective standard alphabets; Japanese allows Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; Swiss German expects input to be transcribed in 'Dieth' (https://en.wikipedia.org/wiki/Swiss_German); Australian Aboriginal languages (including Kunwinjku, Yolnu Matha) expect so called 'Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); Persian accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details).
NOISE: [0.0, 100.0] Option NOISE: if set to a value between 1...100, a noise profile is calculated from the leading and/or trailing parts of the input signal, and then the signal is noise reduced with a strength proportional to the NOISE value (using SoX spectral noise reduction effect 'noisered'). The noise reduction is applied before any other processing/merging in all input channels.
If NOISE=0, no noise reduction takes place. PIPE: [G2P_CHUNKER, CHUNKER_MAUS, CHUNKER_MAUS_SD, CHUNKER_MAUS_PHO2SYL, CHUNKER_MAUS_PHO2SYL_SD, CHUNKER_MAUS_SUBTITLE, CHUNKER_MAUS_SUBTITLE_SD, CHUNKER_MAUS_SUBTITLE_PHO2SYL, CHUNKER_MAUS_SUBTITLE_PHO2SYL_SD, CHUNKPREP_G2P_MAUS, CHUNKPREP_G2P_MAUS_SD, CHUNKPREP_G2P_MAUS_PHO2SYL, CHUNKPREP_G2P_MAUS_PHO2SYL_SD, CHUNKPREP_G2P_MAUS_SUBTITLE, CHUNKPREP_G2P_MAUS_SUBTITLE_SD, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_SD, G2P_CHUNKER_MAUS, G2P_CHUNKER_MAUS_SD, G2P_CHUNKER_MAUS_PHO2SYL, G2P_CHUNKER_MAUS_PHO2SYL_SD, G2P_CHUNKER_MAUS_SUBTITLE, G2P_CHUNKER_MAUS_SUBTITLE_SD, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_SD, G2P_MAUS, G2P_MAUS_SD, G2P_MAUS_PHO2SYL, G2P_MAUS_PHO2SYL_SD, G2P_MAUS_SUBTITLE, G2P_MAUS_SUBTITLE_SD, G2P_MAUS_SUBTITLE_PHO2SYL, G2P_MAUS_SUBTITLE_PHO2SYL_SD, MAUS_PHO2SYL, MAUS_PHO2SYL_SD, MAUS_SUBTITLE, MAUS_SUBTITLE_SD, MAUS_SUBTITLE_PHO2SYL, MAUS_SUBTITLE_PHO2SYL_SD, CHUNKER_MAUS_ANONYMIZER, CHUNKER_MAUS_ANONYMIZER_SD, CHUNKER_MAUS_PHO2SYL_ANONYMIZER, CHUNKER_MAUS_PHO2SYL_ANONYMIZER_SD, CHUNKER_MAUS_ANONYMIZER_SUBTITLE, CHUNKER_MAUS_ANONYMIZER_SUBTITLE_SD, CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_ANONYMIZER, CHUNKPREP_G2P_MAUS_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_PHO2SYL_ANONYMIZER, CHUNKPREP_G2P_MAUS_PHO2SYL_ANONYMIZER_SD, CHUNKPREP_G2P_MAUS_ANONYMIZER_SUBTITLE, CHUNKPREP_G2P_MAUS_ANONYMIZER_SUBTITLE_SD, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, CHUNKPREP_G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, G2P_CHUNKER_MAUS_ANONYMIZER, G2P_CHUNKER_MAUS_ANONYMIZER_SD, G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER, G2P_CHUNKER_MAUS_PHO2SYL_ANONYMIZER_SD, G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE, G2P_CHUNKER_MAUS_ANONYMIZER_SUBTITLE_SD, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, G2P_CHUNKER_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, G2P_MAUS_ANONYMIZER, G2P_MAUS_ANONYMIZER_SD, G2P_MAUS_PHO2SYL_ANONYMIZER, 
G2P_MAUS_PHO2SYL_ANONYMIZER_SD, G2P_MAUS_ANONYMIZER_SUBTITLE, G2P_MAUS_ANONYMIZER_SUBTITLE_SD, G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, G2P_MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD, MAUS_ANONYMIZER, MAUS_ANONYMIZER_SD, MAUS_PHO2SYL_ANONYMIZER, MAUS_PHO2SYL_ANONYMIZER_SD, MAUS_ANONYMIZER_SUBTITLE, MAUS_ANONYMIZER_SUBTITLE_SD, MAUS_SUBTITLE_PHO2SYL_ANONYMIZER, MAUS_SUBTITLE_PHO2SYL_ANONYMIZER_SD] Parameter PIPE: The type of pipeline to process. Values of parameter PIPE have the general form SERVICE_SERVICE[_SERVICE ...], where SERVICE is one of G2P, MAUS, CHUNKER, CHUNKPREP, PHO2SYL, SUBTITLE, ANONYMIZER, SD (for pipelines executing ASR or MINNI service see service 'Pipeline with ASR' (runPipelineWithASR)). For example PIPE=G2P_CHUNKER_MAUS_PHO2SYL denotes a pipe that runs over these 4 services. For all pipelines in this service both SIGNAL and TEXT inputs are necessary; the last SERVICE in PIPE determines which output the pipeline can produce. Therefore it is quite possible to call a pipe with an impossible input/output configuration, which will cause an ERROR. Every media file uploaded will first be passed through the service 'AudioEnhance' to normalize the media file to a RIFF WAVE format file; every text input is first run through the service 'TextEnhance' to normalize the text format; both of these obligatory services have options, just like the other pipeline SERVICES. Special pipe '..._SD' : the final speaker diarization module (SD) does not actually read any annotations from the previous services; it rather runs the speaker diarization in parallel on the signal input and then merges the speaker segmentation and labelling with whatever the rest of the pipe has produced, e.g. it merges speaker segments and word segments to produce a (symbolic) speaker labelling of the word segments.
aligner: [hirschberg, fast] Symbolic aligner to be used. The "fast" aligner performs approximate alignment by splitting the alignment matrix into "windows" of size 5000*5000.
The "hirschberg" aligner performs optimal matching. On recordings below the 1 hour mark, the choice of aligner does not make a big difference in runtime. On longer recordings, you can improve runtime by selecting the "fast" aligner. Note however that this choice increases the probability of errors on recordings with untranscribed stretches (such as long pauses, musical interludes, untranscribed speech). Therefore, the "hirschberg" aligner should be used on this kind of material. NOISEPROFILE: [-1000000.0, 1000000.0] Option NOISEPROFILE: if set to 0 (default), the noise profile is calculated from the leading and trailing portion of the recording (estimated by a silence detector); if set to a positive value, the noise profile is calculated from the leading NOISEPROFILE samples; if set to a negative value, the noise profile is calculated from the trailing NOISEPROFILE samples. This is useful if the recording contains loud noise at its beginning/end that would not be selected by the silence detector (because of too much energy). neg: Option neg : N-HANS sample recording (RIFF WAVE *.wav) of the noise to be removed from the signal (mode 'denoiser') or the speaker/speaker group to be removed from the signal (mode 'separator'). The 'neg' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the 'pos' noise sample. The upload of the 'neg' sample is mandatory for both N-HANS modes (see option 'NHANS'). speakMatch: Option speakMatch: if set to a list of comma separated names (e.g. speakMatch='Anton,Berta,Charlie'), the corresponding speaker labels found by the speaker diarization in the order of appearance are replaced by these names (e.g. 'S1' to 'Anton', 'S2' to 'Berta' etc.). 
This allows the user to create SD annotations using their own speaker labels, provided the user knows the order of appearance; this feature only makes sense in single file processing, since the speaker labels and the order of appearance differ from one recording to the next; the suggested mode of operation is to run the service in batch mode over all recordings with speakMatch="", then manually inspect the resulting annotation and define speaker labels in the order of appearance for each recording, and then run the service in single file mode for each recording again with the corresponding speakMatch list. If the speakMatch option contains a comma separated list of value pairs like 'S1:Anton', only the speaker labels listed on the left-hand side of each pair are patched, e.g. for speakMatch='S3:Charlie,S6:Florian' only the third and sixth appearing speakers are renamed to Charlie and Florian respectively. speakNumber: [0.0, 999999.0] Option speakNumber restricts the number of speakers detected by the speaker diarization to the given number. If set to 0 (default), the SD method determines the number automatically. ASIGNAL: [brownNoise, beep, silence] Option ASIGNAL: the type of signal used to mask anonymized terms in the signal. 'brownNoise' is brown noise; 'beep' is a 500Hz sine tone; 'silence' is total silence (zero signal); masking signals have an amplitude of -10dB of the maximum amplitude and are faded in and out with a very short sinusoidal function. NORM: [true, false] Option NORM: if true (selected) each input channel is amplitude normalised to -3dB before any merge. mauschunking: [true, false] If this parameter is set to true, the recognition module will model words as MAUS graphs as opposed to canonical chains of phonemes. This will slow down the recognition engine, but it may help with non-canonical speech (e.g., accents or dialects). minSpeakNumber: [0.0, 999999.0] Option minSpeakNumber defines a hard lower bound of the number of detected speakers. 
If set to 0 (default), no lower bound. INSORTTEXTGRID: [true, false] Option INSORTTEXTGRID: Switch to create an additional tier ORT in the TextGrid output file with a word segmentation labelled with the orthographic transcript (taken from the input ORT tier); this option is only effective if the input BPF contains an additional ORT tier. WEIGHT: The option WEIGHT weights the influence of the statistical pronunciation model against the acoustical scores. More precisely, the pronunciation model score (log likelihood) is multiplied by WEIGHT before it is added to the acoustical score within the search. Since the pronunciation model in most cases favors the canonical pronunciation, increasing WEIGHT will at some point cause MAUS to always choose the canonical pronunciation; lower values of WEIGHT will allow less probable paths to be selected according to acoustic evidence. If the acoustic quality of the signal is very good and the HMMs of the language are well trained, it makes sense to lower WEIGHT. For most languages this option defaults to 1.0. In an evaluation on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) WEIGHT was optimized to 7.0. Note that this might NOT be the optimal value for other languages. For instance Italian shows best results with WEIGHT=1.0, Estonian with WEIGHT=2.5. If set to default, a language specific optimal value is chosen automatically. minanchorlength: [2.0, 8.0] The chunker performs speech recognition and symbolic alignment to find regions of correctly aligned words (so-called 'anchors'). Setting this parameter to a high value (e.g. 4-5) means that the chunker finds chunk boundaries with higher certainty. However, the total number of discovered chunk boundaries may be reduced as a consequence. A low value (e.g. 2) is likely to lead to a more fine-grained chunking result, but with lower confidence for individual chunk boundaries. 
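To illustrate the WEIGHT option described above, here is a minimal Python sketch of the score combination as stated in the text (acoustic score plus WEIGHT times the pronunciation model log likelihood). The function names and all score values are hypothetical and chosen only to show the effect; the real MAUS search operates on full pronunciation graphs, not on two hand-picked variants.

```python
def combined_score(acoustic_loglik, pron_loglik, weight=1.0):
    """Acoustic score plus WEIGHT times the pronunciation model log likelihood."""
    return acoustic_loglik + weight * pron_loglik


def best_variant(variants, weight):
    """Return the name of the variant with the highest combined score.

    variants: list of (name, acoustic_loglik, pron_loglik) tuples.
    """
    return max(variants, key=lambda v: combined_score(v[1], v[2], weight))[0]


# Hypothetical scores: the reduced form fits the audio better,
# the canonical form is more probable under the pronunciation model.
VARIANTS = [
    ("canonical", -105.0, -0.2),
    ("reduced", -100.0, -2.0),
]

if __name__ == "__main__":
    for w in (1.0, 7.0):
        print(w, best_variant(VARIANTS, w))
```

With these made-up numbers the reduced variant wins at WEIGHT=1.0, while WEIGHT=7.0 pushes the decision to the canonical variant, which mirrors the behaviour described for increasing WEIGHT.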
LANGUAGE: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, swe-SE, tha-TH, guf-AU] Language: RFC5646 locale code of the processed speech; defines the phoneme set of the input and the orthographic system of the input text (if any); we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; the code 'sampa' ('Language independent') allows the user to upload a customized mapping from orthographic to phonologic form (see option 'imap'). Special languages: 'gsw-CH' denotes text written in Swiss German 'Dieth' transcription (https://en.wikipedia.org/wiki/Swiss_German); 'gsw-CH-*' are localized varieties in larger Swiss cities; 'jpn-JP' (Japanese) accepts Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; 'aus-AU' (Australian Aboriginal languages, including Kunwinjku, Yolnu Matha) accepts the so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); 'fas-IR' (Persian) accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details); 'arb' is a macro language covering all Arabic varieties; the input must be encoded in a broad phonetic romanization developed by Jalal Tamimi and colleagues (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/TamimiRomanization.pdf for details). 
The language code is passed to all services of the pipeline, thus influencing the way these services will process the speech. If one member of the PIPE does not support the language, the service will try to determine another suitable language (WARNING is issued) or, if that is not possible, an ERROR is returned. Note that some services will support more languages than offered in the pipeline service, but we restrict the pipeline languages to a reasonable core set that is supported by most services. NHANS: [none, denoiser, separator] Option NHANS: the N-HANS audio enhancement mode (default: 'none') applied to the result of the SoX pipeline. 'denoiser' : the noise as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal; if another voice or noise sample is uploaded in option file 'pos' (optional), this noise/voice is being preserved in the signal together with the main voice. 'separator' : an interference speaker or speaker group as represented in the sample recording uploaded in the mandatory option file 'neg' is removed from the signal while the voice of a target speaker as uploaded in the mandatory option file 'pos' is being preserved in the signal. Both sample signals, 'neg' and 'pos', are applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice or of the other noise sample. USEAUDIOENHANCE: [true, false] Switch on the signal normalization 'AudioEnhance' (true). maxlength: [0.0, 999.0] Maximum subtitle length. If set to 0, subtitles of indefinite length are created, based only on the distance of the split markers. If set to a value greater than 0, subtitles are split whenever a stretch between two neighbouring split markers is longer than that value (in words). 
Caution: This may lead to subtitle splits in suboptimal locations (e.g. inside syntactic phrases). KEEP: [true, false] Keep everything (KEEP): If set to true (default: false), the service will return a ZIP archive instead of the output of the last service in PIPE. The ZIP is named as the output file name (as defined in OUT) with extension zip and contains the following files: input(s) including optional files (e.g. RULESET), all intermediary results of the PIPE, the result of the pipeline, and a protocol listing all options; all stored files in the ZIP start with the file name body of the SIGNAL input followed by the marker '_LABEL', which indicates from which part of the pipe the file is produced, and the appropriate file type extension; 'LABEL' is one of INPUT, AUDIOENHANCE (which marks the pre-processed media file), TEXTENHANCE (which marks the pre-processed text input file if applicable), ASR, CHUNKER, CHUNKPREP, G2P, MAUS, PHO2SYL, ANONYMIZER, SUBTITLE and README (which marks the protocol file). The protocol file contains a simple list of 'option = value' pairs. The result file(s) of the pipeline have no '_LABEL' marker. The KEEP option is useful for documenting scientific pipeline runs, and for retrieving results that are produced by the PIPE but are overwritten/not passed on by later services (e.g. an anonymized video or CHUNKER output). LEFT_BRACKET: One or more characters which mark comments reaching until the end of the line (default: #). E.g. if your input text contains comment lines that begin with ';', set this option to ';' to avoid that these comments are treated as spoken text. If you want to suppress the default '#' comment character, set this option to 'NONE'. If you are using comment lines in your input text, you must be absolutely sure that the comment character appears nowhere in the text except in comment lines! Note 1: the characters '&', '|' and '=' do not work as comment characters. 
Note 2: for technical reasons the value for this option cannot be empty. Note 3: the default character '#' cannot be combined with other characters, e.g. if you define this option as ';#', the '#' will be ignored. Note 4 (sorry): for the service 'Subtitle' comment lines must be terminated with a so-called 'final punctuation sign', i.e. one of '.!?:…'; otherwise, an immediately following speaker marker will not be recognized. nrm: [yes, no] Text normalization. Currently available for German and English only. Detects and expands 22 non-standard word types. All output file types are supported, but the option is not available for the following tokenized input types: bpf, TextGrid, and tcf. If switched off, only number expansion is carried out. LOWF: [0.0, 30000.0] Option LOWF: lower filter edge in Hz. If set >0Hz and HIGHF is 0Hz, a high pass filter with LOWF Hz is applied; if set >0Hz and HIGHF is set higher than LOWF, a band pass between LOWF and HIGHF is applied; if set >0Hz and HIGHF is set higher than 0Hz but lower than LOWF, a reject band pass between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum. WHITESPACE_REPLACEMENT: The character that whitespace in comments should be substituted by (default: '_'). The BAS WebServices require that annotation markers or comment lines in input texts do not contain white spaces. This option lets you decide which character should be used to replace the white spaces. If set to the string 'NONE' no replacement takes place. CAUTION: the characters '&' and '=' do not work as replacements. CHANNELSELECT: Option CHANNELSELECT: list of comma-separated channel numbers that are selected for further processing from the input media file. 
Examples: MONO=true,CHANNELSELECT="" : merge multi-channel files into one channel, MONO=true,CHANNELSELECT="2,3,4" : merge only selected channels into one channel, MONO=false, CHANNELSELECT="3,4,1,2" : select and re-arrange channels, MONO=false, CHANNELSELECT="" : do nothing. Note that channels are numbered starting with 1 = left channel in stereo, 2 = right channel, ... By reversing the order of channel numbers in CHANNELSELECT you can swap channels, e.g. CHANNELSELECT="2,1" MONO=false will swap the left and right channels of a stereo signal. marker: [punct, newline, tag] Marker used to split the transcription into subtitles. If set to 'punct' (default), the transcription is split after 'terminal' punctuation marks (currently [.!?:…]). If set to 'newline', the transcription is split at newlines (\n or \r\n). If set to 'tag', the program expects a special < BREAK > tag inside the transcription (without the blanks between the brackets and BREAK!). USEREMAIL: Option USEREMAIL: if a valid email address is provided through this option, the service will send the XML file containing the results of the service run to this address after completion. It is recommended to set this option for long recordings (batch size <6, length >1h) since it is often problematic to wait for service completion over an unstable internet connection or from a laptop that might go into hibernation. The email address provided is not stored on the server. It is sometimes even advisable to close the browser tab after starting the call and wait for the result emails (only for batch size <6!). Beware: the download link to your result(s) will be valid for 24h after you receive the email; after that all your data will be purged from the server. 
Disclaimer: the usage of this option is at your own risk; the key URL to download your result file will be sent without encryption in this email; be aware that anybody who can intercept this email will be able to access your result files using this key; the BAS at LMU Munich will not be held responsible for any security breach caused by using this email notification option. boost: [true, false] If set to true (the default), the chunker will start by running a so-called boost phase over the recording. This boost phase uses a phoneme-based decoder instead of speech recognition. Usually, the boost option reduces processing time. On noisy input or faulty transcriptions, the boost option can lead to an increase in errors. In this case (or if a previous run with boost set to 'true' has led to chunking errors), set this option to 'false'. except: Exception dictionary file overwriting the standard G2P output. Format: 2 semicolon-separated columns: word;transcript. Phonemes in transcript must be blank-separated. Example: sagt;z ' a x t. Note that the transcript must not contain phonemic symbols that are unknown to other services in the pipeline for the selected language; the service 'WebMAUS General' provides a list of all known symbols of a language. MINPAUSLEN: [0.0, 999.0] Option MINPAUSLEN: Controls the behaviour of optional inter-word silence. If set to 1, MAUS will detect all inter-word silence intervals that can be found (minimum length for a silence interval is then 10 msec = 1 frame). If set to values n>1, the minimum length for an inter-word silence interval to be detected is set to n*10 msec. For example MINPAUSLEN of 5 will cause MAUS to suppress inter-word silence intervals up to a length of 40msec. Since 40 msec seems to be the border of perceivable silence, we set this option default to 5. In other words: inter-word silences shorter than 50msec are not segmented but rather distributed equally to the adjacent segments. 
If one of the adjacent segments happens to be a plosive then the deleted silence interval is added totally to the plosive; if both adjacent segments are plosives, the interval is equally spread as with non-plosive adjacent segments. forcechunking: [true, false, rescue] If this parameter is set to true, the chunker will run in the experimental 'forced chunking' mode (chunker option 'force'). While forced chunking is much more likely to return a fine-grained chunk segmentation, it is also more prone to chunking errors. As a compromise, you can also set this parameter to 'rescue'. In this case, the forced chunking algorithm is only invoked when the original algorithm has returned chunks that are too long for MAUS. NOINITIALFINALSILENCE: [true, false] Option NOINITIALFINALSILENCE: Switch to suppress the automatic modeling of an optional leading/trailing silence interval. This is useful if, for instance, the signal is known to start with a stop and no leading silence, and the silence model would 'capture' the silence interval from the plosive. InputTierName: Option InputTierName: Only needed if TEXT is in TextGrid/ELAN format. Name of the annotation tier that contains the input words/chunks. BRACKETS: One or more pairs of characters which bracket annotation markers in the input. E.g. if your input text contains markers '{Lachen}' and '[noise]' that should be passed as markers and not as spoken text, set this option to '{}[]'. Note that blank replacement within such markers (see option 'WHITESPACE_REPLACEMENT') only takes place in markers/comments that are defined here. OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, srt, sub, vtt, par] Option OUTFORMAT: the output format of the pipe. Note that this depends on the selected PIPE, more precisely, whether the last service in the pipeline supports the format; if not, an ERROR is returned. 
Possible (selectable) formats are: 'TextGrid' - a Praat compatible TextGrid file; 'bpf' - a BPF file (if the input (TEXT) is also a BPF file, the input is usually copied to the output with new (or replaced) tiers); 'csv' - a spreadsheet (CSV table) containing the most prominent tiers of the annotation; 'emuDB' - an Emu compatible *_annot.json file; 'eaf' - an ELAN compatible annotation file; 'exb' - an EXMARaLDA compatible annotation file; 'tei' - an ISO TEI document; 'srt' - a SubRip subtitle format file; 'sub' - a SubViewer subtitle format file; 'vtt' - a WebVTT subtitle format file. If the output format is 'vtt' and a subtitle starts with a speaker marker of the form '<...>', a 'v ' is inserted before the '...'. For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR. Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB(). Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats. Note 3: some format conversions are not lossless; select 'bpf' to be sure that no information is lost. syl: [yes, no] Switches on syllabification of the pronunciation in the KAN tier produced by module G2P; the syllable boundary marker is '.'. This option only makes sense in languages in which the module G2P produces a different syllabification than the module PHO2SYL (e.g. tha-TH). Otherwise use a pipe that ends with the module PHO2SYL which will create tiers MAS (phonetic syllable) and KAS (phonologic syllable). 
WARNING: syl=yes causes G2P to switch off MAUS embedded mode; this might change the output for some languages because the output phoneme inventory is then SAMPA and not the SAMPA variant used by MAUS. Subsequent modules like MAUS might report an ERROR then. ENDWORD: [0.0, 999999.0] Option ENDWORD: If set to a value n<999999, this option causes MAUS to end the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option STARTWORD. wsync: [yes, no] Yes/no decision, whether each word boundary is considered as a syllable boundary. Only relevant for phonetic transcription input from MAU, PHO, or SAP tiers (for input from the KAN tier this option is always set to 'yes'). If set to 'yes', each syllable is assigned to exactly one word index. If set to 'no', syllables can be part of more than one word. UTTERANCELEVEL: [true, false] Switch on utterance level modelling (true); only for PIPEs with text input. Every TEXT input line is modelled as an utterance in an additional annotation layer ('TRL') between recording (bundle) and words (ORT). This is useful if the recording contains several sentences/utterances and you need hierarchical access to these in the resulting annotation structure. For example, in EMU-SDMS output the default hierarchy bundle->ORT->MAU is then changed to bundle->TRL->ORT->MAU. Note 1 : does not have any effect in CSV output. Note 2 : the use of this option causes the ORT tier to contain the raw word tokens instead of the (default) word-normalized word tokens (e.g. '5,' (raw token) vs. 'five' (word-normalized)). featset: [standard, extended] Feature set used for grapheme-phoneme conversion. The standard set is the default and comprises a letter window centered on the grapheme to be converted. The extended set additionally includes part of speech and morphological analyses. 
The extended set is currently available for German and British English only. For connected text the extended feature set generally yields better performance. However, if the input comprises a high amount of proper names provoking erroneous part of speech tagging and morphologic analyses, then the standard feature set is more robust. pos: Option pos : N-HANS sample recording (RIFF WAVE *.wav) of the noise to be preserved in the signal (mode 'denoiser') or the target speaker to be preserved in the signal (mode 'separator'). The 'pos' sample is applied to all processed input signals; do not upload more than 2sec of clean signal, and make sure that the relevant signal is present within the very first second; 'clean signal' means that the sample should not contain any traces of the main voice (mode 'denoiser') nor of the 'neg' noise sample (modes 'denoiser' and 'separator'). The upload of the 'pos' sample is mandatory for N-HANS mode 'separator' and optional for mode 'denoiser' (see option 'NHANS'). APHONE: Option APHONE: the string used to mask phonetic/phonologic labels for anonymized terms. If not set, the service will use the label 'nib' for masking encodings in SAMPA, and the label '(.)' for encodings in IPA. If set to another label, this label is used to mask in all encodings. INSPROB: Option INSPROB: The option INSPROB influences the probability of deletion of segments. It is a constant factor (a constant value added to the log likelihood score) after each segment. Therefore, a higher value of INSPROB will cause the probability of segmentations with more segments to go up, thus decreasing the probability of deletions (and increasing the probability of insertions, which are rarely modelled in the rule sets). This parameter has been evaluated on parts of the German Verbmobil data set (27425 segments) which were segmented and labelled manually (MAUS DEV set) and found to have its optimum at 0.0 (which is nice). 
Therefore we set the default value of INSPROB to 0.0. INSPROB was also tested against the MAUS TEST set to confirm the value of 0.0. It had an optimum at 0.0 as well. Note that this might NOT be the optimal value for other MAUS tasks. OUTSYMBOL: [x-sampa, ipa, manner, place] Option Output Encoding (OUTSYMBOL): Defines the encoding of phonetic symbols in output. If set to 'x-sampa' (default), phonetic symbols in output are encoded in X-SAMPA (with some minor differences in languages Norwegian/Icelandic in which the retroflex consonants are encoded as 'rX' instead of X-SAMPA 'X_r'); use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA. If set to 'ipa', the service produces UTF-8 IPA output in annotation tiers MAU (MAUS last module in PIPE) or in KAS/MAS (PHO2SYL last module in PIPE). Just for pipes with MAUS as the last module: if set to 'manner', the service produces the manner of articulation for each segment; possible values are: silence, vowel, diphthong, plosive, nasal, fricative, affricate, approximant, lateral-approximant, ejective; if set to 'place', the service produces the place of articulation for each segment; possible values are: silence, labial, dental, alveolar, post-alveolar, palatal, velar, uvular, glottal, front, central, back. RULESET: MAUS rule set file; UTF-8 encoded; one rule per line; there are two different file types defined by the extension: 1. Phonological rule set without statistical information '*.nrul', synopsis is: 'leftContext-match-rightContext>leftContext-replacement-rightContext', e.g. 't,s-e:-n>t,s-@-n'. 2. Rule set with statistical information '*.rul', synopsis is: 'leftContext,match,rightContext>leftContext,replacement,rightContext ln(P(replacement|match)) 0.0000', e.g. 
'P9,n,@,n,#>P9,# -3.761200 0.000000'; 'P(replacement|match)' is the conditional probability that 'match' is being replaced by 'replacement'; the sum over all conditional probabilities with the same condition 'match' must be less than 1; the difference between the sum and 1 is the conditional probability 'P(match|match)', i.e. for no change. 'leftContext/rightContext/match/replacement' = comma separated lists of SAMPA symbols or empty lists (for *.rul the leftContext/rightContext must be exactly one symbol!); special SAMPA symbols in contexts are: '#' = word boundary between words, and '<' = utterance begin (may be used instead of a phonemic symbol); digits in SAMPA symbols must be preceded by 'P' (e.g. '2:' -> 'P2:'); all used SAMPA symbols must be defined in the language specific SAMPA set (see service runMAUSGetInventar). Examples for '*.rul' : 'P9,n,@,n,#>P9,#' = 'the word final syllable /n@n/ is deleted, if preceded by /9/', '#,k,u:>#,g,u:' = 'word initial /k/ is replaced by /g/ if followed by the vowel /u:/'. Examples for '*.nrul' : '-->-N,k-' = 'insert /Nk/ at arbitrary positions', '#-?,E,s-#>#-s-#' = 'delete /?E/ in word /?Es/', 'aI-C-s,t,#>aI-k-s,t,#' = 'replace /C/ in word final syllable /aICst/ by /k/'. maxSpeakNumber: [0.0, 999999.0] Option maxSpeakNumber defines a hard upper bound of the number of detected speakers. If set to 0 (default), no upper bound. allowOverlaps: [true, false] Option allowOverlaps: If set to true, the unaltered output of PyAnnote is returned in the SPD tier (note that overlaps cannot be handled by most annotation formats; only use if you really need to detect overlaps!); if set to false (default), overlaps, missing silence intervals etc. are resolved in the output tier SPD, making this output compatible with all annotation formats. The postprocessing works as follows: 1. all silence intervals are removed. 2. all speaker segments that are 100% within another (larger) speaker segment are removed. 3. 
If an overlap occurs, the earlier segment(s) are truncated to the start of the new segment. 4. all remaining gaps in the segmentation are filled with silence intervals. minchunkduration: [0.0, 999999.0] Lower bound for output chunk duration in seconds. Note that the chunker does not guarantee an upper bound on chunk duration. SIGNAL: Mandatory parameter SIGNAL: mono sound file or video file containing the speech signal to be processed; PCM 16 bit resolution; any sampling rate. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), most pipes will also process NIST/SPHERE (nis|sph) and video (mp4|mpeg|mpg|avi|flv). stress: [yes, no] yes/no decision whether or not word stress is to be added to the canonical transcription (KAN tier). Stress is marked by a single apostrophe (') that is inserted before the syllable nucleus into the transcription. imap: Customized mapping table from orthography to phonology. If pointing to a valid mapping table, the pipeline service will automatically set the LANGUAGE option for service G2P to 'und' (undefined) while leaving the commandline option LANGUAGE for the remaining services unchanged (most likely 'sampa'). This mapping table is then used to translate the input text into phonological symbols. See https://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/readme_g2p_mappingTable.txt for details about the format of the mapping table. MODUS: [default, standard, align] Option MODUS: Operation modus of MAUS: default is to use the language dependent default modus; the two possible modi are: 'standard', which is the segmentation and labelling using the MAUS technique as described in Schiel ICPhS 1999, and 'align', in which a forced alignment is performed on the input SAM-PA string defined in the KAN tier of the BPF (the same effect as the deprecated former option CANONLY=true). 
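The four postprocessing steps listed under option allowOverlaps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the service's implementation: segments are assumed to be (start, end, label) tuples in seconds, the silence label 'SIL' and the function name are hypothetical, and zero-length leftovers after truncation are simply dropped.

```python
def resolve_overlaps(segments, total_end, silence="SIL"):
    """Make a speaker segmentation gap-free and overlap-free.

    segments: list of (start, end, label); total_end: end of the recording.
    The label 'SIL' for silence is a placeholder, not the service's symbol.
    """
    # Step 1: remove all silence intervals.
    segs = [s for s in segments if s[2] != silence]

    # Step 2: remove segments lying completely within another, larger segment.
    kept = []
    for i, (start, end, label) in enumerate(segs):
        inside = any(
            j != i and o_start <= start and end <= o_end
            and (o_end - o_start) > (end - start)
            for j, (o_start, o_end, _) in enumerate(segs)
        )
        if not inside:
            kept.append((start, end, label))
    kept.sort()

    # Step 3: truncate an earlier segment to the start of the next one.
    for k in range(len(kept) - 1):
        start, end, label = kept[k]
        next_start = kept[k + 1][0]
        if end > next_start:
            kept[k] = (start, next_start, label)
    kept = [s for s in kept if s[1] > s[0]]  # drop zero-length leftovers

    # Step 4: fill all remaining gaps with silence intervals.
    result, cursor = [], 0.0
    for start, end, label in kept:
        if start > cursor:
            result.append((cursor, start, silence))
        result.append((start, end, label))
        cursor = end
    if cursor < total_end:
        result.append((cursor, total_end, silence))
    return result
```

For example, a segment of speaker S2 that lies entirely inside a longer S1 segment disappears in step 2, and an S1 segment that runs into a later S2 segment is cut at the S2 start in step 3.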
RELAXMINDUR: [true, false] Option Relax Min Duration (RELAXMINDUR) changes the default minimum duration of 30msec for consonants and short/lax vowels and of 40msec for tense/long vowels and diphthongs to 10 and 20msec respectively. This is not optimal for general segmentation because MAUS will start to insert many very short vowels/glottal stops where they are not appropriate. But for some special investigations (e.g. the duration of /t/) it alleviates the ceiling problem at 30msec duration. ATERMS: Option ATERMS: file encoded in UTF-8 containing the terms that are to be anonymized by the service. One term per line; terms may contain blanks, in which case only consecutive occurrences of the words within the term are anonymized. RELAXMINDURTHREE: [true, false] Alternative option to Relax Min Duration (RELAXMINDUR): changes the minimum duration for all models to 3 states (= 30msec with standard frame rate). This can be useful when comparing the duration of different phone groups. STARTWORD: [0.0, 999999.0] Option STARTWORD: If set to a value n>0, this option causes MAUS to start the segmentation with the word number n (word numbering in BPF starts with 0). This is useful if the input signal file is just a segment within a longer transcript. See also option ENDWORD. INSYMBOL: [sampa, ipa] Option INSYMBOL: Defines the encoding of phonetic symbols in input. If set to 'sampa' (default), phonetic symbols are encoded in X-SAMPA (with some coding differences in Norwegian/Icelandic; use service runMAUSGetInventar with option LANGUAGE=sampa to get a list of symbols and their mapping to IPA). If set to 'ipa', the service expects blank-separated UTF-8 IPA. PRESEG: [true, false] Option PRESEG: If set to true, a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input signal (or processed chunks within the signal) has leading and/or trailing silence. 
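The ATERMS rule that a multi-word term is anonymized only where its words occur consecutively can be illustrated with a short Python sketch. It is a simplified model under assumptions that the text above does not state: matching is exact and case-sensitive, the longest matching term wins, and the mask string 'XXX' as well as the function name are hypothetical (the actual word mask is set via option AWORD).

```python
def mask_terms(words, terms, mask="XXX"):
    """Replace every consecutive occurrence of a term by the mask label.

    words: list of word tokens; terms: list of strings, one term per entry,
    where a term may consist of several blank-separated words.
    """
    term_tokens = [t.split() for t in terms if t.strip()]
    out = list(words)
    i = 0
    while i < len(words):
        matched = 0
        for tokens in term_tokens:
            n = len(tokens)
            # Prefer the longest term that matches at position i.
            if n > matched and words[i:i + n] == tokens:
                matched = n
        if matched:
            for k in range(i, i + matched):
                out[k] = mask
            i += matched
        else:
            i += 1
    return out
```

With the terms 'Anna Maria' and 'Munich', the word sequence 'Anna Maria' is masked as a unit, while a later isolated 'Maria' stays untouched because it is not a consecutive occurrence of the whole term.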
AWORD: Option AWORD: the string used to mask word labels for anonymized terms. USETRN: [true, false, force] Option USETRN: If the pipe produces/processes a chunk segmentation (CHUNKER/CHUNKPREP), this option is set automatically. If set to true, MAUS searches the input BPF for a TRN tier (turn/chunk segmentation, see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatsdeu.html#TRN). The synopsis for a TRN entry is: 'TRN: (start-sample) (duration-sample) (word-link-list) (label)', e.g. 'TRN: 23654 56432 0,1,2,3,4,5,6 sentence1' (the speech within the recording 'sentence1' starts with sample 23654, lasts for 56432 samples and covers the words 0-6). If only one TRN entry is found, the segmentation is restricted to the time range given by this TRN tier entry; this is useful if there exists a reliable pre-segmentation of the recorded utterance, i.e. the start and end of speech within the recording is known. If more than one TRN entry is found, the webservice performs a segmentation for each 'chunk' defined by a TRN entry and aggregates all individual results into a single results file; this is useful if the input consists of long recordings for which a manual chunk segmentation is available. If USETRN is set to 'force' (deprecated since maus 4.11; use PRESEG=true instead!), a pre-segmentation using the wav2trn tool is done by the webservice on-the-fly; this is useful if the input BPF does not contain a TRN entry and the input signal has leading and/or trailing silence. MAUSSHIFT: Option MAUSSHIFT: If set to n, this option causes the calculated MAUS segment boundaries to be shifted by n msec (default: 0) into the future. Most likely this systematic shift is caused by a boundary bias in the training material's segmentation. The default should work for most cases. HIGHF: [0.0, 30000.0] Option HIGHF: upper filter edge in Hz. 
If set >0Hz and LOWF is 0Hz, a low pass filter with HIGHF Hz is applied; if set >0Hz and LOWF is set lower than HIGHF, a band pass between LOWF and HIGHF is applied; if set >0Hz and LOWF is set higher than HIGHF, a band-reject filter between HIGHF and LOWF is applied. E.g. HIGHF = 3000 LOWF = 300 is telephone band; HIGHF = 45 LOWF = 55 filters out a 50Hz hum. silenceonly: [0.0, 999999.0] If set to a value greater than 0, the chunker will only place chunk boundaries in regions where it has detected a silent interval of at least that duration (in ms). Else, silent intervals are prioritized, but not to the exclusion of word boundaries without silence. On speech that has few silent pauses (spontaneous speech or speech with background noise), setting this parameter to a number greater than 0 is likely to hinder the discovery of chunk boundaries. On careful and noise-free speech (e.g. audio books) on the other hand, setting this parameter to a sensible value (e.g. 200) may reduce chunking errors. boost_minanchorlength: [2.0, 8.0] If you are using the boost phase, you can set its minimum anchor length independently of the general minimum anchor length. Setting this parameter to a low value (e.g. 2-3) means that the boost phase has a greater chance of finding preliminary chunk boundaries, which is essential for speeding up the chunking process. On the other hand, high values (e.g. 5-6) lead to more conservative and more reliable chunking decisions. If boost is set to false, this option is ignored. ADDSEGPROB: [true, false] Option Add Viterbi likelihoods (ADDSEGPROB) causes the frame-normalized natural-log total Viterbi likelihood of an aligned segment to be appended to the segment label in the output annotation (the MAU tier). This might be used as a 'quasi quality measure' of how well the acoustic signal in the aligned segment has been modelled by the combined acoustical and pronunciation model of MAUS. 
Note that the values are not probabilities but likelihood densities, and therefore are not comparable for different signal segments; they are, however, comparable for the same signal segment. Warning: this option breaks the BPF standard for the MAU tier and must not be used if the resulting MAU tier is to be processed further (e.g. in a pipe). Implemented only for output phoneme symbol set SAMPA (default). Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the output file of the pipeline can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during the processing. Depending on input parameter OUTFORMAT the output file in "downloadLink" can be of several different file formats; see mandatory parameter OUTFORMAT for details. ---------------------------------------------------------------- ---------------------------------------------------------------- runTextEnhance ------------------ Description: This service reads an arbitrarily encoded text file and returns a normalized UTF-8 UNIX style text file that is suitable for processing within the BAS WebServices. It also allows bracketed markers (e.g. '{Laughing loud}') in the input text to be mapped to a form that is recognized (and passed through) by the BAS WebServices (e.g. '<{Laughing_loud}>'). Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F left-bracket=# -F replace-whitespace-char= -F infile=@ -F brackets= 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runTextEnhance' Parameters: [left-bracket] [replace-whitespace-char] infile [brackets] Parameter description: left-bracket: One or more characters which mark comments reaching until the end of the line (default: '#'). E.g. 
if your input text contains comment lines that begin with ';', set this option to ';' to prevent these comments from being treated as spoken text. If you want to suppress the default '#' comment character, set this option to 'NONE'. If you are using comment lines in your input text, you must be absolutely sure that the comment character appears nowhere in the text except in comment lines! Note 1: the characters '&', '|' and '=' do not work as comment characters. Note 2: for technical reasons the value for this option cannot be empty. Note 3: the default character '#' cannot be combined with other characters, e.g. if you define this option as ';#', the '#' will be ignored. Note 4 (sorry): for the service 'Subtitle' comment lines must be terminated with a so-called 'final punctuation sign', i.e. one of '.!?:…'; otherwise, an immediately following speaker marker will not be recognized. replace-whitespace-char: The character that whitespace in comments should be substituted by (default: none). The BAS WebService G2P requires that annotation markers or comment lines in input texts do not contain white spaces. This option lets you decide which character should be used to replace the white spaces. For further processing in G2P it is recommended to set this option to '_'. If set to the string 'NONE' no replacement takes place (default). CAUTION: the characters '&' and '=' do not work as replacements. infile: Input text file. Most common formats and encodings will be recognized automatically. brackets: One or more pairs of characters which bracket annotation markers in the input. E.g. if your input text contains markers '{Lachen}' and '[noise]' that should be passed as markers and not as spoken text, set this option to '{}[]'. Note that blank replacement within such markers (see option 'replace-whitespace-char') only takes place in markers/comments that are defined here. Output: An XML response containing the tags "success", "downloadLink", "output" and "warning". 
"success" states whether the processing was successful or not, "downloadLink" specifies the location where the output file is provided; depending on parameter 'outformat' this can be a BPF file (*.par), a SubRip subtitle format (*.srt), or a SubViewer subtitle format (*.sub). The BPF contains the content of the input BPF (option "bpf") with appended TRO and TRN tier (existing TRO/TRN tiers in the BPF input are over-written). The TRO tier contains the mapping from the ORT tier to the input transcription; the TRN tier contains the subtitle grouping. ---------------------------------------------------------------- ---------------------------------------------------------------- runVoiceActivityDetection ------------------ Description: This service automatically segments the input signal into speech and silence intervals. The result is a simple annotation file with one segmentation layer (called 'VAD'). The algorithm is a Keras-based DNN that classifies each signal frame (10msec) as speech or silence, followed by a smoothing stage that accumulates silence/speech frames into more realistic stretches of silence (labelled with 'p:') and speech (labelled with 'speech'); see options 'aggressivity', 'minSilence' and 'minVoice' to influence this process. Output labels can be augmented by a confidence measure about the decision (option 'showConfidence'). Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F input=@ -F minVoice=100 -F minSilence=100 -F aggressivity=50 -F showConfidence=false -F outformat=bpf 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runVoiceActivityDetection' Parameters: input [minVoice] [minSilence] [aggressivity] [showConfidence] [outformat] Parameter description: input: Input file containing the speech signal. All media formats that AudioEnhance supports. minVoice: [0.0, 9999.0] Minimum length of a speech interval. Shorter detected speech intervals are removed. 
If set to a large value, the output will contain one single silence segment. minSilence: [1.0, 999999.0] Minimum length of a silence interval. Shorter detected silence intervals are removed. If set to a large value, the output will contain one single speech segment. aggressivity: [1.0, 999999.0] Aggressivity in percent of classification (1-99). Higher values make the resulting classification more prone to classifying silence. Smaller values make it more prone to classifying voice. Technically, this value is the threshold for the output probabilities of the DNN to decide for silence (DNN probability higher than this value) or for speech (DNN probability lower) regarding a single speech frame. Do not change this from the default of 50 (prob. is 0.5) if you are also calculating confidence measures, because values other than 50 will grossly distort the confidence measures. showConfidence: [true, false] If set to true, the labels are augmented by a ';' followed by a confidence measure that expresses the mean confidence of the classification. Technically, the confidence measure is the distance of the frame-wise DNN output probability to the threshold (see option 'aggressivity') averaged over all frames of the segment. outformat: [bpf, csv, eaf, emuDB, exb, TextGrid, tei] Output format: default is BAS Partitur Format; the tier name is VAD (see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for details about the BPF format); other supported annotation formats are: all formats supported by BAS WebService 'AnnotConv'. Silence intervals are labelled with 'p:', speech intervals with 'speech'. Output: An XML response containing the tags 'success', 'downloadLink', 'output' and 'warning'. 'success' states whether the processing was successful or not, 'downloadLink' specifies the location where the output file is provided; depending on parameter 'outformat' this can be a BPF file (*.par) or any other format supported by BAS WebService AnnotConv. 
The output annotation contains the content of the input BPF (option 'bpf') together with the appended VAD tier. ---------------------------------------------------------------- ---------------------------------------------------------------- runASRGetQuota ------------------ Description: Returns an XML element 'basQuota' with four sub-elements: 'ASRType' : the value of parameter 'ASRType' (see below); 'secsAvailable' : the ASR quota still available for the ASR service in secs (0 if the monthly quota has been used up, 999999 if unlimited); 'monthlyQuota' : the monthly quota in secs (999999 if unlimited quota); 'error' : the error message of the backend script (if empty, no error occurred). Example curl call is: curl -v -X GET -H 'content-type: application/x-www-form-urlencoded' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runASRGetQuota?ASRType=' Parameters: [ASRType] Parameter description: ASRType: The name of the ASR service (name of backend script, e.g. 'callGoogleASR') for which the quotas are requested. Values are: callAmberscriptASR, callEMLASR, callFraunhoferASR, callGoogleASR, callLSTDutchASR, callLSTEnglishASR, callWatsonASR. Output: The free quota time in seconds as a string. ---------------------------------------------------------------- ---------------------------------------------------------------- getLoadIndicatorXML ------------------ Description: Returns an indicator of how high the server load is: 0 (for low load, i.e., less than 50 percent load), 1 (for middle load, i.e., between 50 and 100 percent load), and 2 (for high load, i.e., more than 100 percent load). Additionally, the last 20 values of the un-normalized server load in 20sec steps are returned in a list. Example curl call is: curl -v -X GET -H 'content-type: multipart/form-data' 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/getLoadIndicatorXML' Parameter description: Output: XML snippet that characterises the load of the server. 
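As an illustration of the 'basQuota' element described for runASRGetQuota above, here is a minimal Python sketch that reads its four sub-elements. It assumes that the sub-elements are direct children of 'basQuota' with plain text content; the sample XML string and the helper name parse_bas_quota are made up for this example and are not taken from the service.

```python
import xml.etree.ElementTree as ET

UNLIMITED = 999999  # value the description above uses for an unlimited quota


def parse_bas_quota(xml_text):
    """Parse a 'basQuota' element into a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Accept 'basQuota' either as the root element or nested below it.
    quota = root if root.tag == "basQuota" else root.find(".//basQuota")
    if quota is None:
        raise ValueError("no 'basQuota' element found")

    def text(tag):
        # Missing or empty sub-elements are returned as an empty string.
        return (quota.findtext(tag) or "").strip()

    secs = int(float(text("secsAvailable") or 0))
    monthly = int(float(text("monthlyQuota") or 0))
    return {
        "ASRType": text("ASRType"),
        "secsAvailable": secs,
        "monthlyQuota": monthly,
        "unlimited": monthly == UNLIMITED,
        "error": text("error"),  # empty string means no error
    }


# Made-up sample response, for illustration only.
SAMPLE = (
    "<basQuota><ASRType>callGoogleASR</ASRType>"
    "<secsAvailable>1200</secsAvailable>"
    "<monthlyQuota>3600</monthlyQuota><error></error></basQuota>"
)
quota = parse_bas_quota(SAMPLE)
```

A caller would typically check quota["error"] first and only then compare quota["secsAvailable"] with the duration of the recording to be sent to the ASR service.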
---------------------------------------------------------------- ---------------------------------------------------------------- runPho2Syl ------------------ Description: Syllabification of canonical (phonological) or spontaneous (phonetic) speech transcriptions. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F wsync=no -F lng=deu-DE -F tier=MAU -F rate=0 -F outsym=sampa -F i=@ -F oform=bpf 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runPho2Syl' Parameters: [wsync] [lng] [tier] [rate] [outsym] i [oform] Parameter description: wsync: [yes, no] Yes/no decision, whether each word boundary is considered as syllable boundary. Only relevant for phonetic transcription input from MAU, PHO, or SAP tiers (for input from the KAN tier this option is always set to 'yes'). If set to 'yes', each syllable is assigned to exactly one word index. If set to 'no', syllables can be part of more than one word. lng: [aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, cze-CZ, nld-NL, eng-AU, eng-GB, eng-NZ, eng-US, ekk-EE, fin-FI, fra-FR, kat-GE, deu-DE, gsw-CH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, swe-SE, tha-TH, guf-AU, und] RFC5646 locale language code of the speech to be syllabified; defines the possible SAMPA phoneme symbol set in input; we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; alternatively, the ISO 639-3 language code is supported; non-standard language codes: 'nze' stands for New Zealand English, 'use' for American English. 'und' (undefined) can be used to syllabify X-SAMPA input independent of a language (experimental). tier: [KAN, MAU, PHO, SAP] Name of tier in the input BPF file, whose content is to be syllabified. 
Currently only the BPF tiers MAU, PHO, SAP (producing a new tier MAS), and KAN (producing a new tier KAS) are supported; phonemic encoding must be (X-)SAMPA. Tier KAN contains the canonical phonological transcript usually created by the service G2P. Tier MAU contains the phonetic transcript (and segmentation) generated by service MAUS. Tiers PHO and SAP contain different phonetic segmentation formats based on extended (X-)SAMPA, e.g. with symbols indicating morpheme boundaries and deviations from the canonical pronunciation. The non-SAMPA content of PHO and SAP is removed before syllabification. rate: [0.0, 999999.0] Option rate: Only needed for TextGrid output format, if the input BPF does not contain this information (header entry 'SAM:'). Sample rate is needed to convert sample values from BAS Partitur Format to seconds in TextGrid. outsym: [sampa, ipa] Output phoneme symbol inventory. Default is X-SAMPA ('sampa') compatible with the input encoding; alternative is IPA ('ipa') encoded in UTF-8 (not BPF conform!). i: Input BAS Partitur Format file (BPF) containing the tier (specified by the input parameter 'tier') to be syllabified. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of BPF. oform: [bpf, exb, csv, tg, emuDB, eaf, tei] Output format: 'bpf' is BAS Partitur Format (see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of BPF); 'tg' is praat TextGrid (see http://www.praat.org); 'csv' is a spreadsheet table; 'emuDB' is an annotation file of the EMU-SDMS (see https://ips-lmu.github.io/EMU.html); 'eaf' is an ELAN EAF; 'tei' is an ISO TEI document. Output: A BAS Partitur or TextGrid file additionally containing the syllabified output. If the input parameter 'tier' is set to MAU, SAP, or PHO, a tier MAS is generated that contains for each syllable analogously to tier MAU the time onset, duration, word index, and syllable string information. 
If 'tier' is set to KAN, a tier KAS is generated that contains for each word analogously to KAN the canonical transcription with phonemes separated by blanks and syllables separated by dots. ---------------------------------------------------------------- ---------------------------------------------------------------- runFormantAnalysis ------------------ Description: This service reads pair(s) of signal + annotation files of a single speaker, creates an EMU database with a phonetic segmentation, and performs an automatic formant analysis on selected vowels. The formant analysis is performed only in the web interface version of this service; therefore this service has no public REST interface. SIGNAL must be one of wav,nis,nist,sph,mp4,mpeg,mpg; TEXT must be one of txt,par,TextGrid,eaf,csv,_annot.json; OUTFORMAT is accepted by the script but ignored (for compatibility with the Web API: the Web API needs this to allow the assembly of an emuDB). The service calls runPipeline with PIPE=G2P_MAUS_PHO2SYL, if both SIGNAL and TEXT are given, or PIPE=G2P_CHUNKER_MAUS_PHO2SYL, if the number of words in TEXT exceeds 3000 words, or PIPE=CHUNKPREP_G2P_MAUS_PHO2SYL, if TEXT is of type TextGrid|EAF|CSV, or simply assembles the input *_annot.json files in an emuDB (without checking the structure!). runPipeline is called with the following options (other than defaults): OUTFORMAT=emuDB diarization=true [InputTierName=InputTierName] The service accepts options for the formant analysis which are ignored by the RESTful service but passed on to the bpf-to-emuDB converter of the web interface. This service only makes sense when called from the BAS WebService Web API; therefore this service cannot be called as a RESTful service like other BAS services. 
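The pipeline choice described above for runFormantAnalysis can be sketched as a small decision function in Python. The order in which the conditions are checked is an assumption (the description does not state a precedence), and the function and argument names are hypothetical; *_annot.json input, which is assembled without calling runPipeline, is represented by None.

```python
import os

# TEXT types that carry a chunk segmentation, per the description above.
CHUNK_SEGMENTATION_EXTS = {".textgrid", ".eaf", ".csv"}
WORD_LIMIT = 3000  # word count above which the CHUNKER is added


def select_pipe(text_filename, word_count):
    """Return the PIPE name for a given TEXT input (hypothetical helper)."""
    name = text_filename.lower()
    if name.endswith("_annot.json"):
        return None  # emuDB annotation: assembled directly, no pipeline
    ext = os.path.splitext(name)[1]
    if ext in CHUNK_SEGMENTATION_EXTS:
        return "CHUNKPREP_G2P_MAUS_PHO2SYL"
    if word_count > WORD_LIMIT:
        return "G2P_CHUNKER_MAUS_PHO2SYL"
    return "G2P_MAUS_PHO2SYL"
```

For example, select_pipe("interview.par", 5000) selects the CHUNKER variant under these assumptions, while a TextGrid input always selects CHUNKPREP regardless of length.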
Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F imap=@ -F gender=unknown -F TEXT=@ -F emuRDBname=FORMANTANALYSISOUTPUT -F sounds=a:,e:,i:,o: -F computeERatio=false -F midpoint=false -F InputTierName=unknown -F outlierMetric=euclid -F outlierThreshold=250 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runFormantAnalysis' Parameters: SIGNAL [LANGUAGE] [imap] [gender] TEXT [emuRDBname] [sounds] [computeERatio] [midpoint] [InputTierName] [outlierMetric] [outlierThreshold] Parameter description: SIGNAL: Mandatory parameter SIGNAL: mono sound file or video file containing the speech signal to be processed; PCM 16 bit resolution; any sampling rate. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), NIST/SPHERE (nis|nist|sph) and video (mp4|mpeg|mpg) are also accepted. LANGUAGE: [aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-NL, nld-NL-GN, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-AT, deu-CH, deu-DE, deu-DE-OH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, spa-ES, swe-SE, tha-TH, guf-AU] Language: RFC5646 locale code of the processed speech; defines the phoneme set of input and the orthographic system of input text (if any); we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; the code 'sampa' ('Language independent') allows the user to upload a customized mapping from orthographic to phonologic form (see option 'imap'). 
Special languages: 'gsw-CH' denotes text written in Swiss German 'Dieth' transcription (https://en.wikipedia.org/wiki/Swiss_German); 'gsw-CH-*' are localized varieties in larger Swiss cities; 'jpn-JP' (Japanese) accepts Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; 'aus-AU' (Australian Aboriginal languages, including Kunwinjku, Yolnu Matha) accepts so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); 'fas-IR' (Persian) accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details); 'arb' is a macro language covering all Arabic varieties; the input must be encoded in a broad phonetic romanization developed by Jalal Tamimi and colleagues (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/TamimiRomanization.pdf for details). The language code is passed to all services of the pipeline, thus influencing the way these services will process the speech. If one member of the PIPE does not support the language, the service will try to determine another suitable language (WARNING is issued) or, if that is not possible, an ERROR is returned. If the service does not currently support the language, an ERROR is returned. imap: Only needed if the option 'Language' is set to 'Independent' (undefined); then you must provide a G2P mapping table from orthography to phonology through this option. This mapping table is used to translate the input text into phonological symbols. See https://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/readme_g2p_mappingTable.txt for details about the format of the mapping table. gender: [unknown, female, male] Gender of speaker; if set to 'unknown' the service will determine the gender from the fundamental frequency. 
OUTFORMAT: Option OUTFORMAT: the output format: always 'emuDB' (= a *_annot.json file). This parameter is only here to signal to the Web API that there is emuDB output; it cannot be changed via this API. TEXT: Optional parameter TEXT: The (optional) textual input corresponding to SIGNAL; usually some form of text or transcript. If this option is omitted, the service will apply automatic transcription using the runASR service. The input can be a plain text (txt), a praat TextGrid (TextGrid), or a BAS Partitur Format (par) file. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF. emuRDBname: This option sets the data base name of the output EMU database. sounds: The list of extracted and analysed vowels; use SAMPA symbols as defined for the selected language and separate each symbol by a comma; for a list of valid SAMPA symbols in the chosen language click on the button 'Show inventory' left of the option 'Language'. computeERatio: [true, false] If set, for each sound token the eRatio to all other vowel group centers is calculated and output in a separate table. Beware: if the number of sound classes is more than 4 (see option 'List of sounds') this can take a very long time and produce a very large table. midpoint: [true, false] If set, the formant value of a sound token will be taken from the exact midpoint of each token track, not averaged from the 50% midpoint section of the track. InputTierName: Only needed if TEXT is in TextGrid|EAF format: the name of the annotation tier that contains the input chunk segmentation. outlierMetric: [euclid, mahalanobis] The metric, euclid or mahalanobis, used to calculate outlier distances from the vowel group center. outlierThreshold: The threshold for removing sound tokens as outliers: if the distance of the token in formant space is larger than this value in Hz, the token is treated as an outlier and not shown in output plots/tables that have a 'NoOutliers' in the file name. 
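To illustrate the 'outlierThreshold' option with the 'euclid' metric, here is a minimal Python sketch. It assumes that each token is represented by an (F1, F2) pair in Hz and that the vowel group center is the mean of these pairs; neither detail is specified above, and the function name and the formant values are made up for this example.

```python
import math


def split_outliers(tokens, threshold_hz=250.0):
    """Split (F1, F2) tokens of one vowel group into (kept, outliers)."""
    if not tokens:
        return [], []
    # Assumed group center: the mean F1 and mean F2 over all tokens.
    c1 = sum(f1 for f1, _ in tokens) / len(tokens)
    c2 = sum(f2 for _, f2 in tokens) / len(tokens)
    kept, outliers = [], []
    for f1, f2 in tokens:
        distance = math.hypot(f1 - c1, f2 - c2)  # Euclidean distance in Hz
        (outliers if distance > threshold_hz else kept).append((f1, f2))
    return kept, outliers


# Made-up formant values for one vowel group; the last token is far off.
tokens = [
    (700.0, 1200.0), (720.0, 1250.0), (690.0, 1180.0),
    (710.0, 1220.0), (705.0, 1210.0), (400.0, 1900.0),
]
kept, outliers = split_outliers(tokens)
```

Because the assumed center is a plain mean, a single extreme token also shifts the center itself; with very few tokens per vowel, a threshold such as the 250 used in the example call may therefore flag more tokens than expected.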
Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the result file (emuDB annotation) can be found (the format of the file depends on the option selected in OUTFORMAT), "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during the processing. ---------------------------------------------------------------- ---------------------------------------------------------------- runChannelSeparator ------------------ Description: This service reads a RIFF WAVE audio file with 2 or more channels that each contain the recording of a speaker and performs a version of Volker Dellwo's 'Frankensteins Channel Segregator', and then returns the resulting audio file with the same number of channels that are now completely separated from each other. Complete separation here means that only one channel (the channel of the dominant speaker) contains a signal at every time point while all other channels are muted. If this is done without errors, the channels then contain only cross talk of other speakers when speakers overlap, but the usual crosstalk that arises when more than one speaker is being recorded in the same acoustic environment and only one speaker speaks is gone. This can have a positive effect on automatic speech recognition. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F WINLEN=16 -F NONORM=false -F WAV=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runChannelSeparator' Parameters: [WINLEN] [NONORM] WAV Parameter description: WINLEN: [1.0, 200.0] Length of window in msec in which the short time energy is calculated (frame length) for channel comparison. NONORM: [true, false] Switches off channel normalisation. If set to true, the default channel normalisation is switched off. 
The channel normalisation tries to make the channels comparable with each other, even if one channel has a much lower recording gain than the others. This improves the algorithm's ability to decide at each time point which speaker is the one who is currently speaking. WAV: The input RIFF WAVE audio file to be processed. Only multi-channel audio files are accepted; each channel found is assumed to contain the microphone signal assigned to one speaker. Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states if the processing was successful or not, "downloadLink" specifies the location where the output annotation file can be found, "output" contains the output that is mostly useful during debugging errors and "warning" lists warnings, if any occurred during processing. ---------------------------------------------------------------- ---------------------------------------------------------------- runEMUMagic ------------------ Description: This service reads either single signal file(s) or pair(s) of signal + annotation and creates an emuDB with the best achievable annotation. SIGNAL can be one of wav,nis,nist,sph,mp4,mpeg,mpg; TEXT can be one of txt,par,TextGrid,eaf,csv; Output is always a single _annot.json file with the same base name as SIGNAL or as defined in OUT. OUTFORMAT is accepted by the script but ignored (for compatibility with the Web API: the Web API needs this to allow the assembly of an emuDB). The only options are LANGUAGE and imap. The script calls runPipeline with PIPE=G2P_MAUS_PHO2SYL, if both SIGNAL and TEXT are given, or PIPE=ASR_G2P_MAUS_PHO2SYL, if only SIGNAL is given, or PIPE=[ASR_]G2P_CHUNKER_MAUS_PHO2SYL, if the number of words in TEXT or the ASR result exceeds 3000 words, or PIPE=CHUNKPREP_G2P_MAUS_PHO2SYL, if TEXT is of type TextGrid|EAF|CSV. 
runPipeline is called with the following options (other than defaults): OUTFORMAT=emuDB ASRType=autoSelect diarization=true [InputTierName=InputTierName] This service is restricted to academic use only; therefore this service cannot be called as a RESTful service like other BAS services, and the Web API to this service is protected by AAI Shibboleth authentication. Example curl call is: curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F LANGUAGE=deu-DE -F imap=@ -F TEXT=@ -F emuRDBname=MAUSOUTPUT -F InputTierName=unknown 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runEMUMagic' Parameters: SIGNAL [LANGUAGE] [imap] [TEXT] [emuRDBname] [InputTierName] Parameter description: SIGNAL: Mandatory parameter SIGNAL: mono sound file or video file containing the speech signal to be processed; PCM 16 bit resolution; any sampling rate. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), NIST/SPHERE (nis|nist|sph) and video (mp4|mpeg|mpg) are also accepted. LANGUAGE: [aus-AU, afr-ZA, sqi-AL, arb, eus-ES, eus-FR, cat-ES, nld-NL, nld-NL-GN, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, ekk-EE, fin-FI, fra-FR, kat-GE, deu-AT, deu-CH, deu-DE, deu-DE-OH, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, spa-ES, swe-SE, tha-TH, guf-AU] Language: RFC5646 locale code of the processed speech; defines the phoneme set of input and the orthographic system of input text (if any); we use the RFC5646 sub-structure 'iso639-3 - iso3166-1 [ - iso3166-2]', e.g. 'eng-US' for American English, 'deu-AT-1' for Austrian German spoken in 'Oberoesterreich'; the code 'sampa' ('Language independent') allows the user to upload a customized mapping from orthographic to phonologic form (see option 'imap'). 
Special languages: 'gsw-CH' denotes text written in Swiss German 'Dieth' transcription (https://en.wikipedia.org/wiki/Swiss_German); 'gsw-CH-*' are localized varieties in larger Swiss cities; 'jpn-JP' (Japanese) accepts Kanji or Katakana or a mixture of both, but the tokenized output will contain only the Katakana version of the input; 'aus-AU' (Australian Aboriginal languages, including Kunwinjku, Yolnu Matha) accepts so-called 'Modern Practical Orthography' (https://en.wikipedia.org/wiki/Transcription_of_Australian_Aboriginal_languages); 'fas-IR' (Persian) accepts a romanized version of Farsi developed by Elisa Pellegrino and Hama Asadi (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/PersianRomanizationTable.pdf for details); 'arb' is a macro language covering all Arabic varieties; the input must be encoded in a broad phonetic romanization developed by Jalal Tamimi and colleagues (see http://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/TamimiRomanization.pdf for details). The language code is passed to all services of the pipeline, thus influencing the way these services will process the speech. If one member of the PIPE does not support the language, the service will try to determine another suitable language (WARNING is issued) or, if that is not possible, an ERROR is returned. imap: Customized mapping table from orthography to phonology. If the option 'Language' is set to 'Independent' (undefined), you must provide a G2P mapping table through this option. This mapping table is then used to translate the input text into phonological symbols. See https://www.bas.uni-muenchen.de/Bas/BASWebServices/DOCS/readme_g2p_mappingTable.txt for details about the format of the mapping table. OUTFORMAT: Option OUTFORMAT: the output format: always 'emuDB' (= a *_annot.json file). This parameter is only here to signal to the Web API that there is emuDB output; it cannot be changed via this API. 

TEXT: Optional parameter TEXT: the textual input corresponding to SIGNAL; usually some form of text or transcript. If this option is omitted, the service will apply automatic transcription using the runASR service. The input can be a plain text (txt), a Praat TextGrid (TextGrid), or a BAS Partitur Format (par) file. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF.

emuRDBname: This option sets the database name in the output EMU database.

InputTierName: Only needed if TEXT is in TextGrid|EAF format: the name of the annotation tier that contains the input chunk segmentation.

Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful, "downloadLink" specifies the location where the resulting emuDB annotation file can be found, "output" contains output that is mostly useful when debugging errors, and "warning" lists any warnings that occurred during processing.

----------------------------------------------------------------
----------------------------------------------------------------
runAnonymizer
------------------
Description:
This service reads a signal file (sound, video), a BAS Partitur Format annotation, and a list of terms to be anonymized in both inputs; it masks all occurrences in the signal and in the annotation, and returns the two anonymized files in a ZIP archive, or just the anonymized annotation in a ZIP file if ANNOTONLY=true.
SIGNAL can be one of wav,nis,nist,sph,mp4,mpeg,mpg,avi,fvl or can be omitted (if ANNOTONLY=true); BPF must be a BPF file *.par with at least the ORT tier and one of the MAU,SAP,PHO tiers (see https://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html#Partitur for details regarding the BPF).
By default, the output is a ZIP file containing the masked signal (masked by noise, a beep or silence, see ASIGNAL; sound files keep the same properties as the input, while video input is re-coded into MP4 with h264 and aac encoding) and the input annotation (in the format given by the option OUTFORMAT), with all word label occurrences replaced by the string given in option AWORD (default: 'ANONYMIZED') and all phonetic label occurrences replaced by the string given in APHONE (default: '' for SAMPA, '(.)' for IPA encodings). If ANNOTONLY=true, the output ZIP contains just the anonymized annotation file.
The (required) input list of terms to be anonymized (ATERMS) must be encoded in UTF-8 and have one term per line; terms may contain blanks, in which case only consecutive occurrences of the words within the term are anonymized.

Example curl call is:
curl -v -X POST -H 'content-type: multipart/form-data' -F SIGNAL=@ -F ASIGNAL=brownNoise -F OUTFORMAT=bpf -F rate=1 -F AWORD=ANONYMIZED -F ANNOTONLY=false -F BPF=@ -F APHONE= -F ATERMS=@ 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runAnonymizer'

Parameters: SIGNAL [ASIGNAL] [OUTFORMAT] [rate] [AWORD] [ANNOTONLY] BPF [APHONE] [ATERMS]

Parameter description:
SIGNAL: Optional input SIGNAL: sound or video file containing the speech signal to be anonymized. Although the mimetype of this input file is restricted to RIFF AUDIO audio/x-wav (extension wav), NIST/SPHERE (nis|nist|sph) and video (mp4|mpeg|mpg|avi|fvl) are also accepted.

EQUALNAMES: Option EQUALNAMES: If set to true, the output annotation file BPF has the same basename as the SIGNAL file; this is useful e.g. in a pipe.

ASIGNAL: [brownNoise, beep, silence] Option ASIGNAL: the type of signal used to mask anonymized terms in the signal. 'brownNoise' is brown noise; 'beep' is a 500 Hz sine tone; 'silence' is total silence (zero signal). Masking signals have an amplitude of -10 dB relative to the maximum amplitude and are faded in and out with a very short sinusoidal function.

OUTFORMAT: [bpf, exb, csv, TextGrid, emuDB, eaf, tei, par] Option 'Output format' (OUTFORMAT): the output format of the anonymized input BPF. TextGrid - a Praat-compatible TextGrid file; bpf - the input BPF file with the tiers now anonymized; csv - a spreadsheet (CSV table) that contains most input tiers in table form; emuDB - an Emu-compatible *_annot.json file; eaf - an ELAN-compatible annotation file; exb - an EXMARaLDA-compatible annotation file; tei - ISO TEI document (XML). For a description of BPF see http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html. For a description of Emu see https://github.com/IPS-LMU/emuR.
Note 1: using 'emuDB' will first produce only a single annotation file *_annot.json; in the WebMAUS interface (https://clarin.phonetik.uni-muenchen.de/BASWebServices) you can process more than one file and then download a zipped Emu database; in this case don't forget to change the default name of the emuDB 'MAUSOUTPUT' using the R function emuR::rename_emuDB().
Note 2: if you need the same result in more than one format, select 'bpf' to produce a BPF file, and then convert this file with the service runAnnotConv ('AnnotConv') into the desired formats.
Note 3: some format conversions are not lossless; select 'bpf' to be sure that no information is lost.

rate: [0.0, 999999.0] Option sample rate of signal file: if the sample rate cannot be determined automatically from SIGNAL and is not given in the input BPF either, you can provide the sampling rate via this option. Usually you can leave it at the default value of 1.

AWORD: Option AWORD: the string used to mask word labels for anonymized terms.

ANNOTONLY: [true, false] Option ANNOTONLY: If set to true, only the input BPF is anonymized and returned; the SIGNAL will not be used (and can be omitted). The output ZIP file then contains only the anonymized annotation file.

BPF: Mandatory input BPF: BAS Partitur Format (BPF) file (*.par or *.bpf) that contains at least the ORT tier and one of MAU,PHO,SAP or IPA, which shall be anonymized. See http://www.bas.uni-muenchen.de/forschung/Bas/BasFormatseng.html for a detailed description of the BPF.

APHONE: Option APHONE: the string used to mask phonetic/phonologic labels for anonymized terms. If not set, the service will use the label 'nib' for masking encodings in SAMPA, and the label '(.)' for encodings in IPA. If set to another label, this label is used for masking in all encodings.

ATERMS: Mandatory option ATERMS: file encoded in UTF-8 containing the terms that are to be anonymized by the service. One term per line; terms may contain blanks, in which case only consecutive occurrences of the words within the term are anonymized.

Output: An XML response containing the elements "success", "downloadLink", "output" and "warning". "success" states whether the processing was successful, "downloadLink" specifies the location of the output ZIP file, which contains the anonymized copy of the input signal and the anonymized annotation file (the format of the annotation file depends on the option selected in OUTFORMAT), "output" contains output that is mostly useful when debugging errors, and "warning" lists any warnings that occurred during processing.

----------------------------------------------------------------
----------------------------------------------------------------
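
The XML response described under Output for runAnonymizer (elements "success", "downloadLink", "output" and "warning") could be read on the client side roughly as follows. This is only a minimal sketch: the sample response body, its root element name, the placeholder URL and the helper name parse_bas_response are assumptions made for illustration, as is the convention that "success" carries the literal text 'true' on success; only the four element names come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body: the root element name and the URL are
# placeholders, not taken from the service documentation.
SAMPLE_RESPONSE = """<WebServiceResponseLink>
  <success>true</success>
  <downloadLink>https://example.org/placeholder/result.zip</downloadLink>
  <output></output>
  <warning></warning>
</WebServiceResponseLink>"""


def parse_bas_response(xml_text):
    """Return the four documented response elements as a dict of strings."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in ("success", "downloadLink", "output", "warning"):
        # findtext gives None for a missing element and '' for an empty one;
        # both are normalised to an empty string here.
        result[name] = (root.findtext(name) or "").strip()
    # Assumption: anything other than the text 'true' counts as failure.
    result["ok"] = result["success"].lower() == "true"
    return result


parsed = parse_bas_response(SAMPLE_RESPONSE)
print(parsed["ok"], parsed["downloadLink"])
```

A caller would typically check the "ok" flag first, report "warning" and "output" when it is false, and only then fetch the file at "downloadLink".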