BAS CLARIN Repository FAQ
In a nutshell: What is this repository?
The BAS CLARIN repository contains corpora of spoken language archived in the Bavarian Archive for Speech Signals (BAS) at the University of Munich. All corpora shown here have been validated and fitted with metadata and PIDs according to CLARIN standards. Most resources (those marked 'free for science' (ACA) or 'public' (PUB)) can be downloaded for free by academic users (see the license metadata tag on the landing pages). You can browse all metadata, download complete corpora, download single recording sessions of a corpus, or search across corpora for recording sessions and download the resulting heterogeneous data sets. If you need help, click on the 'help desk' symbol at the top right.
What is 'AAI' and how can I get access to the resources of this repository?
AAI is an international academic authentication scheme based on Shibboleth ('single sign-on'). It allows services like the BAS CLARIN repository to verify the credentials of academic users without managing user accounts themselves. All you need is an official account at your home university, and your university must be a member of one of the Identity Provider Federations of AAI.
Please click on the link 'Login via your institution' above and enter the account name and password of your home university. If you cannot find your home university, please click here to register with CLARIN for an account. In the field 'What kind of expertise can you provide?' fill in the text: 'I am an academic and want to access CLARIN resources via AAI.' After you receive your account (this may take a few days), try again to log in via AAI and choose 'clarin eu website account' as institution. Once you are successfully logged in, click on the corpus you are interested in, scroll down to the 'download' button and follow the instructions there.
If you encounter any problems, click here to send us a message stating the problem.
What are the license classes 'PUB', 'ACA' and 'RES'?
The BAS uses the three standard CLARIN license classes:
- PUB : the resource can be accessed without login and be used for any purpose.
- ACA : the resource can be accessed after an SSO login via your home university (click on the Login button, select your home institution and log in), and may be used for educational/scientific purposes only.
- RES : the resource is restricted by the owner and can only be accessed and used according to an individual permission of the owner(s). Contact email@example.com for inquiries.
While downloading data I get the message 'We are sorry, the maximum download size of 100 GB is exceeded. Please contact the BAS.'
The selected package is too large for download. Go to the search form, select the corpus you are interested in and then try downloading it in smaller portions, e.g. by first selecting only the female speakers and then the male speakers. Alternatively, go to the corpus landing page and download the recording sessions one by one. This way you can download a corpus that is larger than 100 GB in smaller portions.
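If you know the approximate size of each recording session, you can plan the portions in advance. The following Python sketch only illustrates the idea of grouping sessions so that no portion exceeds the 100 GB limit; the session names and sizes are hypothetical, and the repository itself is still operated through the search form or the landing page, not through this code.

```python
def split_into_portions(session_sizes, limit_gb=100.0):
    """Group (name, size_in_gb) pairs, in order, into portions of at most limit_gb."""
    portions = []
    current = []
    current_size = 0.0
    for name, size_gb in session_sizes:
        if size_gb > limit_gb:
            # A single session above the limit cannot be split by this method.
            raise ValueError(f"session {name} alone exceeds {limit_gb} GB")
        if current and current_size + size_gb > limit_gb:
            # Adding this session would exceed the limit: close the portion.
            portions.append(current)
            current = []
            current_size = 0.0
        current.append(name)
        current_size += size_gb
    if current:
        portions.append(current)
    return portions


# Hypothetical session names and sizes (in GB), for illustration only.
example_sessions = [("s1", 60), ("s2", 50), ("s3", 40), ("s4", 30)]
example_portions = split_into_portions(example_sessions)
print(example_portions)
```

The grouping is greedy and keeps the sessions in their given order, so it does not necessarily produce the smallest possible number of portions.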
How can I publish my own speech corpus with BAS CLARIN?
First, check out our External Resources Policy to see whether your data and your legal situation allow a publication in the BAS CLARIN Repository. Then contact firstname.lastname@example.org to get support from the BAS staff. BAS will help you with the documentation, the validation and metadata generation of your resource.
What is meant by 'download as emuDB'?
If you see this download button, the selected corpus set can be downloaded as an emuDB. emuDB is a special speech database format that allows the encoding of arbitrarily complex hierarchical annotation structures. It is the standard file format of the EMU Speech Management Database System (EMU-SMDS) developed by Raphael Winkelmann. Click here for more details about EMU-SMDS.
Databases encoded in EMU allow very powerful queries, such as "Find all word-medial fronted vowels that are contained in accented syllables followed by an unaccented syllable.", as well as powerful analyses on derived speech features.
Can I preview corpora before downloading them?
All corpora that are distributed in emuDB format can be previewed online. For this purpose, we use the EMU webApp, which is the annotation and segmentation tool of the EMU Speech Management Database System. To preview a corpus, you must first log in (see above). Then open the landing page of the corpus you are interested in. If the corpus is available for preview, you will find a link in the 'View online' section.
What is a PID and how can I use it?
Every version of a corpus or session in the BAS CLARIN Repository has a PID (= 'persistent identifier') issued by the 'handle system' (www.handle.net). This is a persistent and unique ID that unambiguously refers to the resource in question (like a DOI). Even if the repository were to move to a different URL, the PIDs would remain the same. In addition, because new versions of our resources get new PIDs, the data used in older publications can still be traced, even after the resource has been updated in the repository.
You can find a resource's handle PID by visiting its landing page and searching for the metadata field PID. Alternatively, you can also find it in the resource's CMDI file. In order to access a resource via its PID, type http://hdl.handle.net/<PID> into your browser, replacing <PID> with the PID of the resource in question. This turns the PID into a handle URL, which is redirected to the desired landing page. You are encouraged (and obliged) to use this handle PID to cite BAS resources (see below).
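If you handle many PIDs, for example in a script that builds a reference list, the conversion described above can be automated. The following Python sketch simply prepends the resolver address to a PID. The function name is hypothetical, and the treatment of a leading 'hdl:' prefix is an assumption about how a PID may be written in some metadata files; the sample PID is the center handle of the BAS CLARIN Repository.

```python
HANDLE_RESOLVER = "http://hdl.handle.net/"


def pid_to_handle_url(pid):
    """Turn a handle PID into a handle URL by prepending the resolver address."""
    pid = pid.strip()
    # Assumption: some files write the PID with a leading 'hdl:' prefix.
    if pid.lower().startswith("hdl:"):
        pid = pid[len("hdl:"):]
    # Leave the value unchanged if it already is a complete handle URL.
    if pid.startswith(HANDLE_RESOLVER):
        return pid
    return HANDLE_RESOLVER + pid


center_url = pid_to_handle_url("11022/1009-0000-0001-231F-6")
print(center_url)
```

Opening the resulting URL in a browser redirects to the landing page of the resource.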
I downloaded a corpus but the structure looks different from the original corpus or different from the description in the documentation.
BAS CLARIN repository corpora have a fixed hierarchical structure consisting of a corpus and the sessions within that corpus. If a speech corpus has a more elaborate hierarchy or is organized in a different way, this structure was 'flattened' to the BAS CLARIN Repository structure. This can lead to mismatches between the original documentation and the downloaded data set.
For example, the 'aGender' corpus in its original form was divided into three data sets TRAIN, DEV and TEST, which in turn were organized by speaker ID, with each speaker's data split into recording sessions. The BAS CLARIN version consists only of sessions, where the session name is a combination of speaker ID and session number.
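The effect of such flattening can be illustrated with a short Python sketch. It is not the procedure actually used by BAS: the speaker IDs, the session numbers and the '_' separator are hypothetical, and the real session names in the repository may be formed differently.

```python
def flatten_sessions(corpus, separator="_"):
    """Map a nested data set / speaker / session structure to flat session names."""
    names = []
    for data_set, speakers in corpus.items():
        # The data set label (e.g. TRAIN) does not appear in the flat name.
        for speaker_id, session_numbers in speakers.items():
            for session_number in session_numbers:
                names.append(f"{speaker_id}{separator}{session_number}")
    return sorted(names)


# Hypothetical original structure: data set -> speaker ID -> session numbers.
original_structure = {
    "TRAIN": {"spk001": [1, 2]},
    "TEST": {"spk002": [1]},
}
flat_names = flatten_sessions(original_structure)
print(flat_names)
```

In this sketch the TRAIN, DEV and TEST level disappears from the session names, which is the kind of mismatch with the original documentation described above.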
How do I cite resources of the BAS repository?
There are two ways of citing a resource found in this repository. If you are citing a full corpus, please check its documentation archive (included in the download); if it contains a publication presenting the corpus, you may cite this publication. Alternatively or additionally, you can use the resource's handle PID (see above) for citation; this has the advantage that the reader also gets a permanently valid URL to the BAS resource.
How do I cite the BAS CLARIN Repository?
When publishing results based on resources that you have downloaded from the BAS CLARIN Repository, please cite us using our center handle: http://hdl.handle.net/11022/1009-0000-0001-231F-6