BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@bas.uni-muenchen.de

COPYRIGHT University of Munich 2010, 2016. All rights reserved.
This corpus and software may not be disseminated further - not even
partly - without the written permission of the copyright holders.

----------------------------------------------------------------------
ASD - Audioatlas Siebenbuergisch-Saechsischer Dialekte
Version 1.1 - 2016-09-02
----------------------------------------------------------------------

Documentation of the speech corpus 'ASD'

------------------- Contents of this dir ------------------------------

README : Documentation (this file)

------------------- Contents of this file ------------------------------

General Information
File naming
Speaker recruitment / Geo information
Speaker prompting
Recording conditions
Annotation
Metadata
Online Portal
Funding
History

------------------------ General Information --------------------------

ASD - Audioatlas Siebenbuergisch-Saechsischer Dialekte

ASD consists of a set of 2264 historical recordings (approx. 360h) of
spoken dialectal German (Transylvanian Saxon) recorded in Transylvania
(Romania) in approx. 250 different locations. This previously
unpublished material was collected on analog tape in the 1960s and 70s
by various linguists based at the universities of Bucharest,
Hermannstadt, and Klausenburg. The tapes were later digitized and, in
2009, with the kind support of Prof. Dr. Stefan Sienerth, director of
the 'Institut für deutsche Kultur und Geschichte Südosteuropas (IKGS)',
transferred to the University of Munich (LMU).
The corpus comprises different recording strategies and discourse
types: on the one hand the classic German 'Wenker' sentences, on the
other hand fairy tales, song texts and free story-telling. The corpus
thus provides not only historical linguistic data but also input for
ethnographic and historical disciplines. The age of the informants
varies over a wide range (5-93 years), which adds a further dimension
that is reflected in the metadata of the corpus.

Further corpus features:
   geo reference of all recording sites;
   phonetic transcription of 'Wenker' sentence recordings;
   orthographic transcription of spontaneous speech recordings
   (approx. 450,000 running words);
   (partly) phonetic transcription of spontaneous speech;
   semantic labelling ('ontology');
   extension to middle Bavarian recordings from the area
   'Wassertal/Oberwischau'.

In 2016, the present corpus version was created at the BAS CLARIN
center of the University of Munich (LMU) for indefinite archiving and
distribution.

This CD or DVD contains copyrighted material. Do not distribute
without the written consent of the copyright holders:

   Romanische Philologie,
   IT-Gruppe Geisteswissenschaften (ITG)
   Ludwig-Maximilian University of Munich
   Geschwister-Scholl-Platz 1
   D-80539 München
   Germany

------------------------ File naming ----------------------------------

There is no systematic naming structure; recordings are named by
arbitrary but unique alphanumeric sequences assigned by the various
collectors.

Sound files:      .wav                     RIFF WAVE mono 22050Hz, 16bit
Annotation files: .TextGrid.utf8.orth.txt  Praat compatible TextGrid
                  .TextGrid.utf8.phon.txt  Praat compatible TextGrid

Number of recordings:               2264
Number of annotated recordings:      716
Number of orthographic annotations:  436
Number of phonetic annotations:      352

Some recordings have both an orthographic and a phonetic
transcription; some have neither.
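
The naming scheme and signal format above can be checked with a short
script. The following Python sketch is only an illustration, not part
of the corpus distribution: it assumes that the recording name directly
precedes the listed annotation suffixes, and the function names
(annotation_paths, is_documented_format, check_recording) are
hypothetical. The directories are passed in by the caller.

```python
import wave
from pathlib import Path

# Suffixes as listed in this README. That the recording name directly
# precedes them is an assumption of this sketch.
ANNOTATION_SUFFIXES = {
    "orth": ".TextGrid.utf8.orth.txt",
    "phon": ".TextGrid.utf8.phon.txt",
}


def annotation_paths(recording_name, textgrid_dir):
    """Return the candidate annotation paths for one recording name."""
    base = Path(textgrid_dir)
    return {kind: base / (recording_name + suffix)
            for kind, suffix in ANNOTATION_SUFFIXES.items()}


def is_documented_format(channels, sample_width_bytes, sample_rate):
    """True if the parameters match mono, 16 bit (2 bytes), 22050 Hz."""
    return (channels, sample_width_bytes, sample_rate) == (1, 2, 22050)


def check_recording(wav_path, textgrid_dir):
    """Report format conformance and which annotation files exist."""
    wav_path = Path(wav_path)
    with wave.open(str(wav_path), "rb") as wav:
        format_ok = is_documented_format(
            wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
    candidates = annotation_paths(wav_path.stem, textgrid_dir)
    found = {kind: path.is_file() for kind, path in candidates.items()}
    return {"format_ok": format_ok, **found}
```

Since some recordings have both annotation types and some have
neither, check_recording reports each type separately instead of
treating a missing file as an error.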
------------------------ Speaker recruitment --------------------------

The 1805 informants stem from 199 locations in Transylvania (2191
recordings) and a minority from Bavaria, Wassertal/Oberwischau (74
recordings). Only the location of the recording has been encoded in
the CMDI metadata; in the majority of cases, however, the recording
location should also be the informant's place of residence.

The oldest recording is from 1963, the most recent from 1983.

Speaker age ranges from 5 to 93; for the 383 speakers whose age is
unknown, a '0' is given in the CMDI metadata.

CMDI metadata can be found in METADATA/.

------------------------ Speaker prompting --------------------------

Each informant was, if possible, interviewed in a spontaneous,
informal talk and in a directed talk (often in the same recording).
The latter included the 44 Wenker sentences, which the informant was
asked to read in her/his dialect. The list of the Wenker sentences is
available under

http://www.asd.gwi.uni-muenchen.de/index.php?wenkersaetze=true

A copy of the list is located in TABLE/PROMPTS.TBL

------------------------ Recording conditions ------------------------

Signal files are in RIFF WAVE format; encoding is PCM, 16 bits per
sample, sample rate 22050Hz, little endian.

------------------------ Annotation ----------------------------------

The software Praat was used for transcription. Praat TextGrids are
stored in TEXTGRIDS/ and named by the recording
.TextGrid.utf8.orth|phon.txt

1. Phonetic Transcripts *.phon.*, UTF-8

Tier "" (e.g. "Mary"): Contains narrow IPA transcript, segmented in
   chunks (intervals), words separated by blank, lexical accent
   (primary and secondary);
   Wenker Sentences:
   The start of read Wenker sentences is marked by a short segment
   with label '{WS Anfang}'; the last one with trailing '{WS Anfang}'
   (not consistent, sometimes missing);
   before each Wenker sentence either the speaker utters the number,
   which is labelled '{WS NN} ' (NN = Wenker sentence number), or, if
   the speaker does not utter the number, simply a short segment with
   label '{WS NN}';
   the Wenker sentence is then transcribed in one segment starting
   with '##'.

Tier "", e.g. "John": Only the parts of the interviewer in narrow IPA

Tier "Kommentare": Comments of the transcriber, e.g.
   '{unverstae|a#ndlich}', '{Gerae', e.g. 'gehabt', of two words by
   leading '<...>' and trailing '', e.g. 'wie du';
   Proper names replaced by '###' (but no place names!)
   Romanian words are enclosed by '', e.g. 'nare nimica'
   German words are enclosed by '<, e.g. '<je nachdem>'
   Hungarian words are enclosed by '', e.g. 'Veszprem'
   German Umlauts are written as 'ae ue oe' or 'a# u# o#',
   German sharp 's' is written as 's#' or 'sz';
   no punctuation.

If both TextGrid types are present, they are not synchronized, i.e.
segments do not match.

------------------------- Metadata -----------------------------------

Metadata of the recording sessions and speakers according to SpeechDat
conventions are available in the CMDI files in METADATA/

------------------------ Online Portal --------------------------------

The online portal to the ASD corpus can be found at

http://www.asd.gwi.uni-muenchen.de/

------------------------ Funding --------------------------------------

Parts of the corpus curation and the development of the web portal
were funded by the Beauftragter der Bundesregierung für Kultur und
Medien based on a resolution of the German Federal Parliament
(2009-2014).
This particularly included:
   Web-based interface to recordings and transcription;
   simple search function, geo reference of all recording sites;
   phonetic transcription of 'Wenker' sentence recordings;
   orthographic transcription of spontaneous speech recordings
   (approx. 450,000 running words);
   (partly) phonetic transcription of spontaneous speech;
   semantic labelling ('ontology');
   extension to middle Bavarian recordings from the area
   'Wassertal/Oberwischau'.

------------------------ History --------------------------------------

2016-08-18 Version 1.0 : BAS CLARIN edition
2016-09-02 Version 1.1 : Changed level (tier) names in TextGrids:
   Informant:   *.phon.txt : 'Mary, satz, Satz,...'         -> 'phon_informant'
   Informant:   *.orth.txt : 'Mary, satz, Satz,...'         -> 'orth_informant'
   Interviewer: *.phon.txt : 'John'                         -> 'phon_interviewer'
   Interviewer: *.orth.txt : 'John'                         -> 'orth_interviewer'
   Comments:    *.phon.txt : 'Kommentar, Anmerkungen, ...'  -> 'phon_comment'
   Comments:    *.orth.txt : 'Kommentar, Anmerkungen, ...'  -> 'orth_comment'
   Unused tier: *.phon.txt : 'bell'                         -> 'phon_xxx'
   Unused tier: *.orth.txt : 'bell'                         -> 'orth_xxx'
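
The version 1.1 tier names follow a regular pattern: the annotation
type taken from the file name (phon or orth), an underscore, and the
role of the tier. The following Python sketch derives the expected tier
name from a file name. It is an illustration only; the helper names
(annotation_kind, tier_name) and the set ROLES are hypothetical and not
part of the corpus or of Praat.

```python
# Role parts of the version 1.1 tier names as listed in the History
# section; 'xxx' is the unused tier.
ROLES = {"informant", "interviewer", "comment", "xxx"}


def annotation_kind(filename):
    """Return 'phon' or 'orth' for a TextGrid file name, else None."""
    for kind in ("phon", "orth"):
        if filename.endswith(".TextGrid.utf8." + kind + ".txt"):
            return kind
    return None


def tier_name(filename, role):
    """Build the version 1.1 tier name, e.g. 'phon_informant'."""
    kind = annotation_kind(filename)
    if kind is None:
        raise ValueError("not an ASD TextGrid file name: " + filename)
    if role not in ROLES:
        raise ValueError("unknown role: " + role)
    return kind + "_" + role
```

For a file ending in .TextGrid.utf8.phon.txt and the role 'informant',
tier_name returns 'phon_informant'; names that do not match either
annotation suffix are rejected rather than guessed.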