Validation Report for the ASD Corpus

Authors: Johanna Cronenberg
Affiliation: Bayerisches Archiv für Sprachsignale (BAS)
  Institute for Phonetics and Speech Processing
  University of Munich (LMU)
Postal address: Schellingstr. 3
  D-80799 Munich
Email: Johanna.Cronenberg@campus.lmu.de
Telephone: /
Fax: /
Corpus Version: 1.1
Date: 25.01.2017
Status: Corpus validated, status OK
Comment: /
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003.
Can be found here

Validation results

Summary

The speech corpus Audioatlas Siebenbuergisch-Saechsischer Dialekte (ASD) has been validated against general principles of good practice. The validation covered completeness, formal checks, and manual checks of a subsample.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus ASD conducted in 2016/2017. The speech corpus as published was created in 2016; the recordings were begun in the 1960s. ASD consists of 2264 recordings (approx. 360h) of spoken dialectal German (Saxonian) recorded in Romania and Bavaria in approx. 250 different locations.
The corpus includes the following discourse types: read sentences and free storytelling in interviews. Because the corpus was recorded by different linguists from the universities of Bukarest, Hermannstadt, and Klausenburg, it comprises different recording strategies.
All material was first recorded on analog tape in the 1960s and 70s and digitized later on.
Furthermore, the corpus includes the following features:

1. Validation of Documentation

The general documentation directory vdata/BAS/ASD/ contains the following documentation for the ASD corpus:
  1. Directory DATA/
  2. Directory DOC/
  3. Directory GARBAGE/
  4. Directory METADATA/
  5. Directory TABLE/
  6. Directory TEXTGRIDS/

Administrative Information

Validating person: Johanna Cronenberg
Date of validation: 25.01.2017
Contact for requests regarding the corpus: Bayerisches Archiv für Sprachsignale (BAS)
  Institute for Phonetics and Speech Processing
  University of Munich (LMU)
  Schellingstr. 3
  D-80799 Munich
Number and type of medium: 1 folder (DATA/), potentially 5 CDs (approx. 360h)
Content of each medium: Directories DATA/, METADATA/, TABLE/, TEXTGRIDS/
Copyright statement and intellectual property rights (IPR): This CD or DVD contains copyrighted material. Do not distribute without the written consent of the copyright holders:
  Romanische Philologie, IT-Gruppe Geisteswissenschaften (ITG)
  Ludwig-Maximilians-Universität München
  Geschwister-Scholl-Platz 1
  D-80539 München
  See COPYRIGHT.TXT

Layout of Media

File or Directory Name | Contents of File or Directory
COPYRIGHT.TXT | File containing copyright information
DATA/ | Directory containing 2264 wav files
DOC/ | Directory containing README file (documentation in English) and the archive DOCU.zip:
  • COPYRIGHT.TXT: the same file as in ASD/
  • DOC/: directory containing the same README file as DOC/
  • TABLE/: the same directory as in ASD/
GARBAGE/ | Directory containing 3 txt files
METADATA/ | Directory containing 2264 cmdi files
README.maintenance | File containing contact and corpus information, and status of validation
TABLE/ | Directory containing PROMPTS.TBL with the orthographic transcription of 44 "Wenker" sentences
TEXTGRIDS/ | Directory containing 788 TextGrid files
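
The file counts in the layout above lend themselves to a scripted check. The following is a minimal sketch and not the script used for this validation. The function name inventory, the file extension .cmdi, and the rule that the recording ID is the part of a TextGrid file name before ".TextGrid" are assumptions.

```python
from pathlib import Path


def inventory(corpus_root):
    """Count files per directory; list empty files and TextGrids without a wav."""
    root = Path(corpus_root)
    wavs = sorted((root / "DATA").glob("*.wav"))
    # Assumption: metadata files carry the extension ".cmdi".
    cmdis = sorted((root / "METADATA").glob("*.cmdi"))
    grids = sorted((root / "TEXTGRIDS").glob("*.TextGrid*"))

    wav_ids = {p.stem for p in wavs}
    # Assumption: "1414.TextGrid.utf8.orth.txt" belongs to "1414.wav".
    unmatched = [g.name for g in grids
                 if g.name.split(".TextGrid")[0] not in wav_ids]
    empty = [p.name for p in wavs + cmdis + grids if p.stat().st_size == 0]

    return {
        "wav": len(wavs),
        "cmdi": len(cmdis),
        "textgrid": len(grids),
        "unmatched_textgrids": unmatched,
        "empty_files": empty,
    }
```

For the corpus as described here, the expected counts would be 2264 wav files, 2264 cmdi files, and 788 TextGrid files.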

Basic Information About wav Files

Parameter | Explanation of Parameter (optional) | Value | Acceptance of Value
File nomenclature | Explanation of used codes | No coherent file nomenclature | NOT OK
Settings of recording sessions | | No coherent recording settings |
Channel | | 1 | OK
Format of signals and annotation files | If non standard formats are used, it is common to give a full description or to convert into a standard format | Audio files: .wav, Annotation files: .TextGrid.utf8.phon.txt or .TextGrid.utf8.orth.txt | OK
Sample Coding | | 16-bit sample integer PCM | OK
Compression | | Not compressed (wav) | OK
Sampling rate | | 22050 Hz | OK
Valid bits per sample | Others than 8, 16 or 24 bits should be reported | 16 bits | OK
Multiplexed signals | Exact de-multiplexing algorithm and tools | n.a. |
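
The header values in the table above (1 channel, 16-bit PCM, 22050 Hz, no compression) can be verified per file with Python's standard wave module. This is a sketch under the assumption that the files are plain RIFF/WAV; the function name check_wav_header and its defaults are illustrative and not part of the validation tooling.

```python
import wave


def check_wav_header(path, rate=22050, sample_width_bytes=2, channels=1):
    """Return a list of deviations from the expected wav header values."""
    problems = []
    # wave.open raises wave.Error for files that are not uncompressed PCM wav.
    with wave.open(str(path), "rb") as w:
        if w.getnchannels() != channels:
            problems.append(f"channels: {w.getnchannels()} (expected {channels})")
        if w.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width: {w.getsampwidth()} bytes "
                f"(expected {sample_width_bytes})"
            )
        if w.getframerate() != rate:
            problems.append(f"sampling rate: {w.getframerate()} Hz (expected {rate})")
        if w.getcomptype() != "NONE":
            problems.append(f"compression: {w.getcomptype()}")
    return problems
```

An empty list means the file matches all expected values.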

Database Contents

Parameter | Explanation of Parameter (optional) | Value | Acceptance of Value
Clearly stated purpose of the recordings | | No information provided | NOT OK
Speech type(s) | Multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc. | Read sentences or free story telling | OK
Instruction to speakers in full copy | | Not provided |

Linguistic Contents of Prompted Speech

Parameter | Value | Acceptance of Value
Specifications of the individual text items | Spontaneous informal speech or elicited speech (reading sentences) | OK
Specification for the prompt sheet design or specification of the design of the speech prompts | / | not applicable (n.a.)
Example prompt sheet or example sound file from the speech prompting | / | n.a.

Linguistic Contents of Non-Prompted Speech

Parameter | Explanation of Parameter (optional) | Value | Acceptance of Value
Multi-party | Number of speakers, topics discussed, type of setting, formal/informal | One (or sometimes more) speaker(s) informally talking in an interview or reading sentences in his/her dialect | OK
Human-human dialogues | Type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios | Interviews (informal chat) about various topics, among them customs, occupation of the speaker, etc. | OK
Human-machine dialogues | Domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz | / | n.a.

Speaker Information

Parameter | Value | Acceptance of Value
Speaker recruitment strategies | The 1805 informants stem from 199 locations in Transylvania and some locations in Bavaria, but only the location of the recording has been encoded in the CMDI metadata. However, in the majority of cases the recording location should also be the place of living. | OK
Number of speakers | 1805 | OK
Distribution of speakers over sex, age, dialect regions | Sex distribution: 1176 female, 629 male. Age range: 5-93, but 383 of unknown age. All of a Saxonian dialect in Romania (Transylvania) or Bavaria (Wassertal/Oberwischau), 239 of unknown location | Apart from missing entries: OK
Description/definition of dialect regions | Definition by address (city), region, country, and continent | OK
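
The numbers of speakers with unknown age or location correspond to empty cells in the metadata tables that the automatic validation names (column Age in viewclarinactors.csv, column Location.Address in viewclarinsession.csv). Counting such cells could look like the sketch below. The comma delimiter, the UTF-8 encoding, and the function name count_missing are assumptions; the actual CSV layout is not documented in this report.

```python
import csv


def count_missing(csv_path, column, delimiter=","):
    """Count rows whose value in `column` is empty or whitespace only."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        if column not in (reader.fieldnames or []):
            raise KeyError(column)
        return sum(1 for row in reader if not (row.get(column) or "").strip())
```

A call such as count_missing("viewclarinactors.csv", "Age") would then be compared against the 383 missing age entries reported here.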

Recording Platform and Recording Conditions

Parameter | Explanation of Parameter (optional) | Value | Acceptance of Value
Recording platform | | No information provided |
Position and type of microphones | | No information provided |
Position of speakers | Distance to microphone | No information provided |
Bandwidth | Should be reported if other than zero to half of sampling rate | Not provided |
Number of channels and channel separation | | 1 channel per wav file | OK
Acoustical environment | | No information provided |

Annotation (TextGrid)

Parameter | Explanation of Parameter (optional) | Value | Acceptance of Value
Unambiguous spelling standard used in annotations | | Standard orthography, no punctuation | OK
Labeling symbols | | Label #1# for speaker, #2# for interviewer, more speakers labeled with #3# and so forth; unintelligible words are marked as {unversta#ndlich} or {unverstaendlich}, other noise is marked as {Gera#usch} or {Geraeusch} (see README.txt). However, transcription and orthography conventions are not always consistent (see Manual Validation). | Apart from inconsistencies (see Manual Validation): OK
List of non-standard spellings | Dialectal variation, names etc. | Dialectal words orthographically transcribed in brackets (<...>); umlauts have been transliterated to "a#", "o#", "u#" or "ae", "oe", "ue"; German sharp "s" has been transliterated to "s#" or "sz"; proper names (not place names) have been replaced by ###; Romanian, Standard German, and Hungarian words are enclosed by brackets, see README.txt | Apart from inconsistencies (see Manual Validation): OK
Distinction of homographs which are not homophones | | See phonetic transcriptions in TEXTGRIDS/ | OK
Character set used in annotations | | Standard Latin alphabet for orthographic transcriptions, IPA character set for phonetic transcriptions | OK
Any other language dependent information | Abbreviations, etc. | n.a. |
Annotation manual, guidelines, instructions | | Provided in README.txt | OK
Description of quality assurance procedures | | Not provided |
Selection of annotators | | No information provided |
Training of annotators | | No information provided |
Annotation tools used | | Praat | OK
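
The labeling symbols described above can be checked mechanically. The sketch below flags speaker tags that do not have the form #N# and curly-bracket tags that are not among the variants named in this report. It assumes that tags are separated from the surrounding words by whitespace, which the report does not state; the function name find_label_problems and the set KNOWN_CURLY are illustrative.

```python
import re

SPEAKER_TAG = re.compile(r"#\d+#")
# Variants named in this report; the README may list further ones.
KNOWN_CURLY = {
    "{unversta#ndlich}", "{unverstaendlich}",
    "{Gera#usch}", "{Geraeusch}", "{?}",
}


def find_label_problems(text):
    """Return a list of suspicious labels found in one interval text."""
    problems = []
    for token in text.split():
        # "###" replaces proper names; "a#", "s#" etc. never start a token
        # with "#", so they are not mistaken for speaker tags.
        if (token.startswith("#") and token != "###"
                and not SPEAKER_TAG.fullmatch(token)):
            problems.append(f"malformed speaker tag: {token}")
    for tag in re.findall(r"\{[^{}]*\}", text):
        if tag not in KNOWN_CURLY:
            problems.append(f"unknown curly tag: {tag}")
    return problems
```

Applied to an interval such as "#1 das ist {?unversta#ndlich}", the function reports both the incomplete speaker tag and the non-standard curly tag.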

Lexicon

The resource does not contain a lexicon.

Transcription

Parameter | Value | Acceptance of Value
Text-to-phoneme procedure | No information provided |
Explanation or reference to the phoneme set | Narrow IPA transcript, segmented in chunks (intervals), words separated by blank, lexical accent (primary and secondary), see README.txt | OK
Phonological or higher order phenomena accounted in the phonemic transcriptions | / | n.a.

Statistical Information

The resource does not contain statistical information.

Other Information

Parameter | Value | Acceptance of Value
Any other essential language-dependent information or convention | / | n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors | No information provided |
Status of documentation | Not provided |

2. Automatic Validation

Validation Steps with Methodology and Results

Parameter | Method | Result | Acceptance of Result
Completeness of signal files | Script | All expected files are present | OK
Completeness of metadata files | Script | All expected information is present | OK
Completeness of annotation files | Script | All expected files are present | OK
Correctness of file names | Script | All files are wav files, there is no coherent nomenclature to be checked | OK
Empty files | Script | No empty files | OK
Status of signal, annotation and metadata files | | Not provided |
Signal durations | Script | No information about (average) signal durations provided, all signal durations were larger than 0 | OK
Duration cross checks | Script | 17 out of 352 matching wav and TextGrid.utf8.phon.txt files showed different durations: see table below | NOT OK
Cross checks of meta information | Script | All audio files are mentioned in viewclarinsession.csv and viewclarinmedia.csv, all annotation files are mentioned in viewclarinannot.csv; column Location.Address in viewclarinsession.csv lacks 239 entries, column Age in viewclarinactors.csv lacks 383 entries | Apart from missing entries: OK
Cross checks of summary listings | | n.a. |
Annotation contents | | n.a. |
Annotation tier nomenclature | Script | All tiers have the expected names: phon_informant, phon_interviewer, phon_comment, phon_xxx, orth_informant, orth_interviewer, orth_comment, orth_xxx | OK
Annotation texts | Script | 170 out of 436 files contain annotation labels other than the expected ones; the script reported 118 erroneous patterns | NOT OK

Duration Cross Checks

Filename | Duration TextGrid | Duration wav | Difference
1454c-04 | 2935 | 2934 | 1
648b-09 | 1318 | 1317 | 1
932-04 | 429 | 428 | 1
666-01 | 99 | 706 | -607
946-02 | 1140 | 1139 | 1
654-06 | 497 | 496 | 1
23-02 | 183 | 182 | 1
1472a-07 | 98 | 175 | -77
149-05 | 99 | 197 | -98
1474-09 | 98 | 247 | -149
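
A duration cross check of this kind can be sketched as follows. The code is not the script used for this validation: it assumes TextGrid files in Praat's long text format, where the total duration appears on the first line of the form "xmax = ...", and all function names and the tolerance parameter are illustrative.

```python
import re
import wave


def wav_duration(path):
    """Duration of a wav file in seconds (frames divided by sampling rate)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def textgrid_duration(path):
    """Read the first 'xmax = ...' line of a long-format Praat TextGrid."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r"\s*xmax\s*=\s*([0-9.]+)", line)
            if m:
                return float(m.group(1))
    raise ValueError(f"no xmax found in {path}")


def duration_mismatch(textgrid_dur, wav_dur, tolerance=0.0):
    """True if the two durations differ by more than `tolerance`."""
    return abs(textgrid_dur - wav_dur) > tolerance
```

With the values listed in the table, a tolerance of 1 would leave only the four files with large negative differences (666-01, 1472a-07, 149-05, 1474-09), in which the TextGrid is much shorter than the signal.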

3. Manual Validation

For a randomly chosen subsample of the ASD corpus, the matching wav and TextGrid.utf8.orth.txt files were manually checked using Praat.
In summary, the manual validation showed flaws in the orthographic transcriptions concerning the following points:
Point of Criticism | Expected Value | Examples: [interval] "Erroneous Value" - "Correct Value"
Orthography | Consistent spelling without spelling mistakes | 1414.TextGrid.utf8.orth.txt:
  • [15] "eien" - "einen"
  • [19] "Eltrn" - "Eltern"

1330.TextGrid.utf8.orth.txt:
  • [6] "klleines" - "kleines"
  • [14] "Klasen" - "Klassen"
  • [27] "Waser" - "Wasser"
  • [28] "maht" - "macht"
Inconsistencies in Spelling Conventions: German sharp "s" | Ideally transcribed consistently, but has been transcribed as "s#" or "sz" | 1169a-02.TextGrid.utf8.orth.txt:
  • [9] "blosz"
N_11.TextGrid.utf8.orth.txt:
  • [1] "Ma#"
Inconsistencies in Spelling Conventions: German Umlauts | Ideally transcribed consistently, but have been transcribed as "a#", "u#", "o#" or "ae", "ue", "oe" | 1169a-02.TextGrid.utf8.orth.txt:
  • [8] "verwuesteter"
  • [6] "aermer"
  • [7] "Hoefchen"
N_11.TextGrid.utf8.orth.txt:
  • [6] "Goldstu#cke"
  • [5] "tra#gt"
  • No example of "o#" found in the 22 manually checked TextGrid files
Inconsistencies in Spelling Conventions: Unintelligible Words | Ideally transcribed consistently, but have been transcribed as {?}, {unversta#ndlich}, or {unverstaendlich} | N_11.TextGrid.utf8.orth.txt:
  • [1, 2, 6, 12, 20, 22] "{?unversta#ndlich}" - "{unversta#ndlich}"
1169a-02.TextGrid.utf8.orth.txt:
  • [56] "?" - "{?}"
Turn-Taking Tags | One speaker per interval would be ideal; speaker as "#1#", interviewer as "#2#", more speakers as "#3#" and so forth | 928b-02.TextGrid.utf8.orth.txt:
  • [20] "#1# #1#" - "#1#"
  • Mixed turn-taking in 10 out of 154 intervals: [61, 84, 89, 92, 94, 98, 99, 112, 117, 146]
1169a-02.TextGrid.utf8.orth.txt:
  • [47] "##" - "#1#"
  • [65] "#1" - "#1#"
1454c-04.TextGrid.utf8.orth.txt:
  • [181] "#1#" - "#2#"
  • [241] "#2#" - "#1#"
Inconsistencies in Transcription Conventions: Dialectal Words | One dialectal word transcribed in brackets (<...>) right after the original word, more than one dialectal word in between "<...</>" | 1454c-04.TextGrid.utf8.orth.txt:
  • [2] "Hanklichbacken" - "German original word<Hanklichbacken>"
Inconsistencies in Transcription Conventions: Standard German, Hungarian & Romanian Words | German words in between the tags "<<d>...</d>>", Hungarian words in between "<u>...</u>", and Romanian words in between "<r>...</r>" | Standard German Words:
  • 1169a-02.TextGrid.utf8.orth.txt: [2] "<<d>je nachdem</d>>"
  • 1454c-04.TextGrid.utf8.orth.txt: [58] "<d>Herzlich Willkommem zu unserem Hochzeitsfest</d>"
Romanian Words:
  • 1169a-02.TextGrid.utf8.orth.txt: [46] "<<r>Presedinte</r>>" - "<r>Presedinte</r>"
  • 1169a-02.TextGrid.utf8.orth.txt: [48] "sine Lisaweta" - "<r>sine Lisaweta</r>"
Hungarian Words:
  • 33-12.TextGrid.utf8.orth.txt: [1] "<u>Veszprem</u>"
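
The language tag conventions in the last row can also be checked by script. The sketch below follows the expected values as stated in this table (German in "<<d>...</d>>", Hungarian in "<u>...</u>", Romanian in "<r>...</r>") and flags tags whose outer brackets deviate. It cannot detect untagged foreign words such as interval [48] above; the function name find_language_tag_problems is illustrative.

```python
import re


def find_language_tag_problems(text):
    """Flag language tags whose bracketing deviates from the stated convention."""
    problems = []
    # German: <d>...</d> is expected inside one extra pair of angle brackets.
    for m in re.finditer(r"(<?)<d>.*?</d>(>?)", text):
        if m.group(1) != "<" or m.group(2) != ">":
            problems.append(f"German tag without outer brackets: {m.group(0)}")
    # Hungarian (u) and Romanian (r): no extra outer brackets expected.
    for m in re.finditer(r"(<?)<([ur])>.*?</\2>(>?)", text):
        if m.group(1) or m.group(3):
            problems.append(f"unexpected outer brackets: {m.group(0)}")
    return problems
```

For the examples above, "<<r>Presedinte</r>>" and "<d>Herzlich Willkommem zu unserem Hochzeitsfest</d>" would be flagged, while "<<d>je nachdem</d>>" and "<u>Veszprem</u>" would pass.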

4. Other Relevant Observations

None.

5. Comments for Improvement

6. Results

The README.txt file has been updated: spelling mistakes were corrected, and the annotation conventions were completed with all variants found in the corpus. The updated corpus is OK.