Validation report for the ASD corpus

Authors	Johanna Cronenberg
Affiliation	Bayerisches Archiv für Sprachsignale (BAS) Institute for Phonetics and Speech Processing University of Munich (LMU)
Postal address	Schellingstr. 3 D-80799 Munich
Email	Johanna.Cronenberg@campus.lmu.de
Telephone	/
Fax	/
Corpus Version	1.1
Date	25.01.2017
Status	Corpus validated, status OK
Comment	/
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003.

Validation results

Summary

The speech corpus Audioatlas Siebenbuergisch-Saechsischer Dialekte (ASD) has been validated against general principles of good practice. The validation covered completeness, formal checks, and manual checks of a subsample.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus ASD conducted in 2016/2017. The speech corpus as published was created in 2016; the recordings themselves were begun in the 1960s. ASD consists of 2264 recordings (approx. 360h) of spoken dialectal German (Saxonian) recorded in Romania and Bavaria in approx. 250 different locations.
The corpus includes the following discourse types:

The German "Wenker" sentences
Fairy tales, song texts, free story telling

As the corpus was recorded by different linguists from the universities of Bukarest, Hermannstadt, and Klausenburg, it comprises different recording strategies.
However, the material was first recorded on analog tape in the 1960s and 70s and digitized later on.
Furthermore, the corpus includes the following features:

Geo reference of all recording sites
Phonetic transcription of "Wenker" sentence recordings
Phonetic transcription of spontaneous speech (in parts)
Orthographic transcription of spontaneous speech recordings (approx. 450,000 words)
Semantic labeling ("ontology")

1. Validation of Documentation

The general documentation directory vdata/BAS/ASD/ contains the following documentation for the ASD corpus:

Directory DATA/
Directory DOC/
Directory GARBAGE/
Directory METADATA/
Directory TABLE/
Directory TEXTGRIDS/

Administrative Information

Validating person	Johanna Cronenberg
Date of validation	25.01.2017
Contact for requests regarding the corpus	Bayerisches Archiv für Sprachsignale (BAS) Institute for Phonetics and Speech Processing University of Munich (LMU) Schellingstr. 3 D-80799 Munich
Number and type of medium	1 folder (`DATA/`), potentially 5 CDs (approx. 360h)
Content of each medium:	Directories `DATA/`, `METADATA/`, `TABLE/`, `TEXTGRIDS/`
Copyright statement and intellectual property rights (IPR)	This CD or DVD contains copyrighted material. Do not distribute without the written consent of the copyright holders: Romanische Philologie, IT-Gruppe Geisteswissenschaften (ITG) Ludwig-Maximilians-Universität München Geschwister-Scholl-Platz 1 D-80539 München See `COPYRIGHT.TXT`

Layout of Media

File or Directory Name	Contents of File or Directory
`COPYRIGHT.TXT`	File containing copyright information
`DATA/`	Directory containing 2264 wav files
`DOC/`	Directory containing a `README` file (documentation in English) and the archive `DOCU.zip`, which contains `COPYRIGHT.TXT` (the same file as in `ASD/`), `DOC/` (directory containing the same `README` file as `DOC/`), and `TABLE/` (the same directory as in `ASD/`)
`GARBAGE/`	Directory containing 3 txt files
`METADATA/`	Directory containing 2264 cmdi files
`README.maintenance`	File containing contact and corpus information, and status of validation
`TABLE/`	Directory containing `PROMPTS.TBL` with the orthographic transcription of 44 "Wenker" sentences
`TEXTGRIDS/`	Directory containing 788 TextGrid files

Basic Information About wav Files

Parameter	Explanation of Parameter (optional)	Value	Acceptance of Value
File nomenclature	Explanation of used codes	No coherent file nomenclature	NOT OK
Settings of recording sessions		No coherent recording settings
Channel		1	OK
Format of signals and annotation files	If non standard formats are used, it is common to give a full description or to convert into a standard format	Audio files: .wav, Annotation files: .TextGrid.utf8.phon.txt or .TextGrid.utf8.orth.txt	OK
Sample Coding		16-bit sample integer PCM	OK
Compression		Not compressed (wav)	OK
Sampling rate		22050 Hz	OK
Valid bits per sample	Others than 8, 16 or 24 bits should be reported	16 bits	OK
Multiplexed signals	Exact de-multiplexing algorithm and tools		n.a.
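
The signal parameters in the table above can be read directly from the wav headers. The following is a minimal sketch using Python's standard `wave` module; it is not the script used for this validation. The function name `check_wav_header` and the `EXPECTED` dictionary are hypothetical, and the expected values are simply copied from the table.

```python
import wave

# Expected values copied from the table above (assumption: they apply to every file).
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sampling_rate_hz": 22050}


def check_wav_header(path):
    """Return {parameter: (actual, expected)} for every header value that deviates."""
    with wave.open(str(path), "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16 bits
            "sampling_rate_hz": wav.getframerate(),
        }
    return {key: (actual[key], want) for key, want in EXPECTED.items() if actual[key] != want}
```

An empty dictionary means the file matches all three expected values; any other result lists the deviating parameters.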

Database Contents

Parameter	Explanation of Parameter (optional)	Value	Acceptance of Value
Clearly stated purpose of the recordings		No information provided	NOT OK
Speech type(s)	Multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.	Read sentences or free story telling	OK
Instruction to speakers in full copy		Not provided

Linguistic Contents of Prompted Speech

Parameter	Value	Acceptance of Value
Specifications of the individual text items	Spontaneous informal speech or elicited speech (reading sentences)	OK
Specification for the prompt sheet design or specification of the design of the speech prompts	/	not applicable (n.a.)
Example prompt sheet or example sound file from the speech prompting	/	n.a.

Linguistic Contents of Non-Prompted Speech

Parameter	Explanation of Parameter (optional)	Value	Acceptance of Value
Multi-party	Number of speakers, topics discussed, type of setting, formal/informal	One (or sometimes more) speaker(s) informally talking in an interview or reading sentences in his/her dialect	OK
Human-human dialogues	Type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios	Interviews (informal chat) about various topics, among them customs, occupation of the speaker, etc.	OK
Human-machine dialogues	Domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz	/	n.a.

Speaker Information

Parameter	Value	Acceptance of Value
Speaker recruitment strategies	The 1805 informants stem from 199 locations in Transylvania and some locations in Bavaria, but only the location of the recording has been encoded in the CMDI metadata. However, in the majority of cases the recording location should also be the place of living.	OK
Number of speakers	1805	OK
Distribution of speakers over sex, age, dialect regions	Sex distribution: 1176 female, 629 male. Age range: 5-93, but 383 of unknown age. All of a Saxonian dialect in Romania (Transylvania) or Bavaria (Wassertal/Oberwischau), 239 of unknown location	Apart from missing entries: OK
Description/definition of dialect regions	Definition by address (city), region, country, and continent	OK

Recording Platform and Recording Conditions

Parameter	Explanation of Parameter (optional)	Value	Acceptance of Value
Recording platform		No information provided
Position and type of microphones		No information provided
Position of speakers	Distance to microphone	No information provided
Bandwidth	Should be reported if other than zero to half of sampling rate	Not provided
Number of channels and channel separation		1 channel per wav file	OK
Acoustical environment		No information provided

Annotation (TextGrid)

Parameter	Explanation of Parameter (optional)	Value	Acceptance of Value
Unambiguous spelling standard used in annotations		Standard orthography, no punctuation	OK
Labeling symbols		Label #1# for speaker, #2# for interviewer, more speakers labeled with #3# and so forth, unintelligible words are marked as {unversta#|aendlich}, other noise is marked as {Gera#|aeusch} (see `README.txt`); however, transcription and orthography conventions are not always consistent (see Manual Validation).	Apart from inconsistencies (see Manual Validation): OK
List of non-standard spellings	Dialectal variation, names etc.	Dialectal words orthographically transcribed in brackets (<...>), umlauts have been transliterated to "a#|ae, o#|oe, u#|ue", German sharp "s" has been transliterated to "s#|sz", proper names (not place names) have been replaced by ###, Romanian, Standard German, and Hungarian words are enclosed by brackets, see `README.txt`	Apart from inconsistencies (see Manual Validation): OK
Distinction of homographs which are no homophones		See phonetic transcriptions in `TEXTGRIDS/`	OK
Character set used in annotations		Standard Latin alphabet for orthographic transcriptions, IPA character set for phonetic transcriptions	OK
Any other language dependent information	Abbreviations, etc.		n.a.
Annotation manual, guidelines, instructions		Provided in `README.txt`	OK
Description of quality assurance procedures		Not provided
Selection of annotators		No information provided
Training of annotators		No information provided
Annotation tools used		Praat	OK
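
The labeling symbols described above lend themselves to a simple pattern check. The sketch below is an illustration only and does not reproduce the validation script: the function name `find_label_errors` and the set `ALLOWED_BRACE_LABELS` are hypothetical, and treating `{?}` as an accepted label is an assumption.

```python
import re

# Assumption: these curly-brace labels are the accepted variants.
ALLOWED_BRACE_LABELS = {
    "{unversta#ndlich}", "{unverstaendlich}",
    "{Gera#usch}", "{Geraeusch}",
    "{?}",
}
SPEAKER_TAG = re.compile(r"#\d+#")  # #1#, #2#, #3#, ...
NAME_PLACEHOLDER = "###"            # replaces proper names


def find_label_errors(text):
    """Return unexpected brace labels and malformed speaker tags, in that order."""
    errors = []
    # Anything written in curly braces must be one of the known labels.
    for marker in re.findall(r"\{[^{}]*\}", text):
        if marker not in ALLOWED_BRACE_LABELS:
            errors.append(marker)
    # Tokens built only from '#' and digits must be a speaker tag or '###'.
    # Words such as "tra#gt" contain letters and are therefore ignored here.
    for token in text.split():
        if "#" in token and re.fullmatch(r"[#\d]+", token):
            if token != NAME_PLACEHOLDER and not SPEAKER_TAG.fullmatch(token):
                errors.append(token)
    return errors


print(find_label_errors("## ja {?unversta#ndlich} #1 gut"))
# ['{?unversta#ndlich}', '##', '#1']
```

Because "#" is also used to transliterate umlauts and the sharp "s", the check deliberately looks only at tokens that consist of "#" and digits alone.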

Lexicon

The resource does not contain a lexicon.

Transcription

Parameter	Value	Acceptance of Value
Text-to-phoneme procedure	No information provided
Explanation or reference to the phoneme set	Narrow IPA transcript, segmented in chunks (intervals), words separated by blank, lexical accent (primary and secondary), see `README.txt`	OK
Phonological or higher order phenomena accounted in the phonemic transcriptions	/	n.a.

Statistical Information

The resource does not contain statistical information.

Other Information

Parameter	Value	Acceptance of Value
Any other essential language-dependent information or convention	/	n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors	No information provided
Status of documentation	Not provided

2. Automatic Validation

Validation Steps with Methodology and Results

Parameter	Method	Result	Acceptance of Result
Completeness of signal files	Script	All expected files are present	OK
Completeness of metadata files	Script	All expected information is present	OK
Completeness of annotation files	Script	All expected files are present	OK
Correctness of file names	Script	All files are wav files; there is no coherent nomenclature to be checked	OK
Empty files	Script	No empty files	OK
Status of signal, annotation and metadata files		Not provided
Signal durations	Script	No information about (average) signal durations provided, all signal durations were larger than 0	OK
Duration cross checks	Script	17 out of 352 matching wav and TextGrid.utf8.phon.txt files showed different durations: see table below	NOT OK
Cross checks of meta information	Script	All audio files are mentioned in `viewclarinsession.csv` and `viewclarinmedia.csv`, all annotation files are mentioned in `viewclarinannot.csv`, column Location.Address in `viewclarinsession.csv` misses 239 entries, column Age in `viewclarinactors.csv` misses 383 entries	Apart from missing entries: OK
Cross checks of summary listings			n.a.
Annotation contents			n.a.
Annotation tier nomenclature	Script	All tiers have the expected names: phon_informant, phon_interviewer, phon_comment, phon_xxx, orth_informant, orth_interviewer, orth_comment, orth_xxx	OK
Annotation texts	Script	170 out of 436 files contain annotation labels other than the expected ones; the script reported 118 erroneous patterns	NOT OK
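
The validation scripts themselves are not part of this report. As an illustration only, completeness and empty-file checks of the kind listed above could be sketched as follows; the function names and the `.cmdi` suffix used in the example are assumptions.

```python
from pathlib import Path


def missing_counterparts(wav_names, other_names, other_suffix):
    """Return the sorted wav stems for which no '<stem><other_suffix>' file exists."""
    others = set(other_names)
    stems = [name[: -len(".wav")] for name in wav_names if name.endswith(".wav")]
    return sorted(stem for stem in stems if stem + other_suffix not in others)


def empty_files(directory):
    """Return the sorted names of zero-byte files directly inside `directory`."""
    return sorted(
        p.name for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Hypothetical listing: one of two wav files has no metadata file.
print(missing_counterparts(["23-02.wav", "932-04.wav"], ["23-02.cmdi"], ".cmdi"))
# ['932-04']
```

Working on file name lists rather than directories keeps the comparison independent of where `DATA/`, `METADATA/`, and `TEXTGRIDS/` are mounted.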

Duration Cross Checks

Filename	Duration TextGrid	Duration wav	Difference
1454c-04	2935	2934	1
648b-09	1318	1317	1
932-04	429	428	1
666-01	99	706	-607
946-02	1140	1139	1
654-06	497	496	1
23-02	183	182	1
1472a-07	98	175	-77
149-05	99	197	-98
1474-09	98	247	-149
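
The rows above fall into two groups: differences of a single unit, and much larger negative differences where the TextGrid is far shorter than the signal. A minimal sketch of that split is given below; the function name `split_by_tolerance` and the tolerance of 1 are assumptions, and the values are copied from the table in the units used there.

```python
# (file name, TextGrid duration, wav duration), copied from the table above.
ROWS = [
    ("1454c-04", 2935, 2934),
    ("648b-09", 1318, 1317),
    ("932-04", 429, 428),
    ("666-01", 99, 706),
    ("946-02", 1140, 1139),
    ("654-06", 497, 496),
    ("23-02", 183, 182),
    ("1472a-07", 98, 175),
    ("149-05", 99, 197),
    ("1474-09", 98, 247),
]


def split_by_tolerance(rows, tolerance=1):
    """Split rows into (minor, major) lists of (name, TextGrid - wav) pairs."""
    minor, major = [], []
    for name, textgrid, wav in rows:
        diff = textgrid - wav
        (minor if abs(diff) <= tolerance else major).append((name, diff))
    return minor, major


minor, major = split_by_tolerance(ROWS)
print(len(minor), major)
# 6 [('666-01', -607), ('1472a-07', -77), ('149-05', -98), ('1474-09', -149)]
```

Under this assumed tolerance, six of the listed files differ by one unit only, while four have a TextGrid that covers just a fraction of the signal.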

3. Manual Validation

For a randomly chosen subsample of the ASD corpus, the matching wav and TextGrid.utf8.orth.txt files were manually checked using Praat.

The subsample consisted of 5% of the orthographically transcribed files, i.e. 22 wav and 22 corresponding TextGrid files (436/100*5 = 21.8, rounded up to 22).
Altogether, the 22 checked files had a duration of 2.56 hours, i.e. 0.7% of the entire speech material (approx. 360h).
The manual validation showed that only 12 out of 22 files were correctly transcribed, although several of these 12 files contained spelling mistakes, missing or superfluous blank characters or were of (very) poor audio quality.
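
The subsample figures above can be reproduced with a few lines of arithmetic. The sketch below is an illustration under assumptions: the function names are hypothetical, and the seeded random draw only shows one way to make such a selection repeatable, since the report does not document how the files were actually chosen.

```python
import math
import random


def subsample_size(n_files, percent):
    """Number of files in a `percent` % subsample, rounded up to whole files."""
    return math.ceil(n_files * percent / 100)


def duration_share_percent(checked_hours, total_hours):
    """Share of the checked material in the total duration, in percent."""
    return 100 * checked_hours / total_hours


def draw_subsample(file_names, percent, seed=0):
    """Draw a reproducible random subsample (the seed value is arbitrary)."""
    k = subsample_size(len(file_names), percent)
    return sorted(random.Random(seed).sample(list(file_names), k))


print(subsample_size(436, 5))                        # 22
print(round(duration_share_percent(2.56, 360), 1))   # 0.7
```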

In summary, the manual validation showed flaws in the orthographic transcriptions concerning the following points:

Point of Criticism	Expected Value	Examples: [interval] "Erroneous Value" - "Correct Value"
Orthography	Consistent spelling without spelling mistakes	1414.TextGrid.utf8.orth.txt: [15] "eien" - "einen" [19] "Eltrn" - "Eltern" 1330.TextGrid.utf8.orth.txt: [6] "klleines" - "kleines" [14] "Klasen" - "Klassen" [27] "Waser" - "Wasser" [28] "maht" - "macht"
Inconsistencies in Spelling Conventions: German sharp "s"	Ideally transcribed consistently, but has been transcribed as "s#|sz"	1169a-02.TextGrid.utf8.orth.txt: [9] "blosz" N_11.TextGrid.utf8.orth.txt: [1] "Ma#"
Inconsistencies in Spelling Conventions: German Umlauts	Ideally transcribed consistently, but have been transcribed as "a# u# o#|ae ue oe"	1169a-02.TextGrid.utf8.orth.txt: [8] "verwuesteter" [6] "aermer" [7] "Hoefchen" N_11.TextGrid.utf8.orth.txt: [6] "Goldstu#cke" [5] "tra#gt" No example of "o#" found in the 22 manually checked TextGrid files
Inconsistencies in Spelling Conventions: Unintelligible Words	Ideally transcribed consistently, but have been transcribed as {?} | {unversta#|aendlich}	N_11.TextGrid.utf8.orth.txt: [1, 2, 6, 12, 20, 22] "{?unversta#ndlich}" - "{unversta#ndlich}" 1169a-02.TextGrid.utf8.orth.txt: [56] "?" - "{?}"
Turn-Taking Tags	One speaker per interval would be ideal; speaker as "#1#", interviewer as "#2#", more speakers as "#3#" and so forth	928b-02.TextGrid.utf8.orth.txt: [20] "#1# #1#" - "#1#" Mixed turn-taking in 10 out of 154 intervals: [61, 84, 89, 92, 94, 98, 99, 112, 117, 146] 1169a-02.TextGrid.utf8.orth.txt: [47] "##" - "#1#" [65] "#1" - "#1#" 1454c-04.TextGrid.utf8.orth.txt: [181] "#1#" - "#2#" [241] "#2#" - "#1#"
Inconsistencies in Transcription Conventions: Dialectal Words	One dialectal word transcribed in brackets (<...>) right after the original word, more than one dialectal word in between "<...</>"	1454c-04.TextGrid.utf8.orth.txt: [2] "Hanklichbacken" - "German original word<Hanklichbacken>"
Inconsistencies in Transcription Conventions: Standard German, Hungarian & Romanian Words	German words in between the tags "<<d>...</d>>" Hungarian words in between "<u>...</u>", and Romanian words in between "<r>...</r>"	Standard German Words: 1169a-02.TextGrid.utf8.orth.txt: [2] "<<d>je nachdem</d>>" 1454c-04.TextGrid.utf8.orth.txt: [58] "<d>Herzlich Willkommem zu unserem Hochzeitsfest</d>" Romanian Words: 1169a-02.TextGrid.utf8.orth.txt: [46] "<<r>Presedinte</r>>" - "<r>Presedinte</r>" 1169a-02.TextGrid.utf8.orth.txt: [48] "sine Lisaweta" - "<r>sine Lisaweta</r>" Hungarian Words: 33-12.TextGrid.utf8.orth.txt: [1] "<u>Veszprem</u>"

4. Other Relevant Observations

None.

5. Comments for Improvement

Complete georeference, sex, and age information in metadata files if possible
Revise orthographic transcriptions and unify notation
Transcribe turn-taking more accurately (one speaker per interval)
Complete orthographic and phonetic transcriptions for all wav files
Unify file nomenclature
Mention purpose of the recordings (or collection of recordings as corpus)
Clarify recording conditions (platform, position and types of microphones, position of speakers, acoustical environment) if possible
Clarify information about annotators (selection, training) and quality assurance procedures if possible
Investigate non-matching durations of wav and TextGrid files (see table above)
Investigate errors in annotation texts (labeling symbols) (see table above)

6. Results

The `README.txt` file has been updated: spelling mistakes were corrected and the annotation conventions were completed with all variants found. The updated corpus is OK.