Automating word segmentation

One of the more important components of the project is the ability to calculate statistics based on the time when each word was spoken. To achieve this, we need to align the transcription to the audio and denote precisely where each word occurs in the signal. We have chosen to complete this process in several steps:

  1. acquiring the audio
  2. orthographic transcription and endpointing
  3. automatic segmentation
  4. final verification

These steps will be discussed in more detail below. One thing to note first is that we need to do this for both interpreting directions (Polish into English and English into Polish in our case), which makes the process a bit more challenging than usual.

Audio acquisition

The recordings of the European Parliament are available in several formats and we chose to always use MP4. A single recording contains many streams, for example:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'VODUnit_20191125_17062500_17065000_-784f729016e9e67027f2ca.mp4':
  Metadata:
    major_brand     : 3gs6
    minor_version   : 1
    compatible_brands: isom3gs6
    creation_time   : 2019-11-24T16:39:46.000000Z
    copyright       : 
    copyright-eng   : 
  Duration: 00:00:25.31, start: 0.000000, bitrate: 1633 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 512x288 [SAR 1:1 DAR 16:9], 580 kb/s, 25 fps, 25 tbr, 1k tbn, 20000k tbc (default)
    Metadata:
      creation_time   : 2019-11-24T16:39:46.000000Z
      handler_name    : GPAC ISO Video Handler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 31 kb/s (default)
    Metadata:
      creation_time   : 2019-11-24T16:39:46.000000Z
      handler_name    : GPAC ISO Audio Handler
    Stream #0:2(deu): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 31 kb/s (default)
    Metadata:
      creation_time   : 2019-11-24T16:39:46.000000Z
      handler_name    : GPAC ISO Audio Handler
    Stream #0:3(eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 32 kb/s (default)
    Metadata:
      creation_time   : 2019-11-24T16:39:46.000000Z
      handler_name    : GPAC ISO Audio Handler
    Stream #0:4(fra): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 31 kb/s (default)
    Metadata:
      creation_time   : 2019-11-24T16:39:46.000000Z
      handler_name    : GPAC ISO Audio Handler
(...)

First comes the video stream, followed by all the (hopefully synchronized) audio streams in various languages (marked with three-letter codes in brackets). One of the major points of our study is the ability to compare the time at which each person spoke a particular word, so having all the streams stored together is a real upside!

To extract a stream with a particular language (e.g. Polish), we use the ffmpeg program:

ffmpeg -i input.mp4 -map 0:m:language:pol -ar 16k -ac 1 output.wav

We also automatically convert the audio to a mono PCM Wave file sampled at 16 kHz, as this is our preferred format for analyzing speech using the programs below.
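
Since we need to do this for every recording and for at least two language tracks, it is convenient to wrap the extraction in a small script. Below is a minimal sketch of how this could look in Python; the directory layout and the list of language codes are placeholders, not a description of our actual pipeline.

# Minimal sketch: extract selected language tracks from each MP4 as 16 kHz mono WAV.
# The 'recordings' directory and the language list are illustrative placeholders.
import subprocess
from pathlib import Path

LANGUAGES = ['pol', 'eng']  # three-letter codes of the tracks we care about

for mp4 in Path('recordings').glob('*.mp4'):
    for lang in LANGUAGES:
        wav = mp4.with_name(f'{mp4.stem}_{lang}.wav')
        subprocess.run([
            'ffmpeg', '-y', '-i', str(mp4),
            '-map', f'0:m:language:{lang}',  # pick the audio stream tagged with this language
            '-ar', '16k', '-ac', '1',        # resample to 16 kHz mono
            str(wav),
        ], check=True)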

Transcription

We have several sources of transcription for the European Parliament files. We can use the verbatim reports, i.e. transcripts available on the website, but these are often edited and corrected to be easier to read and as such do not constitute a fully verbatim reflection of the speech. For example:

Panie przewodniczący! Chciałem zapytać pana o ocenę odbytego niespełna dwa tygodnie temu szczytu UE-Ukraina. Przy okazji chciałbym poruszyć dwie kwestie. Będąc w Kijowie, dowiedziałem się, że strona europejska odmówiła wpisania do końcowej deklaracji sformułowania o europejskiej tożsamości Ukrainy. Nie ukrywam, że bardzo mnie to dziwi, ponieważ wydaje mi się, że kwestia europejskiej tożsamości Ukrainy nie powinna być sporna. Po drugie, dowiedziałem się w Kijowie, że nasza europejska delegacja odmówiła złożenia wieńców na pomniku ofiar głodu ukraińskiego, co jest dyplomatycznym zwyczajem wszystkich odwiedzających Ukrainę i nie ukrywam, że te dwa fakty bardzo mnie dziwią.

Another option is to use automatic speech recognition (ASR) systems. These come much closer to a verbatim rendering of the speech, but unfortunately their accuracy is lower. For example (using Google Cloud Speech, through the WebMaus service [1]):

wiszący pan Michał Tomasz Kamiński grupa konserwatystów i reformatorów Bardzo proszę o niespełna 2 tygodnie temu Unia Europejska Ukraina chciałem zapytać pana Ocenę tego szczytu I przy okazji poruszyć dwie kwestie Otóż będąc w Kijowie dowiedziałem się ze strony Europejska odmówiła wpisania do końcowej deklaracji o Europejskiej tożsamości Ukrainy i nie ukrywam że bardzo mnie to dziwi chociaż Wydaje mi się że Europejska tożsamość Ukrainy Jest czymś co nie powinno być sporne a po drugie dowiedziałem się w Kijowie Europejska delegacja odmówiła złożenia wieńców na pomniku ofiar głodu ukraińskiego co jest w pewnym już dyplomatycznym zwyczajem wszystkich odwiedzających Ukrainy i te dwa fakty bardzo mnie dziwią Myślę że to było bardzo owocne spotkanie ze stroną ukraińską muszę przyznać że nasi przyjaciele

A final option is to transcribe the speech from scratch. The choice between using the available transcripts, automatic transcripts or working from scratch should be made on a case-by-case basis. In our case, for the original Polish speeches we used the already completed transcripts, and for the English interpretations we used ASR (rather than the written translations of the transcripts available from the EP website). Regardless of the method, the text had to be verified to exactly match the audio and to include all redundancies and false starts, since these will be important in analysing the interpreters’ output. The final corrected transcript looks something like this:

Panie przewodniczący, chciałem za~ zapytać pana o odbyty niespełna +yy+ dwa tygodnie temu szczyt Unii Europejska-Ukraina. Chciałem zapytać pana o ocenę tego szczytu i przy okazji poruszyć dwie kwestie. Otóż będąc w Kijowie, dowiedziałem się, że strona europejska odmówiła wpisania do końcowej deklaracji sformułowania o europejskiej tożsamości Ukrainy i nie ukrywam, że bardzo mnie to dziwi, ponieważ wydaje mi się, że europejska tożsamość Ukrainy jest czymś, co nie powinno być sporne, a po drugie dowiedziałem się w Kijowie, że +e+ nasza europejska delegacja odmówiła złożenia wieńców na pomniku ofiar głodu ukraińskiego, co jest pewnym już dyplomatycznym zwyczajem wszystkich odwiedzających Ukrainę i nie ukrywam, te dwa fakty bardzo mnie dziwią.

To streamline this process, we set up a simple Corrector website, a Python-based webapp I created a while ago specifically for this purpose [2]. It consists of a rich audio player (based on wavesurfer.js and controllable from the keyboard) and a simple text field. The text field can be pre-populated with an existing transcription or left empty if the transcription is done from scratch. Everything is backed by a central database and many people can use the website at the same time. Each modification is logged to allow for simple monitoring and management of the transcription process.

Endpointing

An important step in the manual verification is the so-called endpointing – marking where the transcribed portion starts and ends within the audio file. Unfortunately, each audio recording begins with a portion of untranslated or untranscribed speech, such as the EP President giving the floor to the MEP whose speech is of primary interest in a given audio file. We therefore allow the user to mark areas for deletion, so that the tools below see only the audio that exactly matches the transcription.
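
Conceptually, the effect of these marks is simply that the alignment tools work on a trimmed version of the audio. As a rough illustration (not our actual implementation), cutting a recording down to the transcribed region could be done like this, assuming the kept region is given as start and end times in seconds:

# Illustrative sketch only: keep just the endpointed region of a recording.
# The start/end values would come from the endpointing marks; file names are made up.
import subprocess

def trim(in_wav: str, out_wav: str, start: float, end: float) -> None:
    """Cut the audio down to the portion that matches the transcription."""
    subprocess.run([
        'ffmpeg', '-y', '-i', in_wav,
        '-ss', str(start), '-to', str(end),  # keep only the [start, end] interval
        '-c', 'copy',                        # PCM WAV can be cut without re-encoding
        out_wav,
    ], check=True)

trim('PL0001_pl_full.wav', 'PL0001_pl.wav', 4.2, 151.8)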

Segmentation

Segmentation is achieved using an automatic alignment algorithm (commonly known as forced alignment). Speech alignment is a problem routinely solved by speech recognition software. Usually, speech recognition relies on stochastic modeling to predict unconstrained word sequences from a large vocabulary, but this can easily be modified to match a known sequence of words to the underlying audio. The reason this is such a common feature of ASR software is that it is a necessary step in training such systems in the first place. Also, since the word sequence is known, the problem is much less prone to errors than unconstrained full speech recognition. Still, errors do occur due to noisy audio, interrupted or inaccurate pronunciation, and mistakes in the transcription; these can cause the system to produce wildly incorrect results or even fail altogether – if the system cannot find a valid alignment within the chosen time limit, it gives up and throws an error.

For our project we used a fairly simple GMM-based acoustic model within the Kaldi toolkit. For English, we used the free Librispeech models, and for Polish we used the models trained on the Polish Parliament data that we created ourselves. These models are also available for free [3].
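
We won’t describe the Kaldi setup in detail here, but to give an idea of the overall shape of the process: in a standard Kaldi recipe layout, forced alignment amounts to preparing a data directory with the audio and transcriptions and calling the alignment scripts against an existing acoustic model. The sketch below is an assumption about such a typical recipe (wrapped in Python only for consistency with the rest of our tooling), not a listing of our actual scripts; all directory names are placeholders.

# Rough sketch of driving Kaldi forced alignment; assumes a standard egs-style recipe
# directory with steps/, utils/ and path.sh, a prepared data directory (wav.scp, text,
# utt2spk, spk2utt) and a trained GMM model in exp/tri3; all names are placeholders.
import subprocess

RECIPE = 'kaldi-recipe'  # hypothetical path to the recipe directory

def run(cmd: str) -> None:
    # path.sh puts the Kaldi binaries on PATH, as in any standard recipe
    subprocess.run(f'. ./path.sh && {cmd}', shell=True, check=True, cwd=RECIPE)

# Align the known transcriptions to the audio using the existing GMM model.
run('steps/align_si.sh data/pinc data/lang exp/tri3 exp/tri3_ali_pinc')

# Turn the alignments into a word-level CTM file (exp/tri3_ali_pinc/ctm), where each
# line reads: <utterance-id> <channel> <start-seconds> <duration-seconds> <word>
run('steps/get_train_ctm.sh data/pinc data/lang exp/tri3_ali_pinc')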

Verification

The final step is to verify the quality of the segmentation. Usually, speech segmentation is edited using desktop applications like Praat or ELAN, but these can be inconvenient to work with (they require opening many files, one by one) and are decentralized. Instead, we decided to use EMU-webApp [4] due to its many advantages over the tools mentioned above:

  • no setup or installation required (it works from the browser),
  • portable – works on any computer with an internet connection and behaves consistently everywhere,
  • easy to use and has great documentation,
  • centralizes the data to one location – no need to copy files between participants,
  • results in a well organized database in EMU format.

The last point brings many advantages, like being able to easily analyze the database with just a few lines of R code, but that’s a story for another blog post. For now, the tool is a great way to manage the segmentation process between several people and we highly recommend it to anyone doing similar research.

One final note for this tool concerns the choice of storage for the files presented in the interface. There are generally three options: you can load files locally from your computer, you can use a special websocket server developed specifically for this purpose, or you can use Gitlab. We chose Gitlab for several reasons. First of all, it’s a free online storage service that is also safe and reliable. It logs the full history of changes and allows undoing any damage caused by a mistake. It’s easy to add people to the project or remove them. Finally, we set up a private repository on gitlab.com. Once we’re finished fixing the database, we can publish it with a single click for anyone to use.

EMU database setup

In order to get the above system working, a couple of things need to be established and implemented. The EMU SDMS [5] uses a collection of JSON files to store all the information within the corpora. These files can be generated using the standard R library provided by the project, but we rolled out our own implementation in Python. This final section will describe how our database is designed and show some examples of the files we had to generate.

The most important file in the database is the main configuration file – called pinc_DBconfig.json:

{
    "name": "pinc", 
    "UUID": "ebc1e9ca-edb4-4a50-a2a7-7cc4e2590250", 
    "mediafileExtension": "wav", 
    "levelDefinitions": [
        {
            "name": "Word", 
            "type": "SEGMENT", 
            "attributeDefinitions": [
                {
                    "name": "Word", 
                    "type": "STRING"
                }
            ]
        } 
    ],
    "ssffTrackDefinitions": [
    ],
    "linkDefinitions": [
    ], 
    "EMUwebAppConfig": {
        "perspectives": [
            {
                "name": "default", 
                "signalCanvases": {
                    "order": [
                        "OSCI", 
                        "SPEC"
                    ], 
                    "assign": [], 
                    "contourLims": []
                }, 
                "levelCanvases": {
                    "order": [
                        "Word" 
                    ]
                }, 
                "twoDimCanvases": {
                    "order": []
                }
            } 
        ], 
        "restrictions": {
            "showPerspectivesSidebar": false,
	    "bundleComments": true,
	    "bundleFinishedEditing": true

        }, 
        "activeButtons": {
            "saveBundle": true, 
            "showHierarchy": false
        }
    }
}

This file describes both the database structure and some options for the EMU-webApp program used for editing. The database contains a single level of annotation for now: word-level segmentation. Each word segment is described only by a string. SSFF tracks are used to include information like pitch, formants and energy, but we don’t use any of them here. Likewise, since we have only one level of annotation, there are no links describing the hierarchy of the database.

Following that is the EMU-webApp configuration. We define one default perspective that shows the oscillogram (i.e. amplitude) and spectrogram canvases, followed by the word-level annotation and no two-dimensional canvases. We remove the button for choosing different perspectives (since there is only one) and enable the comments and "finished editing" checkboxes on the bundle list. Finally, we enable the save button for bundles and disable the show-hierarchy button (since there is no hierarchy to show). This file is stored directly in the database directory, which should be called "pinc_emuDB" (although our project on Gitlab has a different name and that doesn’t seem to be a problem).

Next, the directory contains a number of name_ses directories representing the various sessions (i.e. groups of files), and inside each one there are several name_bndl directories, each representing a single audio file with its annotation. Inside such a directory there is a single WAV file and a JSON annotation file with the same name as the bundle, e.g. name.wav and name_annot.json. The annotation file looks like this:

{
    "sampleRate": 16000.0,
    "annotates": "PL0001_pl.wav",
    "name": "PL0001_pl",
    "levels": [
        {
            "name": "Word",
            "type": "SEGMENT",
            "items": [
                {
                    "id": 0,
                    "sampleStart": 93920,
                    "sampleDur": 3200,
                    "labels": [
                        {
                            "name": "Word",
                            "value": "Panie"
                        }
                    ]
                },
                {
                    "id": 1,
                    "sampleStart": 97120,
                    "sampleDur": 11200,
                    "labels": [
                        {
                            "name": "Word",
                            "value": "przewodnicz\u0105cy,"
                        }
                    ]
                },
/* MANY LINES SKIPPED! */
                {
                    "id": 108,
                    "sampleStart": 790880,
                    "sampleDur": 4639,
                    "labels": [
                        {
                            "name": "Word",
                            "value": "dziwi\u0105."
                        }
                    ]
                }
            ]
        }
    ],
    "links": []
}

This file basically matches what we said in the DB configuration. Apart from naming the file and the audio it relates to, we define our word level as an array of word segments described by their start, duration and label. One thing worth noting is that the id attribute has to be unique within the whole file (i.e. segments in different levels all have to have different ids). These ids are used to create links in the hierarchy (among other things); since we don’t have any, our links array is empty. As a caveat, this file is obviously generated automatically from the output of our automatic speech segmentation software. That software (based on the Kaldi toolkit) generates a CTM file (a simple CSV-like text file) and we use a simple Python script to convert it to the format described above.
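
Our actual converter handles a few more details, but the core of the CTM-to-EMU conversion is simple enough to sketch here. The function below is illustrative rather than our production code; the sample rate and file naming follow the examples above, while the function name and paths are made up.

# Simplified sketch: convert a word-level CTM file (utterance, channel, start, duration,
# word; times in seconds) into an EMU _annot.json file like the one shown above.
import json

SAMPLE_RATE = 16000

def ctm_to_annot(ctm_path: str, bundle_name: str, annot_path: str) -> None:
    items = []
    with open(ctm_path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            utt, channel, start, dur, word = line.split()[:5]
            items.append({
                'id': i,  # must stay unique across all levels of the file
                'sampleStart': int(float(start) * SAMPLE_RATE),
                'sampleDur': int(float(dur) * SAMPLE_RATE),
                'labels': [{'name': 'Word', 'value': word}],
            })
    annot = {
        'sampleRate': float(SAMPLE_RATE),
        'annotates': bundle_name + '.wav',
        'name': bundle_name,
        'levels': [{'name': 'Word', 'type': 'SEGMENT', 'items': items}],
        'links': [],
    }
    with open(annot_path, 'w', encoding='utf-8') as f:
        json.dump(annot, f, indent=4)

# e.g. ctm_to_annot('PL0001_pl.ctm', 'PL0001_pl', 'PL0001_pl_annot.json')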

One final file is the so-called bundle list. In the root database directory, there is a sub-directory called bundleLists. It contains any number of name_bundleList.json files with content similar to this:

[
    {
        "session": "PINC_p1",
        "name": "PL0001_pl",
        "comment": "",
        "finishedEditing": false
    },
    {
        "session": "PINC_p1",
        "name": "PL0002_pl",
        "comment": "",
        "finishedEditing": false
    },
    {
        "session": "PINC_p1",
        "name": "PL0003_pl",
        "comment": "",
        "finishedEditing": false
    },
/* MANY MORE ITEMS */
]

This list simply defines a subset of files that can be assigned to a person to work on. The name of the list is arbitrary and is specified within the URL described below. The list also contains the comments and "finished editing" flags mentioned in the configuration of the EMU-webApp above.
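
Generating such lists by hand would be tedious, so this is another place where a few lines of Python help. The sketch below, for example, would build one list covering every bundle in a given session; the function and list names are hypothetical, while the directory layout follows the description above.

# Sketch: build a bundle list covering every bundle in one session of the database.
import json
from pathlib import Path

def make_bundle_list(db_dir: str, session: str, list_name: str) -> None:
    session_dir = Path(db_dir) / f'{session}_ses'
    bundles = sorted(p.name[:-len('_bndl')] for p in session_dir.glob('*_bndl'))
    entries = [{'session': session, 'name': b, 'comment': '', 'finishedEditing': False}
               for b in bundles]
    out = Path(db_dir) / 'bundleLists' / f'{list_name}_bundleList.json'
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(entries, indent=4), encoding='utf-8')

# e.g. make_bundle_list('pinc_emuDB', 'PINC_p1', 'annotator1')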

The final bit of information is related to the storage of the database. As is, the database can be shared as a directory and copied between computers by any means. However, as we mentioned at the beginning, we chose Gitlab as the method of storing the database online. The exact description of how to set this up is given in chapter 23 of the manual [5]. To summarize: after setting up the project, getting its ID, setting up user accounts and getting their access tokens, all we needed to do was set up a URL to access the data, like this:

https://ips-lmu.github.io/EMU-webApp/?autoConnect=true&comMode=GITLAB&gitlabURL=https:%2F%2Fgitlab.com&projectID=<PROJECT_ID>&emuDBname=<DB_NAME>&bundleListName=<BUNDLE_LIST_NAME>&privateToken=<USER_PRIVATE_TOKEN>

We can modify the BUNDLE_LIST_NAME and USER_PRIVATE_TOKEN for each user and share them using chat or email. The users don’t need to install anything or set up anything special – it just works!
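
Since only the bundle list name and the private token differ between users, these links can also be generated automatically, for example like this (the project ID, database name and tokens below are placeholders):

# Sketch: generate one EMU-webApp link per annotator; the tokens here are made up.
from urllib.parse import urlencode

BASE = 'https://ips-lmu.github.io/EMU-webApp/'

def make_url(project_id: str, db_name: str, bundle_list: str, token: str) -> str:
    params = {
        'autoConnect': 'true',
        'comMode': 'GITLAB',
        'gitlabURL': 'https://gitlab.com',
        'projectID': project_id,
        'emuDBname': db_name,
        'bundleListName': bundle_list,
        'privateToken': token,
    }
    return BASE + '?' + urlencode(params)

users = {'annotator1': 'TOKEN_1', 'annotator2': 'TOKEN_2'}  # hypothetical tokens
for name, token in users.items():
    print(name, make_url('<PROJECT_ID>', 'pinc', name, token))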

Links

[1] – https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/ASR (requires registration)

[2] – http://github.com/danijel3/Corrector

[3] – https://hub.docker.com/repository/docker/danijel3/clarin-pl-speechmodels (look under the „sejm” tag)

[4] – https://ips-lmu.github.io/EMU-webApp/

[5] – https://ips-lmu.github.io/The-EMU-SDMS-Manual/
