Counting basic statistics and pauses using EMU and R

This post shows how to use EMU to compute some simple statistics on a corpus. Note from the start that this is not the most time-efficient approach to this particular problem, but it is fairly easy to set up and uses EMU as intended. It is also not the best use case for demonstrating the real power of EMU, since it only computes statistics on the segments themselves and doesn't correlate them with any low-level features. Nevertheless, it is something the project needed, and it should demonstrate some basics of using EMU and R.

Preparing the database

One change we had to make to the corpus, following our annotation guidelines, was to properly annotate silent and filled pauses. Based on the literature, we defined a silent pause as any segment of non-speech longer than 250 ms. Filled pauses are pauses that contain any of the annotated filler phrases, such as +yy+ or +um+. Silent pauses were annotated using the <sil> token, and filled pauses using <fil>.

Next, we loaded the database using the following command in R:

library(emuR)

db <- load_emuDB('<path-to>/pinc')
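
As a quick check that everything loaded correctly, emuR's summary() prints an overview of the database, including its annotation levels, link definitions and defined tracks:

# overview of sessions, bundles, levels and SSFF tracks
summary(db)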

Listing files

The goal of this exercise was to generate a list of stats for each utterance for further processing. EMU uses the term "bundle" to describe a recording of a single utterance, because each bundle consists of several files: an audio file, an annotation file, and several files containing things like track data (formants, pitch, etc.). To load a list of these bundles into a variable, we use the following command:

files <- list_bundles(db)$name

File stats

This section lists all the stats computed for each file. Each stat uses a query where the bundle pattern is the name of the file from the list above.

Speech duration

The first stat is the duration of speech. Since the audio files usually contain extra speech from other speakers at the start and end of the file, simply taking the length of the audio file would be a mistake. Instead, we subtract the start time of the first spoken segment (obtained with the head command) from the end time of the last spoken segment (obtained with the tail command).

The query command uses a regular expression (as denoted by the "~" symbol) to match any words in the utterance that are not pauses. Since all the pause tokens use the angle brackets < and >, we simply match the words that don't contain them:

# all word segments that are not bracketed pause tokens
segs <- query(db, 'Word!~<.*>', bundlePattern = name)
# speech duration: end of the last segment minus start of the first
dur <- tail(segs, n = 1)$end - head(segs, n = 1)$start

Word count

We can use the same list of segments as above to calculate the number of words:

num_words <- nrow(segs)

Syllables

To calculate the syllables, we use the excellent sylly library. It already contains hyphenation rules for several of the most popular languages, so for English it works out of the box:

library(sylly)
library(sylly.en)

# count the syllables in a word using the bundled English patterns
syl_fun_en <- function(word) {
    hyphen(word, hyph.pattern = "en", quiet = TRUE)@desc$num.syll
}
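
A quick check on a single word (the count comes from the hyphenation patterns, so unusual words may be off):

syl_fun_en("example")   # expected: 3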

Hyphenation is used to automatically insert hyphens in long words that reach the end of a line, so these rules are generally ubiquitous in editing and typesetting programs. The Polish set of hyphenation rules can be downloaded from the Internet, as shown below. After that, it can be used just like the English one.

url.pl.pattern <- url("http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-pl.pat.txt")

hyph.pl <- read.hyph.pat(url.pl.pattern, lang = "pl")
close(url.pl.pattern)

# count the syllables in a word using the downloaded Polish patterns
syl_fun_pl <- function(word) {
    hyphen(word, hyph.pattern = hyph.pl, quiet = TRUE)@desc$num.syll
}

To get the number of syllables for the whole text, we need to apply the above functions to each word separately and then sum the results:

num_syls <- sum(unlist(lapply(segs$labels, syl_fun_en)))

Silence count and duration

Finding silent pauses is just as easy. We count their number and sum their durations.

segs <- query(db, 'Word==<sil>', bundlePattern = name)
num_sil <- nrow(segs)
dur_sil <- sum(segs$end - segs$start)

Likewise, for filled pauses:

segs <- query(db, 'Word==<fil>', bundlePattern = name)
num_fil <- nrow(segs)
dur_fil <- sum(segs$end - segs$start)

Applying to database

To compute all the stats for all the utterances, we need to wrap the calculations above in a single function. We return the values as a list of individual stats:

stats_fun <- function(name) {
    # all of the stats
    
    return(list(name, dur, num_words, num_syls, num_sil, dur_sil, num_fil, dur_fil))
}
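
Filling in the body from the snippets above, the whole function might look something like this (a sketch; handling of utterances with no words or no pauses is omitted):

stats_fun <- function(name) {
    # words: all segments that are not bracketed pause tokens
    segs <- query(db, 'Word!~<.*>', bundlePattern = name)
    dur <- tail(segs, n = 1)$end - head(segs, n = 1)$start
    num_words <- nrow(segs)
    num_syls <- sum(unlist(lapply(segs$labels, syl_fun_en)))

    # silent pauses: count and total duration
    sil <- query(db, 'Word==<sil>', bundlePattern = name)
    num_sil <- nrow(sil)
    dur_sil <- sum(sil$end - sil$start)

    # filled pauses: count and total duration
    fil <- query(db, 'Word==<fil>', bundlePattern = name)
    num_fil <- nrow(fil)
    dur_fil <- sum(fil$end - fil$start)

    return(list(name, dur, num_words, num_syls, num_sil, dur_sil, num_fil, dur_fil))
}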

Next, we used the pblapply function to apply this function to each utterance. It behaves like the regular lapply, but the pbapply library adds a progress bar so the run can be monitored:

library(pbapply)

stats_list <- pblapply(files, stats_fun)

The above command takes a while to compute (about 50 minutes in our case), so again, this isn't the most time-efficient approach, but it's easy to prepare and to modify if necessary.

After the command completes, we receive a list of individual stats. We convert this list into a matrix and then into a data frame. Next, we set the column names:

stats_df <- as.data.frame(t(matrix(unlist(stats_list), nrow = 8)))
names(stats_df) <- c('name', 'dur', 'num_words', 'num_syls', 'num_sil', 'dur_sil', 'num_fil', 'dur_fil')
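
One caveat: unlist() coerces the mixed list (a character name plus numeric stats) to character, so every column of stats_df ends up as text. The numeric columns need converting back before any arithmetic:

# all columns are character after unlist(); restore the numeric ones
stats_df[-1] <- lapply(stats_df[-1], as.numeric)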

And finally, write the results into an Excel file:

library(writexl)

write_xlsx(stats_df, 'out.xlsx')

All the above simple stats will later be used to compute other measures, such as speech rate, articulation rate and text compression rate. These measures are calculated from simple formulas in the Excel file.
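
For reference, speech rate is commonly defined as syllables per second of total speech time, and articulation rate as syllables per second with pause time excluded. Assuming segment times in milliseconds (as emuR reports them), the same formulas could be written in R as:

# speech rate: syllables per second, pauses included
stats_df$speech_rate <- stats_df$num_syls / (stats_df$dur / 1000)

# articulation rate: syllables per second, pause time excluded
stats_df$artic_rate <- stats_df$num_syls /
    ((stats_df$dur - stats_df$dur_sil - stats_df$dur_fil) / 1000)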
