Adding linguistic annotation to EMU using the spaCy toolkit

EMU-based corpora are often annotated on multiple levels. Each word can carry orthographic and phonetic annotation, but linguistic annotation is rarely found in practice. Adding linguistic information to the corpus makes it possible to test hypotheses such as "what happens with a word that has a particular part of speech (POS)" or "what happens with pairs of words linked by a particular dependency relation".

This is all the more surprising given that several easy-to-use NLP toolkits are available these days. For this example we will use the excellent spaCy toolkit, but there are alternatives such as NLTK, CoreNLP and OpenNLP.

To annotate the EMU corpus, we take the orthographic transcription of the whole recording and let spaCy annotate the whole text at once. In our case, we made a conscious decision to keep punctuation and capitalization in the corpus. That is, each word is annotated as a separate segment, and any punctuation is attached to the word that precedes it (we also decided to replace all hyphens with colons to make things easier).
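As a minimal sketch of this step, the word labels can be joined into a single text while remembering where each word sits in it. The function name `build_transcription` and the sample labels are made up for illustration, and the sketch assumes that the hyphen-to-colon replacement is applied while the text is being assembled:

```python
def build_transcription(word_labels):
    """Join word labels into one string and record each word's character span."""
    pieces, spans, pos = [], [], 0
    for label in word_labels:
        text = label.replace("-", ":")  # hyphen-to-colon replacement described above
        spans.append((pos, pos + len(text)))  # (start, end) offsets in the joined text
        pieces.append(text)
        pos += len(text) + 1  # +1 for the single space between words
    return " ".join(pieces), spans


# Hypothetical word labels, with punctuation attached to the preceding word.
words = ["Hello,", "well-known", "world."]
trans, spans = build_transcription(words)
print(trans)
print(spans)
```

Keeping the character spans around makes it possible to map whatever the NLP toolkit returns back onto the original word segments later.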

The first step is to load a spaCy model:

import spacy

nlp = spacy.load('en_core_web_md')

A list of models for different languages can be found on the spaCy website. Each model has to be downloaded and installed before use; the command for installing a model is given on that page.

The available features differ from language to language, but we are mostly interested in the following:

  • part-of-speech (POS) – this tells us the word category, e.g. noun, verb, adjective, etc.
  • lemma – this is the canonical form of the word, before any inflection is applied to it
  • sentence dependency – this tells us the syntactic function of the word, things like sentence subject, object, etc.

To get these features, all we need to do is feed the transcription through the object created above and then we can iterate through all the words:

doc = nlp(trans)
for word in doc:
    print(f'{word.orth_} {word.lemma_} {word.pos_} {word.dep_}')
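One thing to keep in mind is that spaCy usually returns punctuation as separate tokens, so the tokens do not map one-to-one onto word segments that have punctuation attached. A possible way to line them up, sketched here under my own assumptions and not necessarily how the script does it, is to use the character offset of each token (`Token.idx` in spaCy) and pick the first non-punctuation token inside each word's span. The helper name `align_tokens_to_words` and the token values below are hand-written for illustration, not real model output:

```python
def align_tokens_to_words(tokens, word_spans):
    """Pick one token per word span: the first non-punctuation token inside it.

    tokens: list of (start_char, text, lemma, pos, dep) tuples.
    word_spans: list of (start, end) character offsets, one per word.
    """
    result = []
    for start, end in word_spans:
        inside = [t for t in tokens if start <= t[0] < end]
        content = [t for t in inside if t[3] != "PUNCT"]
        # Fall back to a punctuation token, or None, if nothing else is there.
        result.append((content or inside or [None])[0])
    return result


# With spaCy, the tuples could be built as:
#   tokens = [(t.idx, t.text, t.lemma_, t.pos_, t.dep_) for t in doc]
# Here they are written by hand for the text "Hello, world."
tokens = [
    (0, "Hello", "hello", "INTJ", "intj"),
    (5, ",", ",", "PUNCT", "punct"),
    (7, "world", "world", "NOUN", "ROOT"),
    (12, ".", ".", "PUNCT", "punct"),
]
word_spans = [(0, 6), (7, 13)]  # "Hello," and "world."

aligned = align_tokens_to_words(tokens, word_spans)
for token in aligned:
    print(token)
```

Choosing the first content token is a simplification; a word that spaCy splits into several content tokens would need a more deliberate rule.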

To get the full list of attributes available for a word, refer to the spaCy Token API documentation.

To streamline the process, I have created a simple script that modifies EMU annotation files and adds the linguistic annotation to them:
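As a rough illustration of the idea (a sketch, not the script itself), the snippet below appends extra labels to every item of one level in a parsed annotation file. It assumes the usual emuDB `_annot.json` layout, in which each level has a list of `items` and each item has a list of `labels` with `name` and `value` fields. The function name `add_attributes` and the sample data are hypothetical, and the new attributes would also have to be declared in the database configuration before EMU can use them:

```python
import json


def add_attributes(annot, level_name, features):
    """Append extra labels (e.g. lemma, POS, DEP) to each item of one level, in place.

    annot: parsed contents of a bundle's _annot.json file (assumed layout).
    features: one dict per item, e.g. {"lemma": "dog", "POS": "NOUN", "DEP": "nsubj"}.
    """
    level = next(lv for lv in annot["levels"] if lv["name"] == level_name)
    if len(level["items"]) != len(features):
        raise ValueError("need exactly one feature dict per item")
    for item, feats in zip(level["items"], features):
        existing = {label["name"] for label in item["labels"]}
        for name, value in feats.items():
            if name not in existing:  # do not duplicate a label on re-runs
                item["labels"].append({"name": name, "value": value})
    return annot


# Hypothetical two-word bundle with a single Word level.
annot = {"levels": [{"name": "Word", "type": "SEGMENT", "items": [
    {"id": 1, "sampleStart": 0, "sampleDur": 100,
     "labels": [{"name": "Word", "value": "Dogs"}]},
    {"id": 2, "sampleStart": 101, "sampleDur": 120,
     "labels": [{"name": "Word", "value": "bark."}]},
]}]}

add_attributes(annot, "Word", [
    {"lemma": "dog", "POS": "NOUN", "DEP": "nsubj"},
    {"lemma": "bark", "POS": "VERB", "DEP": "ROOT"},
])
print(json.dumps(annot["levels"][0]["items"][0]["labels"], indent=2))
```

In practice the dictionary would be read with `json.load` from each bundle's annotation file and written back with `json.dump` after the labels are added.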

Sample usage

After annotating a corpus using the above code, we can load it in R:


library(emuR)
db <- load_emuDB('/home/danijel/Desktop/PINC/pinc')

Looking at the level definitions:

list_levelDefinitions(db)

We get the following result:

  name    type nrOfAttrDefs           attrDefNames
1 Word SEGMENT            4 Word; lemma; POS; DEP;

For our purposes, we want to count the lemmatized words in the original (untranslated) portion of the corpus, including only nouns that are also sentence subjects. We begin by constructing the following query:

nouns <- query(db, "[ POS =~ '^NOUN' & DEP =~ 'nsubj' ]", bundlePattern = "EN.*en")

Next we replace the POS labels in the query result with the lemma of each word:


Next we remove all capitalization and punctuation from the words:
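The normalization itself is simple. Purely as an illustration, and in Python rather than R, it could look like the sketch below, which assumes that only ASCII punctuation needs to be removed. The helper name `normalize` and the sample labels are made up:

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_NO_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(label):
    """Remove ASCII punctuation from a label and lowercase it."""
    return label.translate(_NO_PUNCT).lower()


# Hypothetical labels as they might come out of the query.
labels = ["Dog,", "dog", "Cat.", "dog!"]
counts = Counter(normalize(label) for label in labels)
print(counts.most_common())
```

The `Counter` at the end plays the same role as the counting step that follows.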



Finally, we count the words using the table command and write the result to an Excel file:


write_xlsx(data.frame(table(nouns)), 'counts.xlsx')

The output looks something like this:


