Speaker Identification

Speaker identification is the process of determining who spoke a particular piece of recorded speech. In some cases this is a single long recording, but in other situations people may exchange roles frequently within a single recording session – in that case we first segment the speech into portions where, ideally, only one person speaks at a time and apply the identification procedure to each segment separately.

Speaker identification is often referred to as speaker verification. Although the two are usually solved in a very similar manner (using the same tools and models), the terms aren’t synonymous. Verification implies that we are trying to confirm the claimed identity of a particular person given a recording, whereas identification tries to match a set of recordings to a set of identities.

Another process that looks very similar is speaker diarization. This, however, is a different problem. The key difference between identification and diarization is that the former assumes the existence of a collection of voice samples for each person (known as the enrollment data), whereas the latter has no information about the actual identities of the people it is trying to recognize. Diarization simply assigns anonymous (random or sequential) labels to each detected voice, while making sure that the same voice gets the same label regardless of where it occurs in the audio. Technically, whereas speaker identification can be solved using regular classification methods, speaker diarization is phrased as a clustering problem due to its unsupervised nature.

In our project, we wanted to assign speaker labels to recordings of interpretations of parliamentary speeches. These recordings usually contain the voice of a single person, and if there are interjections by other speakers (usually at the beginning or end of each file), we analyze only the voice of the person who speaks for the majority of the time. Furthermore, we were able to collect a small database of samples for each person we wished to identify, i.e. the interpreters, which was the main reason for attempting to use machine learning to solve the problem. The identity of the actual parliament speakers (i.e. politicians) is usually known and in most cases announced at the start of each recording. The identity of the interpreters is not included in the recordings of the European Parliament sessions available online, but it is very relevant for our study of the interpreting process.

One final important note: this problem is generally regarded as completely language independent. That is good news when working on a problem concerning speech in many languages, as in our project. It also means that we can use models trained on large amounts of high-quality data, regardless of the language we have to analyze.

Tools and methods

Speaker identification by voice audio analysis is an area with a long tradition and many applications in research, government and commerce. The NIST SRE evaluations have been organized since 1996, and the 2018 edition attracted 48 teams from around the world. The field has clearly mattered to many people for a long time, and that sustained attention has produced many methods over the years.

For much of the first half of this decade, the prevailing approach was what is known as the i-vector. This technique uses factor analysis to convert a recording of any length into a low-dimensional representation of the speaker known as the identity vector. The idea of using vector representations for large-scale data is not unique to speaker analysis – the popularity of embeddings made the method very attractive in many situations where speaker information may be useful. Apart from the aforementioned speaker identification and diarization, such a vector is easy to integrate into other models, for example a speech recognition system. As shown below, it is also easy to visualize, which makes it more accessible to people with different backgrounds.

It didn’t take long for others to try and improve on that approach using another technique well suited to producing vector output: the deep neural network. The idea of using neural networks for speaker recognition is not new, but recent advances in methods and available data made it possible to achieve better results than ever before. This created a form of competition in which people came up with original names like the d-vector and x-vector to mark their place in the field. Regardless of the name, the idea is always similar to that of its progenitor, at least from the perspective of the user – each recording, regardless of length, is converted into a fixed-length, low-dimensional embedding in a vector space representing speakers from a population present in the data.

Speaking of data, these novel approaches wouldn’t be possible without large annotated speech databases. Speech recognition corpora like LibriSpeech and Common Voice can sometimes be used for this purpose, but the two most popular corpora mentioned in competitions like SRE are Speakers in the Wild (SITW) and VoxCeleb.

This is a good place to mention the nomenclature used for describing data subsets in these circles. Normally in machine learning, we use the term „training data” (used for training the model) as something completely separate from „test data” (aka evaluation data, used for assessing the final performance of the model) and „development data” (aka validation data, used for tuning the hyper-parameters of the model). In speaker identification competitions, these terms often mean something else: training data usually stands for the enrollment data used to designate the speakers we are trying to identify, test data is the unidentified collection of recordings we need to identify and measure our results on, and development data refers to the large corpora (mentioned above) used to teach the model the common characteristics of voices in the general population. It is not uncommon to use more than one „train/test” set while developing a model. One is made available during the competition, but if you want to make sure you are making progress while preparing for it, it is best to create a train/test set of your own, or use one from a previous competition, if available.

It is important to note that the attractiveness of the vector representation does not mean that using it in problems like speaker identification is completely trivial. The whole algorithm, optimized for use in the competitions, has several steps:

  1. voice activity detection – it is important to discard any audio that is not the voice of the person we are trying to identify, as it would skew the results.
  2. ?-vector extraction – the extraction can be tuned in many ways: it can produce a series of vectors for short segments that are later averaged, or a single vector for one long piece of audio.
  3. training data – each speaker can have several samples (i.e. recordings) of their voice in the enrollment data. We need to average their vectors to obtain one vector per speaker we wish to identify. It is also useful to keep track of how many vectors went into each average, as that can be used in the final classification step.
  4. global mean subtraction and normalization – a standard preprocessing step for any classification task; here it is worth noting that the mean is usually computed on the much larger development set, rather than on the test data alone.
  5. LDA transformation – dimensionality reduction is another common step in classification. It is also estimated on the larger development set and tuned to maximize the performance of the classifier.
  6. PLDA classification – the most popular scoring method in speaker identification challenges. It scores every speaker–test pair; picking the maximum score determines the winner, and the value of the score can also tell us whether no enrolled speaker matches the provided sample. A simplified sketch of steps 3–6 is shown below the list.
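
To make steps 3–6 more concrete, below is a minimal sketch in Python/NumPy. Everything in it is illustrative: the array shapes and the LDA matrix are made up, and plain cosine similarity stands in for the PLDA scoring that is actually used.

import numpy as np

def preprocess(vectors, global_mean, lda):
    # steps 4-5: subtract the global mean, apply the LDA transform, normalize to unit length
    x = vectors - global_mean
    x = x @ lda.T
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# hypothetical inputs: per-recording vectors grouped by enrolled speaker, plus test vectors
enroll = {"spk1": np.random.randn(3, 128), "spk2": np.random.randn(2, 128)}
test = np.random.randn(5, 128)
global_mean = np.zeros(128)       # in practice computed on the large development set
lda = np.random.randn(32, 128)    # stand-in for a trained LDA transform

# step 3: average each speaker's enrollment vectors into a single vector
speakers = sorted(enroll)
enroll_avg = np.stack([enroll[s].mean(axis=0) for s in speakers])

E = preprocess(enroll_avg, global_mean, lda)
T = preprocess(test, global_mean, lda)

# step 6 (simplified): one score per (test file, speaker) pair; PLDA replaced by cosine similarity
scores = T @ E.T
print(scores.shape)   # (number of test files, number of enrolled speakers)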

Our attempt

We collected a small sample of voices of 32 interpreters. Each interpreter had between 1 and 3 recordings, and each of the 57 recordings was between 21 and 192 seconds long. Altogether, the whole data set was ~83 minutes long.

After extracting the XVectors of each recording, we averaged them to get 32 speaker vectors, each with 128 dimensions. Below, we used the popular t-SNE algorithm to visualize this multidimensional space in 2D:

Each colored dot represents a file belonging to an individual speaker (we assigned fake names to the speakers’ samples). The annotated crosses are the vectors representing the „average” of each speaker. We can observe several potential issues with the data: the samples of some speakers are not always close to their average, and several speakers are very close to each other, which means they will be difficult to differentiate. As an aside, note how the method separated male speakers from female ones. This was not intentional and simply stems from the similarity of the voices.

The next step was to take a collection of unlabeled audio files and use the method on them. For this demonstration, we took 246 files, but this time we always extracted a 30-second segment from the middle of each file. For the visualization, we use the same crosses as in the plot above, but the dots now represent the still unidentified files.
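
As a side note, the 30-second cut can be done with standard tools. Below is a minimal sketch using Python’s built-in wave module; the file names are made up, and it assumes the uncompressed 16 kHz mono WAV format required later by the feature extraction.

import wave

def middle_segment(src, dst, seconds=30):
    # copy `seconds` of audio taken from the middle of src into dst
    with wave.open(src, 'rb') as win:
        rate = win.getframerate()
        total = win.getnframes()
        want = min(seconds * rate, total)    # fall back to the whole file if it is shorter
        win.setpos((total - want) // 2)      # seek to the start of the middle segment
        frames = win.readframes(want)
        with wave.open(dst, 'wb') as wout:
            wout.setparams(win.getparams())  # keep channels, sample width and rate
            wout.writeframes(frames)

# hypothetical file names
middle_segment('/mnt/audio/test1.wav', '/mnt/audio/test1_30s.wav')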

We can use this information to try and assign each file to a speaker, but there are a few observations we can make first. We can see that not all speakers are present in this collection of files (some crosses don’t have any dots close to them), but also that there are a few files (in the center, around coordinates [7,-7]) that don’t match any of the speakers in our enrollment database (these dots have no crosses close to them). To perform the matching, we assign a score to each speaker–file pair (so 7872 different scores) and for each file choose the speaker with the highest score. However, if that score is too low (usually below 0), we conclude that the file doesn’t match any of the speakers in our enrollment database.
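
The assignment itself then boils down to an argmax with a rejection threshold. Here is a minimal sketch, assuming a score matrix with one row per file and one column per enrolled speaker; the names are made up and the threshold of 0 is the value we used, which can be tuned.

import numpy as np

# hypothetical inputs: scores[i, j] is the score of test file i against enrolled speaker j
scores = np.random.randn(246, 32)
files = [f"test{i}" for i in range(246)]
speakers = [f"spk{j}" for j in range(32)]
THRESHOLD = 0.0   # scores below this are treated as "none of the enrolled speakers"

for i, f in enumerate(files):
    j = int(np.argmax(scores[i]))
    if scores[i, j] < THRESHOLD:
        print(f, '-> unknown speaker')
    else:
        print(f, '->', speakers[j])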

Results

Initially, our set of files was judged „by ear” by two human experts to match the voices to actual identities. It was noted that this was not an easy task. Out of the 246 files, there were 23 mismatches between the human judges and the automatic results. These 23 recordings were then re-verified by the experts, and the automatic result was deemed correct in all but one recording, which turned out to be outside of the enrollment set.

While we are very pleased with our results, there are a few things that could be done to improve the outcome if the method does not work as intended:

  • improve the quality of the data – e.g. clean up the audio, fix endpointing, make sure voice activity detection works properly
  • retrain the normalization, LDA transformation and PLDA classifier on our own data – this isn’t very difficult, but depends on the number of samples available
  • adapt the XVector models – this would be very difficult and time-consuming to do

How it was done

Here we will explain step-by-step how to obtain the above results using Kaldi. Installing it is not too difficult if you’re used to open-source projects. In fact, Kaldi is designed to be as self-contained as possible: it minimizes the number of components that affect the system as a whole and mostly lives within a single chosen directory, which eases deployment in cluster computing environments. The idea is to clone the official repository: https://github.com/kaldi-asr/kaldi and follow the INSTALL instructions therein. They are targeted mostly at Linux and Mac, but a Windows setup also exists in the windows sub-directory.

The models we used are available on the project’s official model page (http://kaldi-asr.org/models.html). We used the model with the code M8, which seemed to be trained on the largest amount of data (at the time of publishing). We used version 1a, which is the XVector model referred to above. After unpacking the archive, you will find the exp/xvector_nnet_1a folder containing the XVector and PLDA models; everything else is safe to ignore and delete. To make things work, you will also need to copy or symlink the following files and folders from the Kaldi distribution:

  • $KALDI_HOME/egs/sitw/v2/path.sh – you need to copy and edit this file to point to the location where you installed Kaldi
  • $KALDI_HOME/egs/sitw/v2/conf – contains the configuration files with parameters used to train the model
  • $KALDI_HOME/egs/sitw/v2/sid – contains scripts related specifically to speaker identification
  • $KALDI_HOME/egs/sitw/v2/steps – contains generic scripts for different procedures used in Kaldi
  • $KALDI_HOME/egs/sitw/v2/utils – contains small utility scripts useful for data manipulation and error checking

Next, we create the data directory and, inside it, two subdirectories: enrollment and test. Each of them contains the following files:

  • wav.scp – contains the list of audio files and their paths, e.g.:
utt1 /mnt/audio/utt1.wav
utt2 /mnt/audio/utt2.wav
utt3 /mnt/audio/utt3.wav
utt4 /mnt/audio/utt4.wav
  • text – this would normally host transcriptions for the above files, but since we don’t need it for this task, we can leave the transcriptions empty. All you need is a list of the above files. You can easily get that using the following command:
cut -f1 -d' ' wav.scp > text
  • utt2spk – a mapping of files to the names of the speakers. For enrollment this could look like this:
utt1 spk1
utt2 spk1
utt3 spk2
utt4 spk3

For test, the speakers aren’t known, so we can map each file to a new unknown speaker. It’s easiest to simply use the name of the file as the name of the speaker, e.g.:

test1 test1
test2 test2
test3 test3
test4 test4
  • spk2utt – the inverse of the above mapping. This can be easily obtained using the following utility script:
./utils/utt2spk_to_spk2utt.pl data/enrollment/utt2spk > data/enrollment/spk2utt

Once these are created, we can check their correctness using the following script:

./utils/validate_data_dir.sh --no-feats data/enrollment

If everything is okay, you should get a „Successfully validated data-directory” message. Otherwise, detailed information on the error should be provided. Note that the above and all the following examples use the enrollment directory as the example, but they should be repeated on the test directory as well.

After that, we can extract the features from the files. This script automatically uses the options provided in the conf directory. Note that the files should be saved as uncompressed WAV, with one channel and a sampling frequency of 16 kHz; otherwise, this step will likely fail:

./steps/make_mfcc.sh --nj 10 data/enrollment

The nj option stands for „number of jobs” and allows speeding up the computation using multiple processes. It cannot be larger than the number of speakers in the directory and should be similar to the number of CPUs available on your system.

The next step is to compute the VAD:

./sid/compute_vad_decision.sh --nj 10 data/enrollment

As mentioned earlier, voice activity detection discards non-speech audio; the detector used here is a simple energy-based one that is sufficient for our files (which were recorded in a booth and have very little background noise). For more challenging recordings, you should look at neural-network based solutions, like the one included in the model M4 on the Kaldi models page.

If everything went well, we can now extract the XVectors using the following command:

./sid/nnet3/xvector/extract_xvectors.sh exp/xvector_nnet_1a data/enrollment exp/enrollment_xvectors

The last argument is the destination directory where the XVector files will be stored. To visualize them, we first need to do some preprocessing:

ivector-subtract-global-mean $plda/mean.vec  scp:$xvector/xvector.scp ark:- | transform-vec $plda/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark,t:xvectors.txt

Here, $plda points to the exp/xvector_nnet_1a/xvectors_train_combined_200k directory and $xvector to the chosen XVectors directory, e.g. exp/enrollment_xvectors. The resulting xvectors.txt file contains a list of XVectors that can easily be parsed in Python:

import numpy as np

def load(file):
    label=[]
    vec=[]
    with open(file) as f:
        for l in f:
            # each line looks like: <id> [ v1 v2 ... vn ]
            tok=l.strip().split()
            label.append(tok[0])                       # utterance/speaker ID
            vec.append([float(v) for v in tok[2:-1]])  # skip the "[" and "]" tokens
    return label,np.array(vec)
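
Assuming the preprocessing command above was run twice, once for the enrollment XVectors and once for the test XVectors, and that the outputs were saved under names of our own choosing, the vectors can be loaded and combined like this:

enroll_labels, enroll_vecs = load('enrollment_xvectors.txt')  # hypothetical output file names
test_labels, test_vecs = load('test_xvectors.txt')
data = np.concatenate([enroll_vecs, test_vecs])               # embed both sets together (see the note below)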

Using tSNE is easy in scikit-learn:

from sklearn.manifold import TSNE
emb=TSNE(n_components=2).fit_transform(data)

Note that if you want to compare different sets (e.g. enrollment and test), you should run the above command on all your data together, as doing it separately can generate a different embedding for each set, and then they wouldn’t match in the final drawing. To draw the result, we can use matplotlib’s scatter function:

import matplotlib.pyplot as P
P.scatter(emb[:,0],emb[:,1])
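
To reproduce a plot similar to the ones above, the embedded points can be split back into the two sets and drawn with different markers. The snippet below follows the variable names from the loading sketch earlier; here the enrollment vectors are simply drawn as crosses, without the per-speaker averaging used for the plots in this post.

n = len(enroll_vecs)                              # the first n rows of emb come from the enrollment set
P.scatter(emb[n:, 0], emb[n:, 1], marker='.', label='test files')
P.scatter(emb[:n, 0], emb[:n, 1], marker='x', c='black', label='enrollment')
for (x, y), name in zip(emb[:n], enroll_labels):
    P.annotate(name, (x, y))                      # label each enrollment point with its speaker name
P.legend()
P.show()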

Finally, we want to compute the scores and find a speaker for each file. First we need to create a trials file describing what scores we want to compute:

cut -f1 -d' ' data/enrollment/spk2utt | while read spk ; do
	cut -f1 -d' ' data/test/utt2spk | while read utt ; do
		echo $spk $utt
	done
done > trials

We can then compute the scores using the following command:

ivector-plda-scoring --normalize-length=true --num-utts=ark:exp/enrollment_xvectors/num_utts.ark \
	$plda/plda \
	"ark:ivector-mean ark:data/enrollment/spk2utt scp:exp/enrollment_xvectors/xvector.scp ark:- | \
		ivector-subtract-global-mean $plda/mean.vec ark:- ark:- | \
		transform-vec $plda/transform.mat ark:- ark:- | \
		ivector-normalize-length ark:- ark:- |" \
	"ark:ivector-subtract-global-mean $plda/mean.vec scp:exp/test_xvectors/xvector.scp ark:- | \
		transform-vec $plda/transform.mat ark:- ark:- | \
		ivector-normalize-length ark:- ark:- |" \
	trials scores

This command takes all the files we created earlier and generates a text file called scores, which contains a score for each speaker to utterance mapping we provided in the trials file. To convert it into something more readable, we used this Python program to generate an Excel file:

import argparse
import xlsxwriter

if __name__ == '__main__':
    parser=argparse.ArgumentParser()
    parser.add_argument('scores')   # the scores file produced by ivector-plda-scoring
    parser.add_argument('output')   # the Excel file to write
    
    args=parser.parse_args()
    
    # read the scores into a nested dict: utts[utterance][speaker] = score
    spks={}
    utts={}
    with open(args.scores) as f:
        for l in f:
            tok=l.strip().split()
            spk=tok[0]
            utt=tok[1]
            score=float(tok[2])
            
            if spk not in spks:
                spks[spk]=len(spks)   # remember a column index for each speaker
            
            if utt not in utts:
                utts[utt]={}
            utts[utt][spk]=score
    
    workbook = xlsxwriter.Workbook(args.output)
    ws = workbook.add_worksheet()
    
    # header row: best match first, then one column per speaker
    ws.write(0,0,'Utterance')
    ws.write(0,1,'Best speaker')
    ws.write(0,2,'Best score')
    for i,spk in enumerate(spks.keys()):
        ws.write(0,3+i,spk)
    
    # one row per utterance with all its scores and the best-scoring speaker
    for r,(utt,scs) in enumerate(utts.items()):
        ws.write(r+1,0,utt)
        best_spk=''
        best_sc=-999999
        for spk,sc in scs.items():
            c=spks[spk]+3
            ws.write(r+1,c,sc)
            if best_sc<sc:
                best_sc=sc
                best_spk=spk
        ws.write(r+1,1,best_spk)
        ws.write(r+1,2,best_sc)
    
    workbook.close()
