Create i-vector system
Since R2021a
Description
i-vectors are compact statistical representations of identity extracted from audio signals. ivectorSystem creates a trainable i-vector system to extract i-vectors and perform classification tasks such as speaker recognition, speaker diarization, and sound classification. You can also determine thresholds for open set tasks and enroll labels into the system for both open and closed set classification.
creation
Properties
InputType - Type of input
"audio" (default) | "features"
Input type, specified as "audio" or "features".
"audio" –– The i-vector system accepts mono audio signals as input. The audio data is processed to extract 20 mel frequency cepstral coefficients (MFCCs), delta MFCCs, and delta-delta MFCCs, for 60 coefficients per frame.
If InputType is set to "audio" when the i-vector system is created, the training data can be:
  - A cell array of single-channel audio signals, each specified as a column vector with underlying type single or double.
  - An audioDatastore object or a signalDatastore object that points to a data set of mono audio signals.
  - A TransformedDatastore with an underlying audioDatastore or signalDatastore that points to a data set of mono audio signals. The output from calls to read from the transform datastore must be mono audio signals with underlying data type single or double.
"features" –– The i-vector system accepts pre-extracted audio features as input.
If InputType is set to "features" when the i-vector system is created, the training data can be:
  - A cell array of matrices with underlying type single or double. The matrices must consist of audio features, where the number of features (columns) is locked the first time trainExtractor is called and the number of hops (rows) is variable-sized. The number of features input in any subsequent calls to any of the object functions must be equal to the number of features used when calling trainExtractor.
  - A TransformedDatastore object with an underlying audioDatastore or signalDatastore whose read function has output as described in the previous bullet.
  - A signalDatastore object whose read function has output as described in the first bullet.
Example: ivs = ivectorSystem(InputType="audio")
Data Types: char | string
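The constraints on feature input can be checked before training. The following is a minimal sketch in base MATLAB, assuming a cell array of hops-by-features matrices; the random matrices and the 60-column width (mirroring the 20 MFCC + 20 delta + 20 delta-delta layout described for audio input) are illustrative assumptions, and the check itself is not part of ivectorSystem.

```matlab
% Hypothetical pre-flight check for "features" input (not an ivectorSystem API).
% Each cell holds a hops-by-features matrix; the values here are random placeholders.
features = {rand(120,60), rand(95,60,"single"), rand(210,60)};

% Every element must be a 2-D single or double matrix.
isfloatmatrix = cellfun(@(x) ismatrix(x) && (isa(x,"single") || isa(x,"double")),features);
numfeatures = cellfun(@(x) size(x,2),features);   % columns: must match across signals
numhops = cellfun(@(x) size(x,1),features);       % rows: may differ per signal

assert(all(isfloatmatrix),"Each element must be a single or double matrix.")
assert(numel(unique(numfeatures)) == 1,"All matrices need the same number of columns.")
fprintf("%d signals, %d features per hop, %d to %d hops\n", ...
    numel(features),numfeatures(1),min(numhops),max(numhops))
```

Running the check once on the training set makes a column-count mismatch visible early, rather than in a later call to another object function.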
SampleRate - Sample rate of audio input in Hz
16000 (default) | positive scalar
Sample rate of the audio input in Hz, specified as a positive scalar.
Note
The SampleRate property applies only when InputType is set to "audio".
Example: ivs = ivectorSystem(InputType="audio",SampleRate=48000)
Data Types: single | double
DetectSpeech - Apply speech detection
true (default) | false
Apply speech detection, specified as true or false. With DetectSpeech set to true, the i-vector system extracts features only from regions where speech is detected.
Note
The DetectSpeech property applies only when InputType is set to "audio".
ivectorSystem uses the detectSpeech function to detect regions of speech.
Example: ivs = ivectorSystem(InputType="audio",DetectSpeech=true)
Data Types: logical | single | double
Verbose - Display training progress
true (default) | false
Display training progress, specified as true or false. With Verbose set to true, the i-vector system displays the training progress in the Command Window or the Live Editor.
Tip
To toggle between verbose and non-verbose behavior, use dot notation to set the Verbose property between object function calls.
Example: ivs = ivectorSystem(InputType="audio",Verbose=false)
Data Types: logical | single | double
EnrolledLabels - Table containing enrolled labels
0-by-2 table (default)
This property is read-only.
Table containing enrolled labels, specified as a table. Table row names correspond to labels, and column names correspond to the template i-vector and the number of individual i-vectors used to generate the template i-vector. The number of i-vectors used to generate the template i-vector can be viewed as a measure of confidence in the template.
Use enroll to enroll new labels or update existing labels.
Use unenroll to remove labels from the system.
Data Types: table
Object Functions
trainExtractor | Train i-vector extractor
trainClassifier | Train i-vector classifier
calibrate | Train i-vector system calibrator
enroll | Enroll labels
unenroll | Unenroll labels
detectionErrorTradeoff | Evaluate binary classification system
verify | Verify label
identify | Identify label
ivector | Extract i-vector
info | Return training configuration and data info
addInfoHeader | Add custom information about i-vector system
release | Allow property values and input characteristics to change
Examples
Train Speaker Verification System
Use the Pitch Tracking Database from Graz University of Technology (PTDB-TUG) [1]. The data set consists of 20 English native speakers reading 2342 phonetically rich sentences from the TIMIT corpus. Download and extract the data set.
downloadfolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
datafolder = tempdir;
unzip(downloadfolder,datafolder)
dataset = fullfile(datafolder,"ptdb-tug");
Create an audioDatastore object that points to the data set. The data set was originally intended for use in pitch-tracking training and evaluation and includes laryngograph readings and baseline pitch decisions. Use only the original audio recordings.
ads = audioDatastore([fullfile(dataset,"speech data","female","mic"),fullfile(dataset,"speech data","male","mic")], ...
    IncludeSubfolders=true, ...
    FileExtensions=".wav");
The file names contain the speaker IDs. Decode the file names to set the labels in the audioDatastore object.
ads.Labels = extractBetween(ads.Files,"mic_","_");
countEachLabel(ads)
ans=20×2 table
label count
_____ _____
f01 236
f02 236
f03 236
f04 236
f05 236
f06 236
f07 236
f08 234
f09 236
f10 236
m01 236
m02 236
m03 236
m04 236
m05 236
m06 236
⋮
Read an audio file from the data set, listen to it, and plot it.
[audioin,audioinfo] = read(ads);
fs = audioinfo.SampleRate;
t = (0:size(audioin,1)-1)/fs;
sound(audioin,fs)
plot(t,audioin)
xlabel("Time (s)")
ylabel("Amplitude")
axis([0 t(end) -1 1])
title("Sample Utterance from Data Set")
Separate the audioDatastore object into four: one for training, one for enrollment, one to evaluate the detection-error tradeoff, and one for testing. The training set contains 16 speakers. The enrollment, detection-error tradeoff, and test sets contain the other four speakers.
speakerstotest = categorical(["m01","m05","f01","f05"]);
adstrain = subset(ads,~ismember(ads.Labels,speakerstotest));
ads = subset(ads,ismember(ads.Labels,speakerstotest));
[adsenroll,adstest,adsdet] = splitEachLabel(ads,3,1);
Display the label distributions of the audioDatastore objects.
countEachLabel(adstrain)
ans=16×2 table
label count
_____ _____
f02 236
f03 236
f04 236
f06 236
f07 236
f08 234
f09 236
f10 236
m02 236
m03 236
m04 236
m06 236
m07 236
m08 236
m09 236
m10 236
countEachLabel(adsenroll)
ans=4×2 table
label count
_____ _____
f01 3
f05 3
m01 3
m05 3
countEachLabel(adstest)
ans=4×2 table
label count
_____ _____
f01 1
f05 1
m01 1
m05 1
countEachLabel(adsdet)
ans=4×2 table
label count
_____ _____
f01 232
f05 232
m01 232
m05 232
Create an i-vector system. By default, the i-vector system assumes the input to the system is mono audio signals.
speakerverification = ivectorSystem(SampleRate=fs)
speakerverification =
  ivectorSystem with properties:
         InputType: 'audio'
        SampleRate: 48000
      DetectSpeech: 1
           Verbose: 1
    EnrolledLabels: [0×2 table]
To train the extractor of the i-vector system, call trainExtractor. Specify the number of universal background model (UBM) components as 128 and the number of expectation maximization iterations as 5. Specify the total variability space (TVS) rank as 64 and the number of iterations as 3.
trainExtractor(speakerverification,adstrain, ...
    UBMNumComponents=128,UBMNumIterations=5, ...
    TVSRank=64,TVSNumIterations=3)
Calculating standardization factors ....done.
Training universal background model ........done.
Training total variability space ......done.
i-vector extractor training complete.
To train the classifier of the i-vector system, use trainClassifier. To reduce the dimensionality of the i-vectors, specify the number of eigenvectors in the projection matrix as 16. Specify the number of dimensions in the probabilistic linear discriminant analysis (PLDA) model as 16, and the number of iterations as 3.
trainClassifier(speakerverification,adstrain,adstrain.Labels, ...
    NumEigenvectors=16, ...
    PLDANumDimensions=16,PLDANumIterations=3)
Extracting i-vectors ...done.
Training projection matrix .....done.
Training PLDA model ......done.
i-vector classifier training complete.
To calibrate the system so that scores can be interpreted as a measure of confidence in a positive decision, use calibrate.
calibrate(speakerverification,adstrain,adstrain.Labels)
Extracting i-vectors ...done.
Calibrating CSS scorer ...done.
Calibrating PLDA scorer ...done.
Calibration complete.
To inspect the parameters used previously to train the i-vector system, use info.
info(speakerverification)
i-vector system input
  Input feature vector length: 60
  Input data type: double
trainExtractor
  Train signals: 3774
  UBMNumComponents: 128
  UBMNumIterations: 5
  TVSRank: 64
  TVSNumIterations: 3
trainClassifier
  Train signals: 3774
  Train labels: f02 (236), f03 (236) ... and 14 more
  NumEigenvectors: 16
  PLDANumDimensions: 16
  PLDANumIterations: 3
calibrate
  Calibration signals: 3774
  Calibration labels: f02 (236), f03 (236) ... and 14 more
Split the enrollment set.
[adsenrollpart1,adsenrollpart2] = splitEachLabel(adsenroll,1,2);
To enroll speakers in the i-vector system, call enroll.
enroll(speakerverification,adsenrollpart1,adsenrollpart1.Labels)
Extracting i-vectors ...done.
Enrolling i-vectors .......done.
Enrollment complete.
When you enroll speakers, the read-only EnrolledLabels property is updated with the enrolled labels and corresponding template i-vectors. The table also keeps track of the number of signals used to create the template i-vector. Generally, using more signals results in a better template.
speakerverification.EnrolledLabels
ans=4×2 table
ivector numsamples
_____________ __________
f01 {16×1 double} 1
f05 {16×1 double} 1
m01 {16×1 double} 1
m05 {16×1 double} 1
Enroll the second part of the enrollment set and then view the enrolled labels table again. The i-vector templates and the number of samples are updated.
enroll(speakerverification,adsenrollpart2,adsenrollpart2.Labels)
Extracting i-vectors ...done.
Enrolling i-vectors .......done.
Enrollment complete.
speakerverification.EnrolledLabels
ans=4×2 table
ivector numsamples
_____________ __________
f01 {16×1 double} 3
f05 {16×1 double} 3
m01 {16×1 double} 3
m05 {16×1 double} 3
To evaluate the i-vector system and determine a decision threshold for speaker verification, call detectionErrorTradeoff.
[results,eerthreshold] = detectionErrorTradeoff(speakerverification,adsdet,adsdet.Labels);
Extracting i-vectors ...done.
Scoring i-vector pairs ...done.
Detection error tradeoff evaluation complete.
The first output from detectionErrorTradeoff is a structure with two fields: CSS and PLDA. Each field contains a table. Each row of the table contains a possible decision threshold for speaker verification tasks, and the corresponding false alarm rate (FAR) and false rejection rate (FRR). The FAR and FRR are determined using the enrolled speaker labels and the data input to the detectionErrorTradeoff function.
results
results = struct with fields:
plda: [1000×3 table]
css: [1000×3 table]
results.CSS
ans=1000×3 table
threshold far frr
__________ _______ ___
2.3259e-10 1 0
2.3965e-10 0.99964 0
2.4693e-10 0.99928 0
2.5442e-10 0.99928 0
2.6215e-10 0.99928 0
2.701e-10 0.99928 0
2.783e-10 0.99928 0
2.8675e-10 0.99928 0
2.9545e-10 0.99928 0
3.0442e-10 0.99928 0
3.1366e-10 0.99928 0
3.2318e-10 0.99928 0
3.3299e-10 0.99928 0
3.431e-10 0.99928 0
3.5352e-10 0.99928 0
3.6425e-10 0.99892 0
⋮
results.PLDA
ans=1000×3 table
threshold far frr
__________ _______ ___
3.2661e-40 1 0
3.6177e-40 0.99964 0
4.0072e-40 0.99964 0
4.4387e-40 0.99964 0
4.9166e-40 0.99964 0
5.4459e-40 0.99964 0
6.0322e-40 0.99964 0
6.6817e-40 0.99964 0
7.4011e-40 0.99964 0
8.198e-40 0.99964 0
9.0806e-40 0.99964 0
1.0058e-39 0.99964 0
1.1141e-39 0.99964 0
1.2341e-39 0.99964 0
1.3669e-39 0.99964 0
1.5141e-39 0.99964 0
⋮
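To make the FAR and FRR columns concrete, the following is a minimal sketch in base MATLAB that builds a small table of the same shape from made-up trial scores. The score values, the threshold grid, and the "accept when score is at least the threshold" rule are assumptions for illustration and are not taken from detectionErrorTradeoff.

```matlab
% Synthetic scores, for illustration only (not produced by ivectorSystem).
targetscores    = [0.91 0.85 0.78 0.66 0.52 0.95];   % same-label trials
nontargetscores = [0.12 0.30 0.45 0.58 0.22 0.70];   % different-label trials
thresholds = linspace(0,1,11)';

% Assumed decision rule: accept a trial when score >= threshold.
% FAR: fraction of non-target trials that are accepted.
far = arrayfun(@(th) mean(nontargetscores >= th),thresholds);
% FRR: fraction of target trials that are rejected.
frr = arrayfun(@(th) mean(th > targetscores),thresholds);

dettable = table(thresholds,far,frr,VariableNames=["Threshold","FAR","FRR"]);
disp(dettable)
```

As the threshold rises, FAR can only fall and FRR can only rise, which is why the real tables start at FAR = 1 and FRR = 0.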
The second output from detectionErrorTradeoff is a structure with two fields: CSS and PLDA. The corresponding value is the decision threshold that results in the equal error rate (when FAR and FRR are equal).
eerthreshold
eerthreshold = struct with fields:
plda: 0.0398
css: 0.9369
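The idea of an equal error rate threshold can be sketched on a discrete grid: pick the threshold where the FAR and FRR curves are closest. The curves below are made-up values, and the nearest-point rule is an assumption; how detectionErrorTradeoff interpolates between thresholds is not described here.

```matlab
% Illustrative FAR/FRR curves on a shared threshold grid (made-up values).
threshold = [0.1 0.2 0.3 0.4 0.5 0.6 0.7]';
far       = [0.90 0.60 0.35 0.20 0.10 0.04 0.00]';
frr       = [0.00 0.02 0.08 0.20 0.40 0.65 0.90]';

% On a discrete grid, take the threshold that minimizes |FAR - FRR|.
[~,idx] = min(abs(far - frr));
eerthresholdsketch = threshold(idx);
eer = mean([far(idx) frr(idx)]);
fprintf("EER threshold = %.2f, EER = %.2f\n",eerthresholdsketch,eer)
```

With these values the curves cross exactly at a threshold of 0.4, where both error rates are 0.2.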
The first time you call detectionErrorTradeoff, you must provide data and corresponding labels to evaluate. Subsequently, you can get the same information, or a different analysis using the same underlying data, by calling detectionErrorTradeoff without data and labels.
Call detectionErrorTradeoff a second time with no data arguments or output arguments to visualize the detection-error tradeoff.
detectionErrorTradeoff(speakerverification)
Call detectionErrorTradeoff again. This time, visualize only the detection-error tradeoff for the PLDA scorer.
detectionErrorTradeoff(speakerverification,Scorer="plda")
Depending on your application, you may want to use a threshold that weights the error cost of a false alarm higher or lower than the error cost of a false rejection. You may also be using data that is not representative of the prior probability of the speaker being present. You can use the MinDCF parameter to specify custom costs and prior probability. Call detectionErrorTradeoff again, this time specifying the cost of a false rejection as 1, the cost of a false acceptance as 2, and the prior probability that a speaker is present as 0.1.
costfr = 1;
costfa = 2;
priorprob = 0.1;
detectionErrorTradeoff(speakerverification,Scorer="plda",MinDCF=[costfr,costfa,priorprob])
Call detectionErrorTradeoff again. This time, get the minDCF threshold for the PLDA scorer and the parameters of the detection cost function.
[~,mindcfthreshold] = detectionErrorTradeoff(speakerverification,Scorer="plda",MinDCF=[costfr,costfa,priorprob])
mindcfthreshold = 0.4709
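The effect of the costs and the prior can be sketched with the commonly used detection cost function, DCF = Cfr*FRR*P + Cfa*FAR*(1 - P), where P is the prior probability that the speaker is present. Whether detectionErrorTradeoff uses exactly this form is an assumption here, and the FAR/FRR curves are made-up values.

```matlab
% Illustrative FAR/FRR curves (made-up values, not ivectorSystem output).
threshold = [0.1 0.2 0.3 0.4 0.5 0.6 0.7]';
far       = [0.90 0.60 0.35 0.20 0.10 0.04 0.00]';
frr       = [0.00 0.02 0.08 0.20 0.40 0.65 0.90]';

costfr = 1;        % cost of a false rejection
costfa = 2;        % cost of a false acceptance
priorprob = 0.1;   % prior probability that the speaker is present

% Assumed cost model: expected cost of each error type, weighted by the prior.
dcf = costfr*frr*priorprob + costfa*far*(1 - priorprob);
[mindcf,idx] = min(dcf);
fprintf("minDCF threshold = %.2f, cost = %.3f\n",threshold(idx),mindcf)
```

Because false acceptances are costly and genuine speakers are rare in this setup, the minimum-cost threshold (0.7) lands well above the point where FAR and FRR are equal (0.4).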
Test Speaker Verification System
Read a signal from the test set.
adstest = shuffle(adstest);
[audioin,audioinfo] = read(adstest);
knownspeakerid = audioinfo.Label
knownspeakerid = 1×1 cell array
    {'f01'}
To perform speaker verification, call verify with the audio signal and specify the speaker ID, a scorer, and a threshold for the scorer. The verify function returns a logical value indicating whether a speaker identity is accepted or rejected, and a score indicating the similarity of the input audio and the template i-vector corresponding to the enrolled label.
[tf,score] = verify(speakerverification,audioin,knownspeakerid,"plda",eerthreshold.PLDA);
if tf
    fprintf('Success!\nSpeaker accepted.\nSimilarity score = %0.2f\n\n',score)
else
    fprintf('Failure!\nSpeaker rejected.\nSimilarity score = %0.2f\n\n',score)
end
Success!
Speaker accepted.
Similarity score = 1.00
Call verify again. This time, specify an incorrect speaker ID.
possiblespeakers = speakerverification.EnrolledLabels.Properties.RowNames;
imposteridx = find(~ismember(possiblespeakers,knownspeakerid));
imposter = possiblespeakers(imposteridx(randperm(numel(imposteridx),1)))
imposter = 1×1 cell array
    {'m05'}
[tf,score] = verify(speakerverification,audioin,imposter,"plda",eerthreshold.PLDA);
if tf
    fprintf('Failure!\nSpeaker accepted.\nSimilarity score = %0.2f\n\n',score)
else
    fprintf('Success!\nSpeaker rejected.\nSimilarity score = %0.2f\n\n',score)
end
Success!
Speaker rejected.
Similarity score = 0.00
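The system also trains a CSS scorer. Assuming CSS stands for cosine similarity scoring, the raw comparison between a probe i-vector and an enrolled template can be sketched as below. The vectors and the 0.8 threshold are hypothetical, and the sketch omits the calibration step that maps raw scores to the 0-to-1 confidence values shown above.

```matlab
% Hypothetical 4-element i-vectors; real templates in this example are 16-by-1.
template = [0.20; -0.50; 0.70; 0.10];   % enrolled template
probe    = [0.25; -0.40; 0.65; 0.00];   % same speaker, new utterance
imposter = [-0.60; 0.30; -0.10; 0.80];  % different speaker

% Cosine similarity: the cosine of the angle between the two vectors, in [-1,1].
cosscore = @(a,b) dot(a,b)/(norm(a)*norm(b));

threshold = 0.8;   % hypothetical decision threshold on the raw score
acceptprobe = cosscore(probe,template) >= threshold
acceptimposter = cosscore(imposter,template) >= threshold
```

Because the score depends only on direction, scaling an i-vector does not change the decision.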
References
[1] Signal Processing and Speech Communication Laboratory. Accessed 12 Dec. 2019.
Train Speaker Identification System
Use the Census Database (also known as the AN4 Database) from the CMU Robust Speech Recognition Group [1]. The data set contains recordings of male and female subjects speaking words and numbers. The helper function in this example downloads the data set for you, converts the raw files to FLAC, and returns two audioDatastore objects containing the training set and test set. By default, the data set is reduced so that the example runs quickly. You can use the full data set by setting reducedataset to false.
[adstrain,adstest] = helperan4download(reducedataset=true);
Split the test data set into enroll and test sets. Use two utterances for enrollment and the remaining for the test set. Generally, the more utterances you use for enrollment, the better the performance of the system. However, most practical applications are limited to a small set of enrollment utterances.
[adsenroll,adstest] = splitEachLabel(adstest,2);
Inspect the distribution of speakers in the training, test, and enroll sets. The speakers in the training set do not overlap with the speakers in the test and enroll sets.
summary(adstrain.Labels)
     fejs      13
     fmjd      13
     fsrb      13
     ftmj      13
     fwxs      12
     mcen      13
     mrcb      13
     msjm      13
     msjr      13
     msmn       9
summary(adsenroll.Labels)
     fvap      2
     marh      2
summary(adstest.Labels)
     fvap      11
     marh      11
Create an i-vector system that accepts feature input.
fs = 16e3;
iv = ivectorSystem(SampleRate=fs,InputType="features");
Create an audioFeatureExtractor object to extract the gammatone cepstral coefficients (GTCC), the delta GTCC, the delta-delta GTCC, and the pitch from 50 ms periodic Hann windows with 45 ms overlap.
afe = audioFeatureExtractor(gtcc=true,gtccDelta=true,gtccDeltaDelta=true,pitch=true,SampleRate=fs);
afe.Window = hann(round(0.05*fs),"periodic");
afe.OverlapLength = round(0.045*fs);
afe
afe =
  audioFeatureExtractor with properties:
   Properties
                     Window: [800×1 double]
              OverlapLength: 720
                 SampleRate: 16000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 40
   Enabled Features
     gtcc, gtccDelta, gtccDeltaDelta, pitch
   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, spectralCentroid, spectralCrest, spectralDecrease, spectralEntropy, spectralFlatness
     spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope, spectralSpread
     harmonicRatio, zerocrossrate, shortTimeEnergy
   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
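The window and overlap settings determine how many feature vectors (hops) each signal yields. The arithmetic can be sketched as follows, assuming only full windows are analyzed with no padding; the 2-second signal length is a hypothetical value, and the 13-coefficient count per GTCC group is an assumption consistent with the feature vector length of 40 shown above.

```matlab
fs = 16e3;
winlength = round(0.05*fs);              % 50 ms window  -> 800 samples
overlaplength = round(0.045*fs);         % 45 ms overlap -> 720 samples
hoplength = winlength - overlaplength;   % 80 samples, or 5 ms between frames

numsamples = 2*fs;                       % hypothetical 2-second utterance
% Assumption: only full windows are analyzed, with no zero-padding at the end.
numhops = floor((numsamples - winlength)/hoplength) + 1;

% 13 GTCC + 13 delta + 13 delta-delta + 1 pitch value per hop.
featurevectorlength = 3*13 + 1;
fprintf("%d hops of %d features each\n",numhops,featurevectorlength)
```

Under these assumptions, a 2-second signal produces a 391-by-40 feature matrix, which is the hops-by-features layout the i-vector system expects for feature input.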
Create transformed datastores by adding feature extraction to the read function of adstrain and adsenroll.
trainlabels = adstrain.Labels;
adstrain = transform(adstrain,@(x)extract(afe,x));
enrolllabels = adsenroll.Labels;
adsenroll = transform(adsenroll,@(x)extract(afe,x));
Train both the extractor and classifier using the training set.
trainExtractor(iv,adstrain, ...
    UBMNumComponents=64, ...
    UBMNumIterations=5, ...
    TVSRank=32, ...
    TVSNumIterations=3);
Calculating standardization factors ....done.
Training universal background model ........done.
Training total variability space ......done.
i-vector extractor training complete.
trainClassifier(iv,adstrain,trainlabels, ...
    NumEigenvectors=16, ...
    PLDANumDimensions=16, ...
    PLDANumIterations=5);
Extracting i-vectors ...done.
Training projection matrix .....done.
Training PLDA model ........done.
i-vector classifier training complete.
To calibrate the system so that scores can be interpreted as a measure of confidence in a positive decision, use calibrate.
calibrate(iv,adstrain,trainlabels)
Extracting i-vectors ...done.
Calibrating CSS scorer ...done.
Calibrating PLDA scorer ...done.
Calibration complete.
Enroll the speakers from the enrollment set.
enroll(iv,adsenroll,enrolllabels)
Extracting i-vectors ...done.
Enrolling i-vectors .....done.
Enrollment complete.
Evaluate the file-level prediction accuracy on the test set.
numcorrect = 0;
reset(adstest)
for index = 1:numel(adstest.Files)
    features = extract(afe,read(adstest));
    results = identify(iv,features);
    truelabel = adstest.Labels(index);
    predictedlabel = results.Label(1);
    ispredictioncorrect = truelabel==predictedlabel;
    numcorrect = numcorrect + ispredictioncorrect;
end
display("File Accuracy: " + round(100*numcorrect/numel(adstest.Files),2) + " (%)")
    "File Accuracy: 100 (%)"
References
[1] "CMU Sphinx Group - Audio Databases." http://www.speech.cs.cmu.edu/databases/an4/. Accessed 19 Dec. 2019.
Train Environmental Sound Classification System
Download and unzip the environmental sound classification data set. This data set consists of recordings labeled as one of 10 different audio sound classes (ESC-10).
loc = matlab.internal.examples.downloadSupportFile("audio","esc-10.zip");
unzip(loc,pwd)
Create an audioDatastore object to manage the data and split it into training and validation sets. Call countEachLabel to display the distribution of sound classes and the number of unique labels.
ads = audioDatastore(pwd,IncludeSubfolders=true,LabelSource="foldernames");
countEachLabel(ads)
ans=10×2 table
label count
______________ _____
chainsaw 40
clock_tick 40
crackling_fire 40
crying_baby 40
dog 40
helicopter 40
rain 40
rooster 38
sea_waves 40
sneezing 40
Listen to one of the files.
[audioin,audioinfo] = read(ads);
fs = audioinfo.SampleRate;
sound(audioin,fs)
audioinfo.Label
ans = categorical
     chainsaw
Split the datastore into training and test sets.
[adstrain,adstest] = splitEachLabel(ads,0.8);
Create an audioFeatureExtractor to extract all possible features from the audio.
afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hamming(round(0.03*fs),"periodic"), ...
    OverlapLength=round(0.02*fs));
params = info(afe,"all");
params = structfun(@(x)true,params,UniformOutput=false);
set(afe,params);
afe
afe =
  audioFeatureExtractor with properties:
   Properties
                     Window: [1323×1 double]
              OverlapLength: 882
                 SampleRate: 44100
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 862
   Enabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
     spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
     spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
     shortTimeEnergy
   Disabled Features
     none
   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
Create two directories in your current folder: train and test. Extract features from the training and the test data sets and write the features as MAT files to the respective directories. Pre-extracting features can save time when you want to evaluate different feature combinations or training configurations.
if ~isfolder("train")
    mkdir("train")
    mkdir("test")
    outputtype = ".mat";
    writeall(adstrain,"train",WriteFcn=@(x,y,z)writefeatures(x,y,z,afe))
    writeall(adstest,"test",WriteFcn=@(x,y,z)writefeatures(x,y,z,afe))
end
Create signal datastores to point to the audio features.
sdstrain = signalDatastore("train",IncludeSubfolders=true);
sdstest = signalDatastore("test",IncludeSubfolders=true);
Create label arrays that are in the same order as the signalDatastore files.
labelstrain = categorical(extractBetween(sdstrain.Files,"esc-10"+filesep,filesep));
labelstest = categorical(extractBetween(sdstest.Files,"esc-10"+filesep,filesep));
Create a transform datastore from the signal datastores to isolate and use only the desired features. You can use the output from info on the audioFeatureExtractor to map your chosen features to the index in the features matrix. You can experiment with the example by choosing different features.
featureindices = info(afe)
featureindices = struct with fields:
linearspectrum: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 … ]
melspectrum: [663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694]
barkspectrum: [695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726]
erbspectrum: [727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769]
mfcc: [770 771 772 773 774 775 776 777 778 779 780 781 782]
mfccdelta: [783 784 785 786 787 788 789 790 791 792 793 794 795]
mfccdeltadelta: [796 797 798 799 800 801 802 803 804 805 806 807 808]
gtcc: [809 810 811 812 813 814 815 816 817 818 819 820 821]
gtccdelta: [822 823 824 825 826 827 828 829 830 831 832 833 834]
gtccdeltadelta: [835 836 837 838 839 840 841 842 843 844 845 846 847]
spectralcentroid: 848
spectralcrest: 849
spectraldecrease: 850
spectralentropy: 851
spectralflatness: 852
spectralflux: 853
spectralkurtosis: 854
spectralrolloffpoint: 855
spectralskewness: 856
spectralslope: 857
spectralspread: 858
pitch: 859
harmonicratio: 860
zerocrossrate: 861
shorttimeenergy: 862
idxtouse = [...
    featureindices.harmonicRatio ...
    ,featureindices.spectralRolloffPoint ...
    ,featureindices.spectralFlux ...
    ,featureindices.spectralSlope ...
    ];
tdstrain = transform(sdstrain,@(x)x(:,idxtouse));
tdstest = transform(sdstest,@(x)x(:,idxtouse));
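The column-selection step can be tried in isolation. The following is a minimal sketch using a hypothetical index map and a random feature matrix in place of the output of info and the stored MAT files; only the indexing pattern mirrors the example.

```matlab
% Hypothetical index map (stands in for the struct returned by info(afe)).
indexmap = struct('mfcc',1:13,'spectralFlux',14,'pitch',15,'harmonicRatio',16);

% Placeholder feature matrix: 50 hops (rows) by 16 features (columns).
features = rand(50,16);

% Concatenate the column indices of the features to keep, in the order wanted.
idxtokeep = [indexmap.harmonicRatio,indexmap.spectralFlux];

% The same anonymous function can be passed to transform on a datastore.
selectfeatures = @(x) x(:,idxtokeep);
selected = selectfeatures(features);
size(selected)
```

The order of the indices sets the column order of the output, so the same index vector must be used for training, enrollment, and test data.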
Create an i-vector system that accepts feature input.
soundclassifier = ivectorSystem(InputType="features");
Train the extractor and classifier using the training set.
trainExtractor(soundclassifier,tdstrain,UBMNumComponents=128,TVSRank=64);
Calculating standardization factors ....done.
Training universal background model .....done.
Training total variability space ......done.
i-vector extractor training complete.
trainClassifier(soundclassifier,tdstrain,labelstrain,NumEigenvectors=32,PLDANumIterations=0)
Extracting i-vectors ...done.
Training projection matrix .....done.
i-vector classifier training complete.
Enroll the labels from the training set to create i-vector templates for each of the environmental sounds.
enroll(soundclassifier,tdstrain,labelstrain)
Extracting i-vectors ...done.
Enrolling i-vectors .............done.
Enrollment complete.
Calibrate the i-vector system.
calibrate(soundclassifier,tdstrain,labelstrain)
Extracting i-vectors ...done.
Calibrating CSS scorer ...done.
Calibration complete.
Use the identify function on the test set to return the system's inferred label.
inferredlabels = labelstest;
inferredlabels(:) = inferredlabels(1);
for ii = 1:numel(labelstest)
    features = read(tdstest);
    tableout = identify(soundclassifier,features,"css",NumCandidates=1);
    inferredlabels(ii) = tableout.Label(1);
end
Create a confusion matrix to visualize performance on the test set.
uniquelabels = unique(labelstest);
cm = zeros(numel(uniquelabels),numel(uniquelabels));
for ii = 1:numel(uniquelabels)
    for jj = 1:numel(uniquelabels)
        cm(ii,jj) = sum((labelstest==uniquelabels(ii)) & (inferredlabels==uniquelabels(jj)));
    end
end
labelstrings = replace(string(uniquelabels),"_"," ");
heatmap(labelstrings,labelstrings,cm)
colorbar off
ylabel("True Labels")
xlabel("Predicted Labels")
accuracy = mean(inferredlabels==labelstest);
title(sprintf("Accuracy = %0.2f %%",accuracy*100))
Release the i-vector system.
release(soundclassifier)
Supporting Functions
function writefeatures(audioin,info,~,afe)
    % Convert to single precision.
    audioin = single(audioin);
    % Extract features.
    features = extract(afe,audioin);
    % Replace the file extension of the suggested output name with .mat.
    filename = strrep(info.SuggestedOutputName,".wav",".mat");
    % Save the extracted features to the MAT file.
    save(filename,"features")
end
Train Acoustic Fault Recognition System
Download and unzip the air compressor data set [1]. This data set consists of recordings from air compressors in a healthy state or one of seven faulty states.
loc = matlab.internal.examples.downloadSupportFile("audio", ...
    "aircompressordataset/aircompressordataset.zip");
unzip(loc,pwd)
Create an audioDatastore object to manage the data and split it into training and validation sets.
ads = audioDatastore(pwd,IncludeSubfolders=true,LabelSource="foldernames");
[adstrain,adstest] = splitEachLabel(ads,0.8,0.2);
Read an audio file from the datastore and save the sample rate. Listen to the audio signal and plot the signal in the time domain.
[x,fileinfo] = read(adstrain);
fs = fileinfo.SampleRate;
sound(x,fs)
t = (0:size(x,1)-1)/fs;
plot(t,x)
xlabel("Time (s)")
title("State = " + string(fileinfo.Label))
axis tight
Create an i-vector system with DetectSpeech set to false. Turn off the verbose behavior.
faultrecognizer = ivectorSystem(SampleRate=fs,DetectSpeech=false, ...
    Verbose=false)
faultrecognizer =
  ivectorSystem with properties:
         InputType: 'audio'
        SampleRate: 16000
      DetectSpeech: 0
           Verbose: 0
    EnrolledLabels: [0×2 table]
Train the i-vector extractor and the i-vector classifier using the training datastore.
trainExtractor(faultrecognizer,adstrain, ...
    UBMNumComponents=80, ...
    UBMNumIterations=3, ...
    TVSRank=40, ...
    TVSNumIterations=3)
trainClassifier(faultrecognizer,adstrain,adstrain.Labels, ...
    NumEigenvectors=7, ...
    PLDANumDimensions=32, ...
    PLDANumIterations=5)
Calibrate the scores output by faultrecognizer so they can be interpreted as a measure of confidence in a positive decision. Turn the verbose behavior back on. Enroll all of the labels from the training set.
calibrate(faultrecognizer,adstrain,adstrain.Labels)
faultrecognizer.Verbose = true;
enroll(faultrecognizer,adstrain,adstrain.Labels)
extracting i-vectors ...done. enrolling i-vectors ...........done. enrollment complete.
Use the read-only property EnrolledLabels to view the enrolled labels and the corresponding i-vector templates.
faultrecognizer.EnrolledLabels
ans=8×2 table
ivector numsamples
____________ __________
bearing {7×1 double} 180
flywheel {7×1 double} 180
healthy {7×1 double} 180
liv {7×1 double} 180
lov {7×1 double} 180
nrv {7×1 double} 180
piston {7×1 double} 180
riderbelt {7×1 double} 180
Use the identify function with the PLDA scorer to predict the condition of machines in the test set. The identify function returns a table of possible labels sorted in descending order of confidence.
[audioin,audioinfo] = read(adstest);
truelabel = audioinfo.Label
truelabel = categorical
     bearing
predictedlabels = identify(faultrecognizer,audioin,"plda")
predictedlabels=8×2 table
label score
_________ __________
bearing 0.99997
flywheel 2.265e-05
piston 8.6076e-08
liv 1.4237e-15
nrv 4.5529e-16
riderbelt 3.7359e-16
lov 6.3025e-19
healthy 4.2094e-30
By default, the identify function returns all possible candidate labels and their corresponding scores. Use NumCandidates to reduce the number of candidates returned.
results = identify(faultrecognizer,audioin,"plda",NumCandidates=3)
results=3×2 table
label score
________ __________
bearing 0.99997
flywheel 2.265e-05
piston 8.6076e-08
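The shape of this output, a table ranked by score and truncated to the top candidates, can be reproduced with base MATLAB table operations. The labels and scores below are made-up values, and the sketch only illustrates the ranking and truncation, not how identify computes its scores.

```matlab
% Made-up candidate labels and scores, for illustration only.
labels = categorical(["bearing";"flywheel";"healthy";"piston"]);
scores = [0.12; 0.81; 0.02; 0.05];

% Rank candidates from most to least likely.
candidates = table(labels,scores,VariableNames=["Label","Score"]);
candidates = sortrows(candidates,"Score","descend");

% Keep only the top candidates, as a candidate limit would.
numtokeep = 2;
top = candidates(1:numtokeep,:)
```

The first row of the ranked table is the predicted label, which is why the examples read the prediction with Label(1).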
References
[1] Verma, Nishchal K., et al. “Intelligent Condition Based Monitoring Using Acoustic Signals for Air Compressors.” IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org (Crossref), doi:10.1109/TR.2015.2459684.
Train Speech Emotion Recognition System
Download the Berlin Database of Emotional Speech. The database contains 535 utterances spoken by 10 actors intended to convey one of the following emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. The emotions are text independent.
url = "http://emodb.bilderbar.info/download/download.zip";
downloadfolder = tempdir;
datasetfolder = fullfile(downloadfolder,"emo-db");
if ~exist(datasetfolder,"dir")
    disp("Downloading Emo-DB (40.5 MB) ...")
    unzip(url,datasetfolder)
end
Create an audioDatastore that points to the audio files.
ads = audioDatastore(fullfile(datasetfolder,"wav"));
The file names are codes indicating the speaker ID, text spoken, emotion, and version. The website contains a key for interpreting the code and additional information about the speakers, such as gender and age. Create a table with the variables speaker and emotion. Decode the file names into the table.
filepaths = ads.Files;
emotioncodes = cellfun(@(x)x(end-5),filepaths,"UniformOutput",false);
emotions = replace(emotioncodes,{'W','L','E','A','F','T','N'}, ...
    {'anger','boredom','disgust','anxiety','happiness','sadness','neutral'});
speakercodes = cellfun(@(x)x(end-10:end-9),filepaths,"UniformOutput",false);
labeltable = table(categorical(speakercodes),categorical(emotions),VariableNames=["speaker","emotion"]);
summary(labeltable)
Variables:
    speaker: 535×1 categorical
        Values:
            03    49
            08    58
            09    43
            10    38
            11    55
            12    35
            13    61
            14    69
            15    56
            16    71
    emotion: 535×1 categorical
        Values:
            anger        127
            anxiety       69
            boredom       81
            disgust       46
            happiness     71
            neutral       79
            sadness       62
labeltable is in the same order as the files in audioDatastore. set the Labels property of the audioDatastore to labeltable.
ads.Labels = labeltable;
read a signal from the datastore and listen to it. display the speaker id and emotion of the audio signal.
[audioin,audioinfo] = read(ads);
fs = audioinfo.SampleRate;
sound(audioin,fs)
audioinfo.Label
ans=1×2 table
speaker emotion
_______ _________
03 happiness
split the datastore into a training set and a test set. assign two speakers to the test set and the remaining to the training set.
testspeakeridx = ads.Labels.speaker=="12" | ads.Labels.speaker=="13";
adstrain = subset(ads,~testspeakeridx);
adstest = subset(ads,testspeakeridx);
read all the training and testing audio data into cell arrays. if your data fits in memory, training is usually faster when you pass cell arrays to an i-vector system rather than datastores.
trainset = readall(adstrain);
trainlabels = adstrain.Labels.emotion;
testset = readall(adstest);
testlabels = adstest.Labels.emotion;
create an i-vector system that does not apply speech detection. when DetectSpeech is set to true (the default), only regions of detected speech are used to train the i-vector system. when DetectSpeech is set to false, the entire input audio is used to train the i-vector system. the usefulness of applying speech detection depends on the data input to the system.
emotionrecognizer = ivectorSystem(SampleRate=fs,DetectSpeech=false)
emotionrecognizer =
  ivectorsystem with properties:
         inputtype: 'audio'
        samplerate: 16000
      detectspeech: 0
           verbose: 1
    enrolledlabels: [0×2 table]
call trainExtractor using the training set.
rng default
trainExtractor(emotionrecognizer,trainset, ...
    UBMNumComponents=256, ...
    UBMNumIterations=5, ...
    TVSRank=128, ...
    TVSNumIterations=5);
calculating standardization factors .....done.
training universal background model ........done.
training total variability space ........done.
i-vector extractor training complete.
copy the emotion recognition system for use later in the example.
sentimentrecognizer = copy(emotionrecognizer);
call trainClassifier using the training set.
rng default
trainClassifier(emotionrecognizer,trainset,trainlabels, ...
    NumEigenvectors=32, ...
    PLDANumDimensions=16, ...
    PLDANumIterations=10);
extracting i-vectors ...done.
training projection matrix .....done.
training plda model .............done.
i-vector classifier training complete.
call calibrate using the training set. in practice, the calibration set should be different from the training set.
calibrate(emotionrecognizer,trainset,trainlabels)
extracting i-vectors ...done.
calibrating css scorer ...done.
calibrating plda scorer ...done.
calibration complete.
enroll the training labels into the i-vector system.
enroll(emotionrecognizer,trainset,trainlabels)
extracting i-vectors ...done.
enrolling i-vectors ..........done.
enrollment complete.
you can use detectionErrorTradeoff as a quick sanity check on the performance of a multilabel closed-set classification system. however, detectionErrorTradeoff provides information more suitable to open-set binary classification problems, for example, speaker verification tasks.
detectionErrorTradeoff(emotionrecognizer,testset,testlabels)
extracting i-vectors ...done.
scoring i-vector pairs ...done.
detection error tradeoff evaluation complete.
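to make the tradeoff concrete, the following minimal sketch sweeps a decision threshold over two small sets of scores and computes the false rejection rate (frr) and false acceptance rate (far) at each threshold, then approximates the equal error rate. the scores are synthetic and all variable names are hypothetical; this is not how detectionErrorTradeoff is implemented, only an illustration of the quantities it reports.

```matlab
% Synthetic scores, assumed for illustration only.
targetscores    = [0.9 0.8 0.75 0.6 0.4];   % trials where the labels match
nontargetscores = [0.7 0.5 0.3 0.2 0.1];    % trials where the labels differ

thresholds = linspace(0,1,101);
frr = zeros(size(thresholds));
far = zeros(size(thresholds));
for k = 1:numel(thresholds)
    frr(k) = mean(targetscores < thresholds(k));      % matching trials rejected
    far(k) = mean(nontargetscores >= thresholds(k));  % nonmatching trials accepted
end

% Approximate the equal error rate at the threshold where the two rates are closest.
[~,idx] = min(abs(far - frr));
eer = (far(idx) + frr(idx))/2;
eerthreshold = thresholds(idx);
```

raising the threshold lowers far and raises frr, which is why the tradeoff is most natural for accept/reject decisions such as verification.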
for a more detailed view of the i-vector system's performance in a multilabel closed set application, you can use the identify function and create a confusion matrix. the confusion matrix enables you to identify which emotions are misidentified and what they are misidentified as. use the supporting function plotconfusion to display the results.
truelabels = testlabels;
predictedlabels = truelabels;
scorer = "plda";
for ii = 1:numel(testset)
    tableout = identify(emotionrecognizer,testset{ii},scorer);
    % the first row of the returned table is the highest-scoring candidate
    predictedlabels(ii) = tableout.Label(1);
end
plotconfusion(truelabels,predictedlabels)
call info to inspect how emotionrecognizer was trained and evaluated.
info(emotionrecognizer)
i-vector system input
  input feature vector length: 60
  input data type: double
trainextractor
  train signals: 439
  ubmnumcomponents: 256
  ubmnumiterations: 5
  tvsrank: 128
  tvsnumiterations: 5
trainclassifier
  train signals: 439
  train labels: anger (103), anxiety (56) ... and 5 more
  numeigenvectors: 32
  pldanumdimensions: 16
  pldanumiterations: 10
calibrate
  calibration signals: 439
  calibration labels: anger (103), anxiety (56) ... and 5 more
detectionerrortradeoff
  evaluation signals: 96
  evaluation labels: anger (24), anxiety (13) ... and 5 more
next, modify the i-vector system to recognize emotions as positive, neutral, or negative. update the labels to only include the categories negative, positive, and neutral.
trainlabelssentiment = trainlabels;
trainlabelssentiment(ismember(trainlabels,categorical(["anger","anxiety","boredom","sadness","disgust"]))) = categorical("negative");
trainlabelssentiment(ismember(trainlabels,categorical("happiness"))) = categorical("positive");
trainlabelssentiment = removecats(trainlabelssentiment);
testlabelssentiment = testlabels;
testlabelssentiment(ismember(testlabels,categorical(["anger","anxiety","boredom","sadness","disgust"]))) = categorical("negative");
testlabelssentiment(ismember(testlabels,categorical("happiness"))) = categorical("positive");
testlabelssentiment = removecats(testlabelssentiment);
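the same kind of relabeling can also be sketched with mergecats and renamecats, which operate on the categories directly instead of on logical masks. the small emotion array below is a hypothetical stand-in for the label arrays in this example, and this is an alternative shown under that assumption, not the approach used above.

```matlab
% Hypothetical stand-in for the emotion labels used in this example.
emotion = categorical(["anger";"happiness";"neutral";"sadness";"anxiety";"boredom";"disgust"]);

% Merge the five negative emotions into a single "negative" category.
sentiment = mergecats(emotion,["anger","anxiety","boredom","sadness","disgust"],"negative");

% Rename the remaining emotion category to its sentiment.
sentiment = renamecats(sentiment,"happiness","positive");

summary(sentiment)
```

because the categories themselves are merged, no unused categories are left behind and a separate call to removecats is not needed in this sketch.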
train the i-vector system classifier using the updated labels. you do not need to retrain the extractor. recalibrate the system.
rng default
trainClassifier(sentimentrecognizer,trainset,trainlabelssentiment, ...
    NumEigenvectors=64, ...
    PLDANumDimensions=32, ...
    PLDANumIterations=10);
extracting i-vectors ...done.
training projection matrix .....done.
training plda model .............done.
i-vector classifier training complete.
calibrate(sentimentrecognizer,trainset,trainlabels)
extracting i-vectors ...done.
calibrating css scorer ...done.
calibrating plda scorer ...done.
calibration complete.
enroll the training labels into the system and then plot the confusion matrix for the test set.
enroll(sentimentrecognizer,trainset,trainlabelssentiment)
extracting i-vectors ...done.
enrolling i-vectors ......done.
enrollment complete.
truelabels = testlabelssentiment;
predictedlabels = truelabels;
scorer = "plda";
for ii = 1:numel(testset)
    tableout = identify(sentimentrecognizer,testset{ii},scorer);
    predictedlabels(ii) = tableout.Label(1);
end
plotconfusion(truelabels,predictedlabels)
an i-vector system does not require the labels used to train the classifier to be equal to the enrolled labels.
unenroll the sentiment labels from the system and then enroll the original emotion categories in the system. analyze the system's classification performance.
unenroll(sentimentrecognizer)
enroll(sentimentrecognizer,trainset,trainlabels)
extracting i-vectors ...done.
enrolling i-vectors ..........done.
enrollment complete.
truelabels = testlabels;
predictedlabels = truelabels;
scorer = "plda";
for ii = 1:numel(testset)
    tableout = identify(sentimentrecognizer,testset{ii},scorer);
    predictedlabels(ii) = tableout.Label(1);
end
plotconfusion(truelabels,predictedlabels)
supporting functions
function plotconfusion(truelabels,predictedlabels)
% count how often each true label (row) is predicted as each label (column)
uniquelabels = unique(truelabels);
cm = zeros(numel(uniquelabels),numel(uniquelabels));
for ii = 1:numel(uniquelabels)
    for jj = 1:numel(uniquelabels)
        cm(ii,jj) = sum((truelabels==uniquelabels(ii)) & (predictedlabels==uniquelabels(jj)));
    end
end
% display the counts as a heatmap and report the overall accuracy in the title
heatmap(uniquelabels,uniquelabels,cm)
colorbar off
ylabel('true labels')
xlabel('predicted labels')
accuracy = mean(truelabels==predictedlabels);
title(sprintf("accuracy = %0.2f %%",accuracy*100))
end
references
[1] burkhardt, f., a. paeschke, m. rolfes, w.f. sendlmeier, and b. weiss, "a database of german emotional speech." in proceedings interspeech 2005. lisbon, portugal: international speech communication association, 2005.
train word recognition system
an i-vector system consists of a trainable front end that learns how to extract i-vectors based on unlabeled data, and a trainable backend that learns how to classify i-vectors based on labeled data. in this example, you apply an i-vector system to the task of word recognition. first, evaluate the accuracy of the i-vector system using the classifiers included in a traditional i-vector system: probabilistic linear discriminant analysis (plda) and cosine similarity scoring (css). next, evaluate the accuracy of the system if you replace the classifier with a fully connected deep learning network or a k-nearest neighbors (knn) classifier.
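as a minimal sketch of the idea behind cosine similarity scoring, the code below compares a test vector against one template vector per enrolled label and picks the label with the largest cosine similarity. the vectors, the label names, and the variable names are all hypothetical, and the result is a raw cosine similarity rather than the calibrated score that the i-vector system returns.

```matlab
% Hypothetical 3-element "i-vectors"; real i-vectors come from a trained extractor.
enrolled = [1 0; 0 1; 1 1];      % one column per enrolled label
labels = ["zero","one"];         % label for each column of enrolled
testvector = [0.9; 0.1; 1.1];

% Cosine similarity between the test vector and each enrolled template.
css = (enrolled'*testvector) ./ (vecnorm(enrolled)'*norm(testvector));

% The predicted label is the template with the largest similarity.
[~,best] = max(css);
predicted = labels(best)
```

cosine similarity depends only on the angle between the vectors, so scaling either vector leaves the score unchanged.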
create training and validation sets
download the free spoken digit dataset (fsdd) [1]. fsdd consists of short audio files with spoken digits (0-9).
loc = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
unzip(loc,pwd)
create an audioDatastore to point to the recordings. get the sample rate of the data set.
ads = audioDatastore(pwd,IncludeSubfolders=true);
[~,adsinfo] = read(ads);
fs = adsinfo.SampleRate;
the first element of the file names is the digit spoken in the file. get the first element of the file names, convert them to categorical, and then set the Labels property of the audioDatastore.
[~,filenames] = cellfun(@(x)fileparts(x),ads.Files,UniformOutput=false);
ads.Labels = categorical(string(cellfun(@(x)x(1),filenames)));
to split the datastore into a development set and a validation set, use splitEachLabel. allocate 80% of the data for development and the remaining 20% for validation.
[adstrain,adsvalidation] = splitEachLabel(ads,0.8);
evaluate traditional i-vector backend performance
create an i-vector system that expects audio input at a sample rate of 8 khz and does not perform speech detection.
wordrecognizer = ivectorSystem(DetectSpeech=false,SampleRate=fs)
wordrecognizer =
  ivectorsystem with properties:
         inputtype: 'audio'
        samplerate: 8000
      detectspeech: 0
           verbose: 1
    enrolledlabels: [0×2 table]
train the i-vector extractor using the data in the training set.
trainExtractor(wordrecognizer,adstrain, ...
    UBMNumComponents=64, ...
    UBMNumIterations=5, ...
    TVSRank=32, ...
    TVSNumIterations=5);
calculating standardization factors ....done.
training universal background model ........done.
training total variability space ........done.
i-vector extractor training complete.
train the i-vector classifier using the data in the training data set and the corresponding labels.
trainClassifier(wordrecognizer,adstrain,adstrain.Labels, ...
    NumEigenvectors=10, ...
    PLDANumDimensions=10, ...
    PLDANumIterations=5);
extracting i-vectors ...done.
training projection matrix .....done.
training plda model ........done.
i-vector classifier training complete.
calibrate the scores output by wordrecognizer so they can be interpreted as a measure of confidence in a positive decision. enroll labels into the system using the entire training set.
calibrate(wordrecognizer,adstrain,adstrain.Labels)
extracting i-vectors ...done.
calibrating css scorer ...done.
calibrating plda scorer ...done.
calibration complete.
enroll(wordrecognizer,adstrain,adstrain.Labels)
extracting i-vectors ...done.
enrolling i-vectors .............done.
enrollment complete.
in a loop, read audio from the validation datastore, identify the most-likely word present according to the specified scorer, and save the prediction for analysis.
truelabels = adsvalidation.Labels;
predictedlabels = truelabels;
reset(adsvalidation)
scorer = "plda";
for ii = 1:numel(truelabels)
    audioin = read(adsvalidation);
    to = identify(wordrecognizer,audioin,scorer);
    % keep the highest-scoring candidate as the prediction
    predictedlabels(ii) = to.Label(1);
end
display a confusion chart of the i-vector system's performance on the validation set.
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(truelabels,predictedlabels, ...
    ColumnSummary="column-normalized", ...
    RowSummary="row-normalized", ...
    Title=sprintf('accuracy = %0.2f (%%)',100*mean(predictedlabels==truelabels)))
evaluate deep learning backend performance
next, train a fully-connected network using i-vectors as input.
ivectorstrain = (ivector(wordrecognizer,adstrain))';
ivectorsvalidation = (ivector(wordrecognizer,adsvalidation))';
define a fully connected network.
layers = [ ...
    featureInputLayer(size(ivectorstrain,2),Normalization="none")
    fullyConnectedLayer(128)
    dropoutLayer(0.4)
    fullyConnectedLayer(256)
    dropoutLayer(0.4)
    fullyConnectedLayer(256)
    dropoutLayer(0.4)
    fullyConnectedLayer(128)
    dropoutLayer(0.4)
    fullyConnectedLayer(numel(unique(adstrain.Labels)))
    softmaxLayer
    classificationLayer];
define training parameters.
minibatchsize = 256;
validationfrequency = floor(numel(adstrain.Labels)/minibatchsize);
options = trainingOptions("adam", ...
    MaxEpochs=10, ...
    MiniBatchSize=minibatchsize, ...
    Plots="training-progress", ...
    Verbose=false, ...
    Shuffle="every-epoch", ...
    ValidationData={ivectorsvalidation,adsvalidation.Labels}, ...
    ValidationFrequency=validationfrequency);
train the network.
net = trainNetwork(ivectorstrain,adstrain.Labels,layers,options);
evaluate the performance of the deep learning backend using a confusion chart.
predictedlabels = classify(net,ivectorsvalidation);
truelabels = adsvalidation.Labels;
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(truelabels,predictedlabels, ...
    ColumnSummary="column-normalized", ...
    RowSummary="row-normalized", ...
    Title=sprintf('accuracy = %0.2f (%%)',100*mean(predictedlabels==truelabels)))
evaluate knn backend performance
train and evaluate i-vectors with a k-nearest neighbor (knn) backend.
use fitcknn to train a knn model.
classificationknn = fitcknn( ...
    ivectorstrain, ...
    adstrain.Labels, ...
    Distance="euclidean", ...
    Exponent=[], ...
    NumNeighbors=10, ...
    DistanceWeight="squaredinverse", ...
    Standardize=true, ...
    ClassNames=unique(adstrain.Labels));
evaluate the knn backend.
predictedlabels = predict(classificationknn,ivectorsvalidation);
truelabels = adsvalidation.Labels;
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(truelabels,predictedlabels, ...
    ColumnSummary="column-normalized", ...
    RowSummary="row-normalized", ...
    Title=sprintf('accuracy = %0.2f (%%)',100*mean(predictedlabels==truelabels)))
references
[1] jakobovski. "jakobovski/free-spoken-digit-dataset." github, may 30, 2019. https://github.com/jakobovski/free-spoken-digit-dataset.
references
[1] reynolds, douglas a., et al. “speaker verification using adapted gaussian mixture models.” digital signal processing, vol. 10, no. 1–3, jan. 2000, pp. 19–41. doi.org (crossref), doi:10.1006/dspr.1999.0361.
[2] kenny, patrick, et al. “joint factor analysis versus eigenchannels in speaker recognition.” ieee transactions on audio, speech and language processing, vol. 15, no. 4, may 2007, pp. 1435–47. doi.org (crossref), doi:10.1109/tasl.2006.881693.
[3] kenny, p., et al. “a study of interspeaker variability in speaker verification.” ieee transactions on audio, speech, and language processing, vol. 16, no. 5, july 2008, pp. 980–88. doi.org (crossref), doi:10.1109/tasl.2008.925147.
[4] dehak, najim, et al. “front-end factor analysis for speaker verification.” ieee transactions on audio, speech, and language processing, vol. 19, no. 4, may 2011, pp. 788–98. doi.org (crossref), doi:10.1109/tasl.2010.2064307.
[5] matejka, pavel, ondrej glembek, fabio castaldo, m. j. alam, oldrich plchot, patrick kenny, lukas burget, and jan cernocky. “full-covariance ubm and heavy-tailed plda in i-vector speaker verification.” 2011 ieee international conference on acoustics, speech and signal processing (icassp), 2011. https://doi.org/10.1109/icassp.2011.5947436.
[6] snyder, david, et al. “x-vectors: robust dnn embeddings for speaker recognition.” 2018 ieee international conference on acoustics, speech and signal processing (icassp), ieee, 2018, pp. 5329–33. doi.org (crossref), doi:10.1109/icassp.2018.8461375.
[7] signal processing and speech communication laboratory. accessed december 12, 2019.
[8] variani, ehsan, et al. “deep neural networks for small footprint text-dependent speaker verification.” 2014 ieee international conference on acoustics, speech and signal processing (icassp), ieee, 2014, pp. 4052–56. doi.org (crossref), doi:10.1109/icassp.2014.6854363.
[9] dehak, najim, réda dehak, james r. glass, douglas a. reynolds and patrick kenny. “cosine similarity scoring without score normalization techniques.” odyssey (2010).
[10] verma, pulkit, and pradip k. das. “i-vectors in speech processing applications: a survey.” international journal of speech technology, vol. 18, no. 4, dec. 2015, pp. 529–46. doi.org (crossref), doi:10.1007/s10772-015-9295-3.
[11] d. garcía-romero and c. espy-wilson, “analysis of i-vector length normalization in speaker recognition systems.” interspeech, 2011, pp. 249–252.
[12] kenny, patrick. "bayesian speaker verification with heavy-tailed priors". odyssey 2010 - the speaker and language recognition workshop, brno, czech republic, 2010.
[13] sizov, aleksandr, kong aik lee, and tomi kinnunen. “unifying probabilistic linear discriminant analysis variants in biometric authentication.” lecture notes in computer science structural, syntactic, and statistical pattern recognition, 2014, 464–75.
[14] rajan, padmanabhan, anton afanasyev, ville hautamäki, and tomi kinnunen. “from single to multiple enrollment i-vectors: practical plda scoring variants for speaker verification.” digital signal processing 31 (august), 2014, pp. 93–101.
version history
introduced in r2021a