Generate Text Using Deep Learning

This example shows how to train a deep learning long short-term memory (LSTM) network to generate text.

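To train a deep learning network for text generation, train a sequence-to-sequence LSTM network to predict the next character in a sequence of characters. To train the network to predict the next character, specify the input sequences shifted by one time step as the responses.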
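As a minimal sketch of the one-step shift, the following pairs each input character with the character the network is asked to predict. The string 'rose' and the variable names are hypothetical; the end-of-text marker (U+2403) is the one introduced later in this example.

```matlab
% Toy illustration of the one-step shift (the string is hypothetical).
characters = 'rose';
endoftext = char(9219);                    % U+2403, end-of-text marker
inputs  = characters;                      % what the network sees at steps 1..S
targets = [characters(2:end) endoftext];   % what it must predict at steps 1..S

% Print each input character next to its target.
for t = 1:numel(inputs)
    fprintf('%s -> %s\n',inputs(t),targets(t));
end
```

In the full example, each input sequence additionally begins with a start-of-text character, so the first target is the first real character of the sonnet.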

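To input a sequence of characters into an LSTM network, convert each training observation to a sequence of characters represented by vectors $x \in \mathbb{R}^{D}$, where $D$ is the number of unique characters in the vocabulary. For each vector, $x_i = 1$ if $x$ corresponds to the character with index $i$ in the given vocabulary, and $x_j = 0$ for $j \neq i$.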
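The encoding described above can be sketched for a single character as follows. The three-character vocabulary is hypothetical and only serves to make the vector small enough to read.

```matlab
vocabulary = 'abc';                 % hypothetical vocabulary, D = 3
[~,i] = ismember('b',vocabulary);   % index of the character in the vocabulary
x = zeros(numel(vocabulary),1);     % D-by-1 vector of zeros
x(i) = 1;                           % x is now [0; 1; 0]
```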

Load Training Data

Extract the text data from the text file sonnets.txt.

filename = "sonnets.txt";
textdata = fileread(filename);

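The sonnets are indented by two whitespace characters and separated by two newline characters. Remove the indentations using replace and split the text into separate sonnets using split. Remove the main title from the first three elements and the sonnet titles, which appear before each sonnet.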

textdata = replace(textdata,"  ","");
textdata = split(textdata,[newline newline]);
textdata = textdata(5:2:end);
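To see why the index 5:2:end keeps only the sonnets, consider a miniature text with the same assumed layout: three header blocks followed by alternating title and body blocks. The block contents below are hypothetical and are not taken from sonnets.txt.

```matlab
% Hypothetical miniature file: three header blocks, then title/body pairs.
nn = [newline newline];
toytext = ['THE SONNETS' nn 'by' nn 'AUTHOR' nn ...
           'I' nn 'first poem' nn 'II' nn 'second poem'];

parts = split(toytext,nn);   % 7-by-1 cell array of blocks
poems = parts(5:2:end);      % skips the headers and every title block
```

Here poems contains only 'first poem' and 'second poem'. If a file uses a different header layout, the starting index needs to change accordingly.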

View the first few observations.

textdata(1:10)
ans = 10×1 cell array
    {'from fairest creatures we desire increase,↵that thereby beauty's rose might never die,↵but as the riper should by time decease,↵his tender heir might bear his memory:↵but thou, contracted to thine own bright eyes,↵feed'st thy light's flame with self-substantial fuel,↵making a famine where abundance lies,↵thy self thy foe, to thy sweet self too cruel:↵thou that art now the world's fresh ornament,↵and only herald to the gaudy spring,↵within thine own bud buriest thy content,↵and tender churl mak'st waste in niggarding:↵pity the world, or else this glutton be,↵to eat the world's due, by the grave and thee.'                                 }
    {'when forty winters shall besiege thy brow,↵and dig deep trenches in thy beauty's field,↵thy youth's proud livery so gazed on now,↵will be a tatter'd weed of small worth held:↵then being asked, where all thy beauty lies,↵where all the treasure of thy lusty days;↵to say, within thine own deep sunken eyes,↵were an all-eating shame, and thriftless praise.↵how much more praise deserv'd thy beauty's use,↵if thou couldst answer 'this fair child of mine↵shall sum my count, and make my old excuse,'↵proving his beauty by succession thine!↵this were to be new made when thou art old,↵and see thy blood warm when thou feel'st it cold.'               }
    {'look in thy glass and tell the face thou viewest↵now is the time that face should form another;↵whose fresh repair if now thou not renewest,↵thou dost beguile the world, unbless some mother.↵for where is she so fair whose unear'd womb↵disdains the tillage of thy husbandry?↵or who is he so fond will be the tomb,↵of his self-love to stop posterity?↵thou art thy mother's glass and she in thee↵calls back the lovely april of her prime;↵so thou through windows of thine age shalt see,↵despite of wrinkles this thy golden time.↵but if thou live, remember'd not to be,↵die single and thine image dies with thee.'                                    }
    {'unthrifty loveliness, why dost thou spend↵upon thy self thy beauty's legacy?↵nature's bequest gives nothing, but doth lend,↵and being frank she lends to those are free:↵then, beauteous niggard, why dost thou abuse↵the bounteous largess given thee to give?↵profitless usurer, why dost thou use↵so great a sum of sums, yet canst not live?↵for having traffic with thy self alone,↵thou of thy self thy sweet self dost deceive:↵then how when nature calls thee to be gone,↵what acceptable audit canst thou leave?↵thy unused beauty must be tombed with thee,↵which, used, lives th' executor to be.'                                                      }
    {'those hours, that with gentle work did frame↵the lovely gaze where every eye doth dwell,↵will play the tyrants to the very same↵and that unfair which fairly doth excel;↵for never-resting time leads summer on↵to hideous winter, and confounds him there;↵sap checked with frost, and lusty leaves quite gone,↵beauty o'er-snowed and bareness every where:↵then were not summer's distillation left,↵a liquid prisoner pent in walls of glass,↵beauty's effect with beauty were bereft,↵nor it, nor no remembrance what it was:↵but flowers distill'd, though they with winter meet,↵leese but their show; their substance still lives sweet.'                   }
    {'then let not winter's ragged hand deface,↵in thee thy summer, ere thou be distill'd:↵make sweet some vial; treasure thou some place↵with beauty's treasure ere it be self-kill'd.↵that use is not forbidden usury,↵which happies those that pay the willing loan;↵that's for thy self to breed another thee,↵or ten times happier, be it ten for one;↵ten times thy self were happier than thou art,↵if ten of thine ten times refigur'd thee:↵then what could death do if thou shouldst depart,↵leaving thee living in posterity?↵be not self-will'd, for thou art much too fair↵to be death's conquest and make worms thine heir.'                                }
    {'lo! in the orient when the gracious light↵lifts up his burning head, each under eye↵doth homage to his new-appearing sight,↵serving with looks his sacred majesty;↵and having climb'd the steep-up heavenly hill,↵resembling strong youth in his middle age,↵yet mortal looks adore his beauty still,↵attending on his golden pilgrimage:↵but when from highmost pitch, with weary car,↵like feeble age, he reeleth from the day,↵the eyes, 'fore duteous, now converted are↵from his low tract, and look another way:↵so thou, thyself outgoing in thy noon:↵unlook'd, on diest unless thou get a son.'                                                            }
    {'music to hear, why hear'st thou music sadly?↵sweets with sweets war not, joy delights in joy:↵why lov'st thou that which thou receiv'st not gladly,↵or else receiv'st with pleasure thine annoy?↵if the true concord of well-tuned sounds,↵by unions married, do offend thine ear,↵they do but sweetly chide thee, who confounds↵in singleness the parts that thou shouldst bear.↵mark how one string, sweet husband to another,↵strikes each in each by mutual ordering;↵resembling sire and child and happy mother,↵who, all in one, one pleasing note do sing:↵whose speechless song being many, seeming one,↵sings this to thee: 'thou single wilt prove none.''}
    {'is it for fear to wet a widow's eye,↵that thou consum'st thy self in single life?↵ah! if thou issueless shalt hap to die,↵the world will wail thee like a makeless wife;↵the world will be thy widow and still weep↵that thou no form of thee hast left behind,↵when every private widow well may keep↵by children's eyes, her husband's shape in mind:↵look! what an unthrift in the world doth spend↵shifts but his place, for still the world enjoys it;↵but beauty's waste hath in the world an end,↵and kept unused the user so destroys it.↵no love toward others in that bosom sits↵that on himself such murd'rous shame commits.'                           }
    {'for shame! deny that thou bear'st love to any,↵who for thy self art so unprovident.↵grant, if thou wilt, thou art belov'd of many,↵but that thou none lov'st is most evident:↵for thou art so possess'd with murderous hate,↵that 'gainst thy self thou stick'st not to conspire,↵seeking that beauteous roof to ruinate↵which to repair should be thy chief desire.↵o! change thy thought, that i may change my mind:↵shall hate be fairer lodg'd than gentle love?↵be, as thy presence is, gracious and kind,↵or to thyself at least kind-hearted prove:↵make thee another self for love of me,↵that beauty still may live in thine or thee.'                     }

Convert Text Data to Sequences

Convert the text data to sequences of vectors for the predictors and categorical sequences for the responses.

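Create special characters to represent "start of text", "whitespace", "end of text", and "newline". Use the special characters "\x0002" (start of text), "\x00b7" ("·", middle dot), "\x2403" ("␃", end of text), and "\x00b6" ("¶", pilcrow), respectively. To prevent ambiguity, you must choose special characters that do not appear in the text. These characters do not appear in the training data, so they can be used for this purpose.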

startoftextcharacter = compose("\x0002");
whitespacecharacter = compose("\x00b7");
endoftextcharacter = compose("\x2403");
newlinecharacter = compose("\x00b6");

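For each observation, insert the start-of-text character at the beginning and replace the whitespace and newlines with the corresponding characters.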

textdata = startoftextcharacter + textdata;
textdata = replace(textdata,[" " newline],[whitespacecharacter newlinecharacter]);
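The effect of the multi-pattern replace call can be checked on a short string. This is a sketch only: the two-line sample text and the marker variable names are hypothetical.

```matlab
whitespacemarker = compose("\x00b7");        % "·"
newlinemarker = compose("\x00b6");           % "¶"
sample = "thy self" + newline + "thy foe";   % hypothetical two-line text

% Each element of the first array is replaced by the matching element of the second.
marked = replace(sample,[" " newline],[whitespacemarker newlinemarker]);
disp(marked)
```

Making whitespace and newlines visible in this way helps when inspecting the response categories, since each one then has a printable name.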

Create a vocabulary of the unique characters in the text.

uniquecharacters = unique([textdata{:}]);
numuniquecharacters = numel(uniquecharacters);

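Loop over the text data and create a sequence of vectors representing the characters of each observation and a categorical sequence of characters for the responses. To denote the end of each observation, include the end-of-text character.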

numdocuments = numel(textdata);
xtrain = cell(1,numdocuments);
ytrain = cell(1,numdocuments);
for i = 1:numel(textdata)
    characters = textdata{i};
    sequencelength = numel(characters);
    
    % get indices of characters.
    [~,idx] = ismember(characters,uniquecharacters);
    
    % convert characters to vectors.
    x = zeros(numuniquecharacters,sequencelength);
    for j = 1:sequencelength
        x(idx(j),j) = 1;
    end
    
    % create vector of categorical responses with end of text character.
    charactersshifted = [cellstr(characters(2:end)')' endoftextcharacter];
    y = categorical(charactersshifted);
    
    xtrain{i} = x;
    ytrain{i} = y;
end
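The inner loop that sets one entry per column can also be written without a loop. The following is a sketch of an equivalent vectorized form using sub2ind, shown on a hypothetical three-character vocabulary rather than on the sonnet data.

```matlab
uniquecharacters = 'abc';   % hypothetical vocabulary
characters = 'cab';         % hypothetical observation
[~,idx] = ismember(characters,uniquecharacters);

sequencelength = numel(characters);
x = zeros(numel(uniquecharacters),sequencelength);

% Set element (idx(j),j) to 1 for every column j in a single assignment.
x(sub2ind(size(x),idx,1:sequencelength)) = 1;
```

Both forms produce the same matrix; the loop is arguably easier to read, while the vectorized form avoids per-character indexing for long sequences.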

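View the first observation and the size of the corresponding sequence. The sequence is a D-by-S matrix, where D is the number of features (the number of unique characters) and S is the sequence length (the number of characters in the text).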

textdata{1}
ans = 
'from·fairest·creatures·we·desire·increase,¶that·thereby·beauty's·rose·might·never·die,¶but·as·the·riper·should·by·time·decease,¶his·tender·heir·might·bear·his·memory:¶but·thou,·contracted·to·thine·own·bright·eyes,¶feed'st·thy·light's·flame·with·self-substantial·fuel,¶making·a·famine·where·abundance·lies,¶thy·self·thy·foe,·to·thy·sweet·self·too·cruel:¶thou·that·art·now·the·world's·fresh·ornament,¶and·only·herald·to·the·gaudy·spring,¶within·thine·own·bud·buriest·thy·content,¶and·tender·churl·mak'st·waste·in·niggarding:¶pity·the·world,·or·else·this·glutton·be,¶to·eat·the·world's·due,·by·the·grave·and·thee.'
size(xtrain{1})
ans = 1×2
    62   611

View the corresponding response sequence. The sequence is a 1-by-S categorical vector of responses.

ytrain{1}
ans = 1×611 categorical array
     f      r      o      m      ·      f      a      i      r      e      s      t      ·      c      r      e      a      t      u      r      e      s      ·      w      e      ·      d      e      s      i      r      e      ·      i      n      c      r      e      a      s      e      ,      ¶      t      h      a      t      ·      t      h      e      r      e      b      y      ·      b      e      a      u      t      y      '      s      ·      r      o      s      e      ·      m      i      g      h      t      ·      n      e      v      e      r      ·      d      i      e      ,      ¶      b      u      t      ·      a      s      ·      t      h      e      ·      r      i      p      e      r      ·      s      h      o      u      l      d      ·      b      y      ·      t      i      m      e      ·      d      e      c      e      a      s      e      ,      ¶      h      i      s      ·      t      e      n      d      e      r      ·      h      e      i      r      ·      m      i      g      h      t      ·      b      e      a      r      ·      h      i      s      ·      m      e      m      o      r      y      :      ¶      b      u      t      ·      t      h      o      u      ,      ·      c      o      n      t      r      a      c      t      e      d      ·      t      o      ·      t      h      i      n      e      ·      o      w      n      ·      b      r      i      g      h      t      ·      e      y      e      s      ,      ¶      f      e      e      d      '      s      t      ·      t      h      y      ·      l      i      g      h      t      '      s      ·      f      l      a      m      e      ·      w      i      t      h      ·      s      e      l      f      -      s      u      b      s      t      a      n      t      i      a      l      ·      f      u      e      l      ,      ¶      m      a      k      i      n      g      ·      a      ·      f      a      m      i      n      e      ·      w      
h      e      r      e      ·      a      b      u      n      d      a      n      c      e      ·      l      i      e      s      ,      ¶      t      h      y      ·      s      e      l      f      ·      t      h      y      ·      f      o      e      ,      ·      t      o      ·      t      h      y      ·      s      w      e      e      t      ·      s      e      l      f      ·      t      o      o      ·      c      r      u      e      l      :      ¶      t      h      o      u      ·      t      h      a      t      ·      a      r      t      ·      n      o      w      ·      t      h      e      ·      w      o      r      l      d      '      s      ·      f      r      e      s      h      ·      o      r      n      a      m      e      n      t      ,      ¶      a      n      d      ·      o      n      l      y      ·      h      e      r      a      l      d      ·      t      o      ·      t      h      e      ·      g      a      u      d      y      ·      s      p      r      i      n      g      ,      ¶      w      i      t      h      i      n      ·      t      h      i      n      e      ·      o      w      n      ·      b      u      d      ·      b      u      r      i      e      s      t      ·      t      h      y      ·      c      o      n      t      e      n      t      ,      ¶      a      n      d      ·      t      e      n      d      e      r      ·      c      h      u      r      l      ·      m      a      k      '      s      t      ·      w      a      s      t      e      ·      i      n      ·      n      i      g      g      a      r      d      i      n      g      :      ¶      p      i      t      y      ·      t      h      e      ·      w      o      r      l      d      ,      ·      o      r      ·      e      l      s      e      ·      t      h      i      s      ·      g      l      u      t      t      o      n      ·      b      e      ,      ¶      t      o      ·      e      a      t      ·    
  t      h      e      ·      w      o      r      l      d      '      s      ·      d      u      e      ,      ·      b      y      ·      t      h      e      ·      g      r      a      v      e      ·      a      n      d      ·      t      h      e      e      .      ␃ 

Create and Train LSTM Network

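Define the LSTM architecture. Specify a sequence-to-sequence LSTM classification network with 200 hidden units. Set the feature dimension of the training data (the number of unique characters) as the input size, and the number of categories in the responses as the output size of the fully connected layer.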

inputsize = size(xtrain{1},1);
numhiddenunits = 200;
numclasses = numel(categories([ytrain{:}]));
layers = [
    sequenceInputLayer(inputsize)
    lstmLayer(numhiddenunits,'OutputMode','sequence')
    fullyConnectedLayer(numclasses)
    softmaxLayer
    classificationLayer];

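Specify the training options using the trainingOptions function. Set the number of training epochs to 500 and the initial learning rate to 0.01. To prevent the gradients from exploding, set the gradient threshold to 2. Specify shuffling of the data every epoch by setting the 'Shuffle' option to 'every-epoch'. To monitor training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.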

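The mini-batch size option specifies the number of observations to process in a single iteration. Specify a mini-batch size that evenly divides the data so that the function uses all observations for training. Otherwise, the function ignores observations that do not complete a mini-batch. Set the mini-batch size to 77.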
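The divisibility requirement can be checked with a few lines of arithmetic. This sketch assumes 154 observations, one per sonnet; that count is an assumption here, and in practice numel(xtrain) gives the actual value.

```matlab
numobservations = 154;   % assumed number of sonnets; use numel(xtrain) in practice
minibatchsize = 77;

% Full mini-batches per epoch and observations left over.
iterationsperepoch = floor(numobservations/minibatchsize);
discarded = numobservations - iterationsperepoch*minibatchsize;

% All mini-batch sizes that divide the data evenly.
candidates = find(mod(numobservations,1:numobservations) == 0);
```

Under this assumption, a mini-batch size of 77 gives two iterations per epoch with no discarded observations.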

options = trainingOptions('adam', ...
    'MaxEpochs',500, ...
    'InitialLearnRate',0.01, ...
    'GradientThreshold',2, ...
    'MiniBatchSize',77, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the network.

net = trainNetwork(xtrain,ytrain,layers,options);

Generate New Text

Generate text with the trained network using the generatetext function, listed at the end of the example.

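The generatetext function generates text character by character, starting with the start-of-text character, and reconstructs the text using the special characters. The function samples each character using the output prediction scores. The function stops predicting when the network predicts the end-of-text character or when the generated text is 500 characters long.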

Generate text using the trained network.

generatedtext = generatetext(net,uniquecharacters,startoftextcharacter,newlinecharacter,whitespacecharacter,endoftextcharacter)
generatedtext = 
    "look, that your lepperites of such soous toor men,
     where than proud on your sweetest but lever ill lie.
     one of death a deal doth teal hearts come,
     and that which gives did mistress one learn
     made mens of tongue that hands hear,
     and all they with me, do i fortune to brief;
     and every peinted could with this right ampontion sorend
     by genilir'd lime thau hours, and wonder sposing,
     and night by day you waster'd then new;
     for ailling thuse borrowest vein fulse were of here spent,
     since my heart morey "

Text Generation Function

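The generatetext function generates text character by character, starting with the start-of-text character, and reconstructs the text using the special characters. The function samples each character using the output prediction scores. The function stops predicting when the network predicts the end-of-text character or when the generated text is 500 characters long.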

function generatedtext = generatetext(net,uniquecharacters,startoftextcharacter,newlinecharacter,whitespacecharacter,endoftextcharacter)

Create the vector of the start-of-text character by finding its index.

numuniquecharacters = numel(uniquecharacters);
x = zeros(numuniquecharacters,1);
idx = strfind(uniquecharacters,startoftextcharacter);
x(idx) = 1;

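Generate the text character by character using the trained LSTM network with predictAndUpdateState and datasample. Stop predicting when the network predicts the end-of-text character or when the generated text is 500 characters long. The datasample function requires Statistics and Machine Learning Toolbox™.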

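For large collections of data, long sequences, or large networks, predictions on the GPU are usually faster to compute than predictions on the CPU. Otherwise, predictions on the CPU are usually faster to compute. For single time step predictions, use the CPU. To use the CPU for prediction, set the 'ExecutionEnvironment' option of predictAndUpdateState to 'cpu'.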

generatedtext = "";
vocabulary = string(net.Layers(end).Classes);
maxlength = 500;
while strlength(generatedtext) < maxlength
    % predict the next character scores.
    [net,characterscores] = predictAndUpdateState(net,x,'ExecutionEnvironment','cpu');
    
    % sample the next character.
    newcharacter = datasample(vocabulary,1,'weights',characterscores);
    
    % stop predicting at the end of text.
    if newcharacter == endoftextcharacter
        break
    end
    
    % add the character to the generated text.
    generatedtext = generatedtext + newcharacter;
    
    % create a new vector for the next input.
    x(:) = 0;
    idx = strfind(uniquecharacters,newcharacter);
    x(idx) = 1;
end

Reconstruct the generated text by replacing the special characters with their corresponding whitespace and newline characters.

generatedtext = replace(generatedtext,[newlinecharacter whitespacecharacter],[newline " "]);
end
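The sampling step in generatetext relies on datasample, which requires Statistics and Machine Learning Toolbox. If that toolbox is not available, weighted sampling can be approximated with base MATLAB functions. The following is a sketch of inverse-CDF sampling, not the method used in this example; the class names and scores are hypothetical.

```matlab
vocabulary = ["a" "b" "c"];   % hypothetical classes
scores = [0.2 0.5 0.3];       % hypothetical softmax output for one time step

% Inverse-CDF sampling: draw a uniform random number and pick the first
% class whose cumulative probability reaches it.
cdf = cumsum(scores)/sum(scores);
cdf(end) = 1;                 % guard against floating-point round-off
idx = find(rand <= cdf,1,'first');
newcharacter = vocabulary(idx);
```

Classes with higher scores cover a wider interval of the cumulative distribution, so they are drawn proportionally more often, which matches the intent of the 'weights' option of datasample.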
