Topic Modeling Visualization in Python: Interactive Visualization with LDA and t-SNE | Code and Data Included

2023-07-25 12:43 Author: 拓端tecdat

Full text download link: http://tecdat.cn/?p=6917

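I use Latent Dirichlet Allocation (LDA) to extract topics from a collection of papers. This tutorial walks through a natural language processing pipeline, starting from the raw data and moving through preparation, modeling, and visualization of the papers.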

We will cover the following points

Topic modeling with LDA
Visualizing the topic model with pyLDAvis
Visualizing the LDA results with t-SNE

In [1]:

%pylab inline
from scipy import sparse as sp

Populating the interactive namespace from numpy and matplotlib

In [2]:

docs = array(p_df['PaperText'])

Preprocessing and vectorizing the documents

In [3]:

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove short words (three characters or fewer).
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in the documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs

In [4]:

docs = docs_preprocessor(docs)
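To see what the filtering steps do, here is a minimal standard-library sketch applied to one made-up sentence. It mirrors the lowercase, tokenize, drop-numbers, and drop-short-tokens steps but skips lemmatization; `simple_preprocess` and the sample sentence are hypothetical and not part of the original notebook.

```python
import re

def simple_preprocess(text, min_len=4):
    """Lowercase, split on word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r'\w+', text.lower())
    tokens = [t for t in tokens if not t.isdigit()]
    return [t for t in tokens if len(t) >= min_len]

sample = "In 2016 we trained 3 deep networks on MNIST and CIFAR10."
print(simple_preprocess(sample))
# ['trained', 'deep', 'networks', 'mnist', 'cifar10']
```

Note that `cifar10` survives because only tokens made up entirely of digits are removed.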

Computing bigrams/trigrams:

The topics are very similar; what distinguishes them is phrases rather than single words.

In [5]:

from gensim.models import Phrases

# Add bigrams and trigrams to the documents (only those that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a phrase found by the trigram model, add to document.
            docs[idx].append(token)

Using TensorFlow backend.
/opt/conda/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
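The idea behind phrase detection can be sketched without gensim. The simplified version below joins adjacent word pairs using only a raw count threshold, whereas `Phrases` also applies a scoring function; `join_frequent_pairs` and the toy documents are hypothetical names used only for illustration.

```python
from collections import Counter

def join_frequent_pairs(docs, min_count=2):
    """Return, per document, the adjacent word pairs seen at least min_count times overall."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    return [['_'.join(p) for p in zip(doc, doc[1:]) if p in frequent] for doc in docs]

toy_docs = [
    ['gradient', 'descent', 'converges'],
    ['stochastic', 'gradient', 'descent'],
    ['neural', 'network'],
]
for doc, phrases in zip(toy_docs, join_frequent_pairs(toy_docs)):
    doc.extend(phrases)  # append phrase tokens to the document, as the notebook does
print(toy_docs)
```

Only `gradient_descent` occurs twice, so it is the only phrase token appended; the single words stay in each document alongside it.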

Removing rare and common words

In [6]:

from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words in initial documents: 39534
Number of unique words after removing rare and common words: 6001

After removing rare and common words, we are left with only about 15% of the original vocabulary (6,001 of 39,534 words).
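The document-frequency filter itself is simple to express. The sketch below is a simplified stand-in for `filter_extremes`, assuming only the `no_below` and `no_above` thresholds matter; it ignores gensim's other options and its integer rounding of the upper bound. `filter_by_doc_freq` and the toy data are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep words found in at least no_below docs and in at most a no_above fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    return sorted(w for w, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ['model', 'topic', 'word'],
    ['model', 'topic', 'bandit'],
    ['model', 'kernel', 'word'],
    ['model', 'kernel', 'regret'],
]
print(filter_by_doc_freq(toy_docs))
# ['kernel', 'topic', 'word']
```

Here `model` is dropped for appearing in every document, and `bandit` and `regret` are dropped for appearing in only one.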

Vectorizing the data:
The first step is to get a bag-of-words representation of each document.

In [7]:

corpus = [dictionary.doc2bow(doc) for doc in docs]
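A bag-of-words vector is just a list of (token id, count) pairs for the tokens that are in the dictionary. As a minimal sketch of that representation, assuming a plain `dict` in place of the gensim `Dictionary` (`to_bow` and `token2id` are hypothetical names):

```python
from collections import Counter

def to_bow(doc, token2id):
    """Map a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = {'bandit': 0, 'regret': 1, 'reward': 2}
print(to_bow(['regret', 'bandit', 'regret', 'policy'], token2id))
# [(0, 1), (1, 2)]
```

The token `policy` is not in the vocabulary, so it is silently ignored, which is also why filtering the dictionary shrinks every document's vector.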

In [8]:

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 6001
Number of documents: 403

With the bag-of-words corpus, we can move on to learning our topic model from the documents.

Training the LDA model

In [9]:

from gensim.models import LdaModel

In [10]:

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

CPU times: user 3min 58s, sys: 348 ms, total: 3min 58s
Wall time: 3min 59s

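How do we choose the number of topics?

LDA is an unsupervised technique, which means that before running the model we do not know how many topics exist in our corpus. Topic coherence is one of the main techniques used to determine the number of topics.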
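As a rough illustration of what a coherence score measures, the sketch below computes one common UMass-style form: for each pair of a topic's top words, it takes the log of how often the two words share a document relative to how often the higher-ranked word occurs. This is a simplified standard-library version that assumes every top word occurs in at least one document; `umass_coherence` and the toy data are hypothetical. In practice, gensim's `CoherenceModel` class can be used to compare models trained with different numbers of topics.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Mean of log((D(w_i, w_j) + 1) / D(w_i)) over pairs with w_i ranked above w_j."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = [
        math.log((doc_count(higher, lower) + 1) / doc_count(higher))
        for higher, lower in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

toy_docs = [
    ['policy', 'reward', 'agent'],
    ['policy', 'reward'],
    ['kernel', 'matrix'],
    ['policy', 'kernel'],
]
coherent = umass_coherence(['policy', 'reward'], toy_docs)
mixed = umass_coherence(['policy', 'matrix'], toy_docs)
print(coherent, mixed)
```

Words that tend to occur in the same documents score higher (closer to zero) than words that never occur together, so a model whose topics score higher on average is preferred.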

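Here, however, I used the LDA visualization tool pyLDAvis, tried several numbers of topics, and compared the results. Four appeared to be the number that separated the topics best.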

In [11]:

# Newer pyLDAvis releases (3.3+) rename this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [12]:

pyLDAvis.gensim.prepare(model, corpus, dictionary)

Out[12]:

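What do we see here?

The left panel, labeled Intertopic Distance Map, shows circles that represent the different topics and the distances between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle corresponds to the relative frequency of that topic in the corpus.

How do we evaluate our model?

Split each document into two parts and check whether the topics assigned to them are similar. => the more similar, the better

Compare randomly chosen documents with each other. => the less similar, the better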

In [13]:

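from sklearn.metrics.pairwise import cosine_similarity

p_df['tokenz'] = docs

# Split each document's token list into a first and a second half.
# len(l) // 2 replaces int0(len(l)/2); the int0 alias is gone from recent NumPy releases.
docs1 = p_df['tokenz'].apply(lambda l: l[:len(l) // 2])
docs2 = p_df['tokenz'].apply(lambda l: l[len(l) // 2:])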

Transforming the data

In [14]:

corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Transform the corpora with the LDA model.
lda_corpus1 = model[corpus1]
lda_corpus2 = model[corpus2]

In [15]:

from collections import OrderedDict

def get_doc_topic_dist(model, corpus, kwords=False):
    '''
    LDA transformation: for each document, the model returns only the topics with non-zero weight.
    This function converts the documents into a dense matrix in topic space.
    '''
    top_dist = []
    keys = []

    for d in corpus:
        tmp = {i: 0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [array(vals)]
        if kwords:
            keys += [array(vals).argmax()]

    return array(top_dist), keys
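The core of this function is turning a sparse list of (topic id, weight) pairs into a fixed-length vector. A minimal standard-library sketch of that step, using a made-up model output for one document (`to_dense` and the sample values are hypothetical):

```python
def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, weight) pairs into a fixed-length list, with 0.0 for missing topics."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_topics:
        dense[topic_id] = weight
    return dense

# Made-up output for one document: topics 0 and 2 were dropped as negligible.
sparse = [(1, 0.25), (3, 0.75)]
print(to_dense(sparse, num_topics=4))
# [0.0, 0.25, 0.0, 0.75]
```

Stacking one such row per document gives the dense document-by-topic matrix that the later similarity and t-SNE steps operate on.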

Intra similarity: cosine similarity for corresponding parts of a doc(higher is better):
0.906086532099
Inter similarity: cosine similarity between random parts (lower is better):
0.846485334252
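The cell that produced these two numbers is not shown. The sketch below illustrates the evaluation idea only and is not the author's code: it averages the cosine similarity between the topic distributions of matching halves (intra) and between halves of different, randomly paired documents (inter). The function names and the small made-up distributions are assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def intra_inter(first_halves, second_halves, seed=0):
    """Mean similarity of matching halves vs. halves paired with a different document."""
    n = len(first_halves)
    intra = sum(cosine(first_halves[i], second_halves[i]) for i in range(n)) / n
    rng = random.Random(seed)
    others = [rng.choice([j for j in range(n) if j != i]) for i in range(n)]
    inter = sum(cosine(first_halves[i], second_halves[j]) for i, j in enumerate(others)) / n
    return intra, inter

# Made-up topic distributions for the two halves of three documents.
halves_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
halves_b = [[0.8, 0.2, 0.0], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
intra, inter = intra_inter(halves_a, halves_b)
print(intra, inter)
```

A large gap between the two values suggests that the topics discriminate well between documents. In the output above the gap is small (about 0.91 versus 0.85), which fits the earlier remark that the topics in this corpus are very similar.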

Let's look at the words that appear in each topic.

In [17]:

def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Print and return a list of the topn words for a topic.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

    return terms

In [18]:

term                 frequency

Topic 0 |---------------------
data_set             0.006
embedding            0.004
query                0.004
document             0.003
tensor               0.003
multi_label          0.003
graphical_model      0.003
singular_value       0.003
topic_model          0.003
margin               0.003

Topic 1 |---------------------
policy               0.007
regret               0.007
bandit               0.006
reward               0.006
active_learning      0.005
agent                0.005
vertex               0.005
item                 0.005
reward_function      0.005
submodular           0.004

Topic 2 |---------------------
convolutional        0.005
generative_model     0.005
variational_inference 0.005
recurrent            0.004
gaussian_process     0.004
fully_connected      0.004
recurrent_neural     0.004
hidden_unit          0.004
deep_learning        0.004
hidden_layer         0.004

Topic 3 |---------------------
convergence_rate     0.007
step_size            0.006
matrix_completion    0.006
rank_matrix          0.005
gradient_descent     0.005
regret               0.004
sample_complexity    0.004
strongly_convex      0.004
line_search          0.003
sample_size          0.003

From the output above, we can inspect each topic and assign it an interpretable label. Here I labeled them as follows:

In [19]:

# Labels ordered to match the topic words printed above.
top_labels = {0: 'Statistics', 1: 'Online Learning', 2: 'Deep Learning', 3: 'Numerical Analysis'}
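Once topics have labels, each document can be tagged with the label of its dominant topic. The following minimal sketch shows that step with placeholder labels and made-up topic distributions; `label_documents` is a hypothetical helper, not part of the notebook.

```python
def label_documents(doc_topic_rows, labels):
    """Assign each document the label of its highest-weight topic."""
    result = []
    for row in doc_topic_rows:
        dominant = max(range(len(row)), key=row.__getitem__)
        result.append(labels[dominant])
    return result

labels = {0: 'topic A', 1: 'topic B', 2: 'topic C'}
rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(label_documents(rows, labels))
# ['topic A', 'topic C']
```

This hard assignment discards the mixture information, which is acceptable for coloring points in a plot but loses detail for documents that split evenly across topics.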

In [20]:

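import re
import nltk
from nltk.corpus import stopwords

# Assumption: the source cell is truncated and does not show how `stops` is built;
# NLTK's English stop-word list is used here.
stops = set(stopwords.words('english'))

def paper_to_wordlist(paper):
    '''
    Convert the text of a paper into a list of words.
    '''
    # 1. Remove non-letters.
    paper_text = re.sub("[^a-zA-Z]", " ", paper)
    # 2. Convert words to lowercase and split them.
    words = paper_text.lower().split()
    # 3. Remove stop words.
    words = [w for w in words if not w in stops]
    # 4. Remove short words.
    words = [t for t in words if len(t) > 2]
    # 5. Lemmatize.
    words = [nltk.stem.WordNetLemmatizer().lemmatize(t) for t in words]
    return words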

In [21]:

from sklearn.feature_extraction.text import TfidfVectorizer

tvectorizer = TfidfVectorizer(input='content', analyzer = 'word', lowercase=True, stop_words='english',\
                                  tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=40, max_df=0.20,\
                                  norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

dtm = tvectorizer.fit_transform(p_df['PaperText']).toarray()
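To make the weighting options concrete, here is a small standard-library sketch of the arithmetic selected by `sublinear_tf=True`, `smooth_idf=True`, and `norm='l2'`: term frequency becomes 1 + ln(tf), inverse document frequency becomes ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit length. It works on pre-tokenized toy documents and does not model `min_df`, `max_df`, `ngram_range`, or stop-word removal; `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with sublinear tf, smoothed idf, and L2-normalized rows."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [(1 + math.log(tf[t])) * idf[t] if tf[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return vocab, rows

toy_docs = [['topic', 'topic', 'model'], ['model', 'kernel']]
vocab, rows = tfidf_rows(toy_docs)
print(vocab)
print(rows[0])
```

In the first toy document, `topic` outweighs `model` both because it occurs twice and because `model` appears in every document and therefore gets the minimum idf.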

In [22]:

top_dist = []
for d in corpus:
    tmp = {i: 0 for i in range(num_topics)}
    tmp.update(dict(model[d]))
    vals = list(OrderedDict(tmp).values())
    top_dist += [array(vals)]

In [23]:

top_dist, lda_keys = get_doc_topic_dist(model, corpus, True)
# scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
features = tvectorizer.get_feature_names()

In [24]:

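top_ws = []
for n in range(len(dtm)):
    # Indices of the four highest TF-IDF terms for document n.
    inds = argsort(dtm[n])[::-1][:4]
    tmp = [features[i] for i in inds]

    top_ws += [' '.join(tmp)]

# Assumption: the lines that fill these two columns are not shown in the source;
# both columns are used by the cells below.
p_df['Text_Rep'] = top_ws
p_df['clusters'] = lda_keys

cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue', 5:'salmon', 6:'orange', 7:'maroon', 8:'crimson', 9:'black', 10:'gray'}

p_df['colors'] = p_df['clusters'].apply(lambda l: cluster_colors[l])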

In [25]:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(top_dist)
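For reference, the objective that t-SNE minimizes can be written as follows. This is the standard formulation rather than anything specific to this notebook; here $x_i$ would be a row of `top_dist` (a document's topic distribution), $y_i$ the matching 2-D row of `X_tsne`, $N$ the number of documents, and $\sigma_i$ a per-point bandwidth set from the perplexity parameter.

```latex
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the cost penalizes placing similar documents far apart much more than the reverse, local neighborhoods are preserved well, while distances between well-separated clusters in the plot should not be over-interpreted.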

In [26]:

p_df['X_tsne'] = X_tsne[:, 0]
p_df['Y_tsne'] = X_tsne[:, 1]

In [27]:

from bokeh.plotting import figure, show, output_notebook, save  # output_file
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource

output_notebook()

BokehJS 0.12.5 successfully loaded.

In [28]:

source = ColumnDataSource(dict(
    x=p_df['X_tsne'],
    y=p_df['Y_tsne'],
    color=p_df['colors'],
    label=p_df['clusters'].apply(lambda l: top_labels[l]),
#     msize= p_df['marker_size'],
    topic_key= p_df['clusters'],
    title= p_df[u'Title'],
    content = p_df['Text_Rep']
))

In [29]:

title = 'T-SNE visualization of topics'

# plot_lda is the Bokeh figure; the line that creates it is not shown in this excerpt.
plot_lda.scatter(x='x', y='y', legend='label', source=source,
                 color='color', alpha=0.8, size=10)#'msize', )

show(plot_lda)

This article is an excerpt from "Topic Modeling Visualization in Python: Interactive Visualization with LDA and t-SNE".
