TM07 word2vec vs doc2vec#

Loading data#

!mkdir -p ./data
!curl -L -o ./data/sentiment.csv https://github.com/p4css/py4css/raw/main/data/sentiment.csv
import pandas as pd
df = pd.read_csv('data/sentiment.csv')
df.head(5)
tag text
0 P 店家很給力,快遞也是相當快,第三次光顧啦
1 N 這樣的配置用Vista系統還是有點卡。 指紋收集器。 沒送原裝滑鼠還需要自己買,不太好。
2 P 不錯,在同等檔次酒店中應該是值得推薦的!
3 N 哎! 不會是蒙牛乾的吧 嚴懲真凶!
4 N 空尤其是三立電視臺女主播做的序尤其無趣像是硬湊那麼多字

Tokenization#

import jieba
df['token_text'] = df['text'].apply(lambda x:list(jieba.cut(x)))
df.head()
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/4m/shks9p8j0dnbv51nf7cyysfc0000gn/T/jieba.cache
Loading model cost 0.261 seconds.
Prefix dict has been built successfully.
tag text token_text
0 P 店家很給力,快遞也是相當快,第三次光顧啦 [店家, 很, 給力, ,, 快遞, 也, 是, 相當快, ,, 第三次, 光顧, 啦]
1 N 這樣的配置用Vista系統還是有點卡。 指紋收集器。 沒送原裝滑鼠還需要自己買,不太好。 [這樣, 的, 配置, 用, Vista, 系統, 還是, 有點, 卡, 。, , 指紋,...
2 P 不錯,在同等檔次酒店中應該是值得推薦的! [不錯, ,, 在, 同等, 檔次, 酒店, 中應, 該, 是, 值得, 推薦, 的, !]
3 N 哎! 不會是蒙牛乾的吧 嚴懲真凶! [哎, !, , 不會, 是, 蒙牛, 乾, 的, 吧, , 嚴懲, 真凶, !]
4 N 空尤其是三立電視臺女主播做的序尤其無趣像是硬湊那麼多字 [空, 尤其, 是, 三立, 電視, 臺, 女主播, 做, 的, 序, 尤其, 無趣, 像是...
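The token lists above still contain punctuation and whitespace tokens such as `,`, `!` and the blank strings between sentences. If those should not contribute to the embeddings, one option is to filter them out before training. The sketch below is an optional cleaning step that is not part of the original notebook; `is_noise` and `clean_tokens` are hypothetical helper names, and the sample list stands in for one row of `df['token_text']`.

```python
import unicodedata

def is_noise(token):
    """Return True for empty, whitespace-only, or punctuation/symbol-only tokens."""
    stripped = token.strip()
    if not stripped:
        return True
    # Unicode general categories starting with "P" (punctuation) or "S" (symbol)
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in stripped)

def clean_tokens(tokens):
    """Keep only the tokens that carry lexical content."""
    return [tok for tok in tokens if not is_noise(tok)]

# Toy sample standing in for one tokenized review
sample = ["店家", "很", "給力", ",", "快遞", " ", "啦", "!"]
cleaned = clean_tokens(sample)
print(cleaned)
```

On the real data this could be applied with `df['token_text'].apply(clean_tokens)`, assuming the column holds lists of strings as shown above.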

Training w2v#

from gensim.models import Word2Vec
w2v = Word2Vec(df['token_text'], min_count=1, vector_size=300, window=10, sg=0, workers=4)

Represented by w2v#

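import numpy as np

# Represent each document as the sum of its word vectors.
# min_count=1 in the Word2Vec call keeps every token in the vocabulary,
# so the lookup w2v.wv[tok] cannot raise a KeyError here.
all_list = []
for tokens in df["token_text"]:
    temp_w2v = np.zeros(w2v.wv.vector_size)
    for tok in tokens:
        temp_w2v += w2v.wv[tok]
    all_list.append(temp_w2v)

X = np.array(all_list)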
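Summing word vectors makes the length of a document vector grow with the number of tokens, so long reviews end up far from short ones regardless of content. Averaging is a common alternative. The following is a minimal sketch of mean pooling, not the notebook's method; `mean_pool` is a hypothetical helper and `toy_vectors` is a made-up two-dimensional vocabulary used only for illustration.

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy vocabulary (assumption: stands in for the trained word vectors)
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "fast": np.array([0.0, 1.0]),
}

doc_short = mean_pool(["good", "fast"], toy_vectors, dim=2)
doc_long = mean_pool(["good", "fast"] * 5, toy_vectors, dim=2)   # same words, 5x longer
doc_empty = mean_pool(["unknown"], toy_vectors, dim=2)           # no known tokens
```

With the trained model, the same helper should work as `mean_pool(tokens, w2v.wv, w2v.wv.vector_size)`, since gensim's `KeyedVectors` supports both `in` and `[]` lookups by token.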

Training doc2vec#

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

Represented by TaggedDocument#

tagged = [TaggedDocument(words=tokens, tags=[i]) 
          for i, tokens in enumerate(df["token_text"])]

Training model#

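# Passing the corpus to the constructor already builds the vocabulary and
# trains the model for `epochs` passes, so a separate d2v.train() call would
# train a second time and restart the learning-rate schedule.
# dm=1: "distributed memory" (PV-DM)
# dm=0: "distributed bag of words" (PV-DBOW)
d2v = Doc2Vec(tagged, vector_size=100, alpha=0.025, window=5,
              min_alpha=0.00025, min_count=5, dm=1, epochs=20)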
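The `alpha` and `min_alpha` arguments set the initial and final learning rate; gensim documents that the rate drops linearly to `min_alpha` as training progresses. The sketch below only illustrates that linear schedule at epoch boundaries with the values used above. `linear_alpha` is a hypothetical helper, and the library may update the rate at a finer granularity than once per epoch.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.00025):
    """Learning rate at a training progress in [0, 1], dropping linearly to min_alpha."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress

epochs = 20
# Rate at the start of training, after each epoch, and at the end
schedule = [linear_alpha(epoch / epochs) for epoch in range(epochs + 1)]

for epoch in (0, 10, 20):
    print(epoch, schedule[epoch])
```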

Represented by d2v#

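import numpy as np

# Infer a vector for every document. This overwrites the word2vec-based X,
# so the following steps operate on the doc2vec representation.
all_list = []
for tokens in df["token_text"]:
    all_list.append(d2v.infer_vector(tokens))

X = np.array(all_list)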
X.shape
(6388, 100)
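The UMAP step in this notebook compares document vectors with the cosine metric. To inspect what that metric considers similar, a small sketch of pairwise cosine similarity may help. It is illustrative only: `cosine_similarity_matrix` is a hypothetical helper and `toy_X` is a made-up array standing in for `X`.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms
    return unit @ unit.T

# Toy document vectors: rows 0 and 1 point the same way, row 2 is orthogonal
toy_X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(toy_X)

# Nearest neighbour of document 0, ignoring the document itself
masked = sims.copy()
np.fill_diagonal(masked, -np.inf)
nearest_to_first = int(np.argmax(masked[0]))
```

Because cosine similarity ignores vector length, rows 0 and 1 count as identical here even though their norms differ.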

Plotting#

Reduced by umap#

# !pip install umap-learn
import umap
umap_embeddings = umap.UMAP(n_neighbors=15, 
                            n_components=5, 
                            metric='cosine').fit_transform(X)

Clustering by HDBSCAN#

# !pip install hdbscan
import hdbscan
from collections import Counter

cluster = hdbscan.HDBSCAN(min_cluster_size=50,
                          metric='euclidean',
                          cluster_selection_method='eom').fit(umap_embeddings)

df['cluster'] = list(cluster.labels_)
print(df.columns)
print(Counter(df['cluster']))
Index(['tag', 'text', 'token_text', 'cluster'], dtype='object')
Counter({1: 5825, 0: 563})
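The `Counter` output above reports only cluster sizes. To check whether the clusters relate to the `P`/`N` sentiment tag, one option is to count `(cluster, tag)` pairs. This is a sketch under assumptions: `cluster_tag_counts` is a hypothetical helper, and the toy lists stand in for `df['cluster']` and `df['tag']`.

```python
from collections import Counter

def cluster_tag_counts(clusters, tags):
    """Count how often each (cluster label, sentiment tag) pair occurs."""
    return Counter(zip(clusters, tags))

# Toy labels; -1 is the label HDBSCAN assigns to noise points
toy_clusters = [0, 0, 1, 1, 1, -1]
toy_tags = ["P", "N", "P", "P", "N", "N"]

counts = cluster_tag_counts(toy_clusters, toy_tags)
for (cluster_id, tag), n in sorted(counts.items()):
    print(cluster_id, tag, n)
```

On the real data the call would be `cluster_tag_counts(df['cluster'], df['tag'])`; a cluster dominated by one tag would suggest the embedding separates sentiment, while an even split would suggest it captures something else.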

Plotting#

import matplotlib.pyplot as plt

# Project to 2D for plotting only; the clustering above used the 5D embedding.
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(X)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_

fig, ax = plt.subplots(figsize=(20, 10))
# HDBSCAN labels noise points as -1; draw them in gray, clusters in color.
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
ax.scatter(outliers.x, outliers.y, color='#BDBDBD', s=5, alpha=0.3)
points = ax.scatter(clustered.x, clustered.y, c=clustered.labels, s=5, alpha=0.5, cmap='hsv_r')
fig.colorbar(points, ax=ax)
[Figure: 2D UMAP projection of the document vectors, colored by HDBSCAN cluster label, with outliers in gray]