TM07 word2vec vs doc2vec#
Loading data#
!mkdir -p ./data
!wget -P ./data -N https://github.com/p4css/py4css/raw/main/data/sentiment.csv
import pandas as pd
df = pd.read_csv('data/sentiment.csv')
df.head(5)
| | tag | text |
---|---|---|
0 | P | 店家很給力,快遞也是相當快,第三次光顧啦 |
1 | N | 這樣的配置用Vista系統還是有點卡。 指紋收集器。 沒送原裝滑鼠還需要自己買,不太好。 |
2 | P | 不錯,在同等檔次酒店中應該是值得推薦的! |
3 | N | 哎! 不會是蒙牛乾的吧 嚴懲真凶! |
4 | N | 空尤其是三立電視臺女主播做的序尤其無趣像是硬湊那麼多字 |
Tokenization#
import jieba
df['token_text'] = df['text'].apply(lambda x:list(jieba.cut(x)))
df.head()
| | tag | text | token_text |
---|---|---|---|
0 | P | 店家很給力,快遞也是相當快,第三次光顧啦 | [店家, 很, 給力, ,, 快遞, 也, 是, 相當快, ,, 第三次, 光顧, 啦] |
1 | N | 這樣的配置用Vista系統還是有點卡。 指紋收集器。 沒送原裝滑鼠還需要自己買,不太好。 | [這樣, 的, 配置, 用, Vista, 系統, 還是, 有點, 卡, 。, , 指紋,... |
2 | P | 不錯,在同等檔次酒店中應該是值得推薦的! | [不錯, ,, 在, 同等, 檔次, 酒店, 中應, 該, 是, 值得, 推薦, 的, !] |
3 | N | 哎! 不會是蒙牛乾的吧 嚴懲真凶! | [哎, !, , 不會, 是, 蒙牛, 乾, 的, 吧, , 嚴懲, 真凶, !] |
4 | N | 空尤其是三立電視臺女主播做的序尤其無趣像是硬湊那麼多字 | [空, 尤其, 是, 三立, 電視, 臺, 女主播, 做, 的, 序, 尤其, 無趣, 像是... |
Training w2v#
from gensim.models import Word2Vec
w2v = Word2Vec(df['token_text'], min_count=1, vector_size=300, window=10, sg=0, workers=4)
Represented by w2v#
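import numpy as np
all_list = []
# Represent each document as the sum of its word vectors
for tokens in df["token_text"]:
    temp_w2v = np.zeros(300)  # must match the vector_size used to train w2v
    for tok in tokens:
        temp_w2v += w2v.wv[tok]  # min_count=1, so every token is in the vocabulary
    all_list.append(temp_w2v)
X = np.array(all_list)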
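The loop above sums the word vectors, so longer documents tend to get vectors with larger norms. A common alternative is to average them instead. The following is a minimal sketch of that idea, not the notebook's method: `mean_pool` is a hypothetical helper, and `toy_vectors` is a plain dict standing in for `w2v.wv` (assumed here to support `in` and `[]` lookups in the same way). Tokens missing from the lookup are skipped, which matters once `min_count` is raised above 1.

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.

    Returns a zero vector of length `dim` if no token is found.
    """
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy lookup standing in for w2v.wv (hypothetical 3-dimensional vectors)
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "hotel": np.array([3.0, 2.0, 0.0]),
}

# "unseen" is not in the lookup and is ignored
doc_vec = mean_pool(["good", "hotel", "unseen"], toy_vectors, dim=3)
print(doc_vec)
```

With a trained model, the same helper could be applied row by row, for example `mean_pool(tokens, w2v.wv, 300)`, under the assumption stated above.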
Training doc2vec#
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
Represented by TaggedDocument#
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(df["token_text"])]
Training model#
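# Create the model without a corpus so that it is trained only once;
# passing `tagged` to the constructor would already run a training pass.
d2v = Doc2Vec(vector_size=100, alpha=0.025, window=5,
              min_alpha=0.00025, min_count=5, dm=1, epochs=20)
# dm=1 : 'distributed memory' (PV-DM);
# dm=0 : 'distributed bag of words' (PV-DBOW)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)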
Represented by d2v#
import numpy as np
all_list = []
for tokens in df["token_text"]:
    all_list.append(d2v.infer_vector(tokens))
X = np.array(all_list)
X.shape
(6388, 100)
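One quick way to sanity-check a document matrix such as `X` is to look up the nearest neighbours of a single row by cosine similarity and read the corresponding texts. The sketch below uses only NumPy. `most_similar_rows` is a hypothetical helper, and `toy_X` is a small made-up matrix standing in for `X`.

```python
import numpy as np

def most_similar_rows(X, query_idx, top_n=3):
    """Return (row index, cosine similarity) for the rows closest to X[query_idx]."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)   # guard against zero-length rows
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf                # exclude the query row itself
    order = np.argsort(-sims)[:top_n]
    return [(int(i), float(sims[i])) for i in order]

# Toy matrix standing in for X (4 documents, 2 dimensions)
toy_X = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])

neighbours = most_similar_rows(toy_X, query_idx=0, top_n=2)
print(neighbours)
```

Applied to the real data, the returned indices could be passed to `df.loc[...]` to see whether neighbouring vectors correspond to similar reviews.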
Plotting#
Reduced by umap#
# !pip install umap-learn
import umap
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine').fit_transform(X)
Clustering by HDBSCAN#
# !pip install hdbscan
import hdbscan
from collections import Counter
cluster = hdbscan.HDBSCAN(min_cluster_size=50,
                          metric='euclidean',
                          cluster_selection_method='eom').fit(umap_embeddings)
df['cluster'] = list(cluster.labels_)
print(df.columns)
print(Counter(df['cluster']))
Index(['tag', 'text', 'token_text', 'cluster'], dtype='object')
Counter({1: 5825, 0: 563})
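The cluster sizes alone do not show whether the clusters relate to the sentiment labels in the `tag` column. One way to check is to count (cluster, tag) pairs. The sketch below uses `Counter` on two made-up lists that stand in for `df['cluster']` and `df['tag']`. `cluster_tag_counts` is a hypothetical helper, and the toy values are not results from the data.

```python
from collections import Counter

def cluster_tag_counts(clusters, tags):
    """Count how often each (cluster, tag) pair occurs."""
    return Counter(zip(clusters, tags))

# Toy stand-ins for df['cluster'] and df['tag']; -1 marks HDBSCAN noise points
toy_clusters = [0, 0, 1, 1, 1, -1]
toy_tags = ["P", "N", "P", "P", "N", "N"]

counts = cluster_tag_counts(toy_clusters, toy_tags)
for (cluster_id, tag), n in sorted(counts.items()):
    print(cluster_id, tag, n)
```

On the real DataFrame the call would be `cluster_tag_counts(df['cluster'], df['tag'])`. A cluster dominated by one tag would suggest that the embedding separates sentiment to some degree.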
Plotting#
import matplotlib.pyplot as plt
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(X)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=5, alpha=0.3)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=5, alpha=0.5, cmap='hsv_r')
plt.colorbar()
