AS08 Youtube comment clustering

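In the AS06 keyword-extraction exercise, we used about 8,000 comments from videos related to the Anti-Extradition Law Amendment Bill (anti-ELAB) movement as the example. This assignment asks you to work with the same dataset: first filter the keywords, then apply clustering to find out which topics, stances, or facets of discussion appear in the comments. The outcome is open-ended, so the topics or stances each person finds may differ, but the goal is the same: identify which topics these texts contain and what they discuss. Questions framed this way also come up in quite a few term-project groups. While working on the assignment you will probably keep asking yourself: How can topics be extracted from this? Can they really be extracted? Are the extracted topics valid? Are they meaningful? Then, to obtain topics that convince you, you may filter the comments again and again, only to run into doubts about representativeness: after shrinking the data to such a small subset, are the results still representative?

One purpose of this assignment is therefore to let you experience firsthand which problems you are likely to face if your project tries to answer the question "what topics are in these texts?".

There are no specific requirements for this week's code. How you process the data and how you interpret the output are up to you. You are only required to report which filtering you applied, to plot a visualization of the clustering result, and to give your own assessment of whether clustering is an effective way to find topics.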

(TODO) Answer the following questions#

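(TODO) Document filtering strategies#

The dataset provided for this assignment contains 8,000 comments. Each comment has fields for its publication time, update time, and like count, and you can also compute the length of the comment text. What filtering strategy did you adopt? Answer in the text box below (edit the Markdown below; you must describe at least five consecutive processing steps). This kind of process description usually appears in papers or research reports, or as a short paragraph on an academic poster.

(Sample answer)

The original data contains 8,000 comments.
After removing comments whose XX count is below XX, XXX comments remain.
After removing comments whose XX is below XX, X comments remain.
…
…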
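
A step-by-step record like the sample answer can be produced while filtering. The following is a minimal sketch, assuming a pandas DataFrame with the `textOriginal` and `likeCount` columns of the dataset; the toy rows, the thresholds, and the helper name `apply_filters` are hypothetical and only illustrate how to log the remaining count after each step.

```python
import pandas as pd

# Toy stand-in for the real data; only the column names come from the dataset.
df = pd.DataFrame({
    "textOriginal": ["好", "支持香港人，加油", "支持香港人，加油", None,
                     "這段影片的分析很清楚，謝謝分享"],
    "likeCount": [0, 5, 5, 2, 12],
})

def apply_filters(data, steps):
    """Apply (description, function) steps in order and log the remaining rows."""
    log = [("original data", len(data))]
    for description, step in steps:
        data = step(data)
        log.append((description, len(data)))
    return data, log

# Hypothetical thresholds; choose your own and justify them in the write-up.
steps = [
    ("drop comments with missing text",
     lambda d: d.dropna(subset=["textOriginal"])),
    ("drop duplicated comment text",
     lambda d: d.drop_duplicates(subset=["textOriginal"])),
    ("drop comments shorter than 5 characters",
     lambda d: d[d["textOriginal"].str.len() >= 5]),
    ("drop comments with fewer than 1 like",
     lambda d: d[d["likeCount"] >= 1]),
]

filtered, log = apply_filters(df, steps)
for description, remaining in log:
    print(f"{description}: {remaining} comments remain")
```

Keeping the steps in a list makes it easy to reorder them or add a fifth one, and the printed log can be copied into the answer almost verbatim.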

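(TODO) Vocabulary filtering strategies#

The teaching examples introduced a number of strategies for filtering out unnecessary words or keeping necessary ones. How did you filter the keywords?

After tokenization, the original data contains XXX distinct tokens. (Required)
After removing punctuation, XXX distinct tokens remain. (Required)
After a further XXX step, XXX distinct tokens remain.
…
…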
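
The distinct-token counts asked for here can be tracked in the same way. This is a sketch only: it assumes the comments are already tokenized into lists (the tokenizer is not specified here), and the stop-word list and the document-frequency threshold are hypothetical choices.

```python
import unicodedata
from collections import Counter

# Toy documents that are already tokenized (the tokenizer is an assumption).
docs = [
    ["香港", "人", "加油", "！"],
    ["警察", "的", "暴力", "，", "香港", "加油"],
    ["台灣", "的", "疫情", "。"],
]
stop_words = {"的", "人"}  # hypothetical stop-word list

def is_punctuation(token):
    """True when every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def vocabulary_size(documents):
    """Number of distinct tokens across all documents."""
    return len({token for doc in documents for token in doc})

sizes = [("after tokenization", vocabulary_size(docs))]

docs = [[t for t in doc if not is_punctuation(t)] for doc in docs]
sizes.append(("after removing punctuation", vocabulary_size(docs)))

docs = [[t for t in doc if t not in stop_words] for doc in docs]
sizes.append(("after removing stop words", vocabulary_size(docs)))

# Document frequency: the number of documents in which each token appears.
doc_freq = Counter(token for doc in docs for token in set(doc))
docs = [[t for t in doc if doc_freq[t] >= 2] for doc in docs]
sizes.append(("after dropping tokens seen in fewer than 2 documents",
              vocabulary_size(docs)))

for description, size in sizes:
    print(f"{description}: {size} distinct tokens")
```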

Loading youtube data#

by colab or jupyter-lab#

!mkdir -p ./data
!wget -P ./data -N https://github.com/p4css/py4css/raw/main/data/yt-comment-antiELAB.xlsx

--2025-09-21 21:59:08--  https://github.com/p4css/py4css/raw/main/data/yt-comment-antiELAB.xlsx
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/p4css/py4css/main/data/yt-comment-antiELAB.xlsx [following]
--2025-09-21 21:59:09--  https://raw.githubusercontent.com/p4css/py4css/main/data/yt-comment-antiELAB.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 644405 (629K) [application/octet-stream]
Saving to: './data/yt-comment-antiELAB.xlsx'

yt-comment-antiELAB 100%[===================>] 629.30K  43.1KB/s    in 15s

Last-modified header missing -- time-stamps turned off.
2025-09-21 21:59:26 (43.1 KB/s) - './data/yt-comment-antiELAB.xlsx' saved [644405/644405]

import pandas as pd
df = pd.read_excel('data/yt-comment-antiELAB.xlsx') 
df

/Users/jirlong/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.4' or newer of 'numexpr' (version '2.8.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED
/Users/jirlong/opt/anaconda3/lib/python3.9/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.4' currently installed).
  from pandas.core import (

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Input In [2], in <cell line: 2>()
import pandas as pd
----> 2 df = pd.read_excel('data/yt-comment-antiELAB.xlsx') 
df

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/excel/_base.py:495, in read_excel(io, sheet_name, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, date_format, thousands, decimal, comment, skipfooter, storage_options, dtype_backend, engine_kwargs)
if not isinstance(io, ExcelFile):
   should_close = True
--> 495     io = ExcelFile(
       io,
       storage_options=storage_options,
       engine=engine,
       engine_kwargs=engine_kwargs,
   )
elif engine and engine != io.engine:
   raise ValueError(
       "Engine should not be specified when passing "
       "an ExcelFile - ExcelFile already has the engine set"
   )

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/excel/_base.py:1567, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options, engine_kwargs)
self.engine = engine
self.storage_options = storage_options
-> 1567 self._reader = self._engines[engine](
   self._io,
   storage_options=storage_options,
   engine_kwargs=engine_kwargs,
)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py:552, in OpenpyxlReader.__init__(self, filepath_or_buffer, storage_options, engine_kwargs)
@doc(storage_options=_shared_docs["storage_options"])
def __init__(
   self,
   (...)
   engine_kwargs: dict | None = None,
) -> None:
   """
   Reader using openpyxl engine.

   (...)
       Arbitrary keyword arguments passed to excel engine.
   """
--> 552     import_optional_dependency("openpyxl")
   super().__init__(
       filepath_or_buffer,
       storage_options=storage_options,
       engine_kwargs=engine_kwargs,
   )

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/compat/_optional.py:164, in import_optional_dependency(name, extra, errors, min_version)
   return None
elif errors == "raise":
--> 164     raise ImportError(msg)
else:
   return None

ImportError: Pandas requires version '3.1.0' or newer of 'openpyxl' (version '3.0.9' currently installed).
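
The traceback ends with a version requirement on `openpyxl`, so the installed version can be checked before calling `pd.read_excel`. The snippet below is a small sketch under that assumption; the helper names `version_tuple` and `meets_minimum` are hypothetical, and the required version is simply copied from the error message (it depends on the installed pandas version).

```python
from importlib import metadata

REQUIRED = "3.1.0"  # copied from the ImportError message; varies with pandas

def version_tuple(version):
    """Turn '3.0.9' into (3, 0, 9); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

def meets_minimum(installed, required):
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(required)

try:
    installed = metadata.version("openpyxl")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("openpyxl is not installed")
elif meets_minimum(installed, REQUIRED):
    print(f"openpyxl {installed} is new enough")
else:
    print(f"openpyxl {installed} is older than {REQUIRED}; upgrade it first")
```

If the check fails, upgrading the package (for example with `pip install --upgrade openpyxl`) and restarting the kernel should let the cell run.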

by local notebook#

import pandas as pd
df = pd.read_excel('data/yt-comment-antiELAB.xlsx')
df

	videoId	commentId	parentId	authorDisplayName	textOriginal	likeCount	publishedAt	updatedAt
0	2_tXjqhbe5E	UgwQIF9qNqGydjE2NkN4AaABAg	NaN	CHI-HAU CHEN	鄭大哥和吉雷米的互動很可愛，歡迎加入台灣這個溫馨的家庭喔~~~~	19	2020-04-02T01:40:29Z	2020-04-02T01:40:29Z
1	2_tXjqhbe5E	UgwuP0Jva-U69xTMaEF4AaABAg	NaN	Funky Duck	牛嘴掩.....\n這高雄老外的台語能力應該比台北人好	1	2020-04-07T04:36:37Z	2020-04-07T04:36:37Z
2	2_tXjqhbe5E	UgyVlhx36V2XWLOM9MZ4AaABAg	NaN	Ally	Zoom把客戶資料洩給中國，現在紐約市已不讓學生用這系統視訊上課了!	0	2020-04-06T19:54:04Z	2020-04-06T19:54:04Z
3	2_tXjqhbe5E	Ugz4v7OudQxaDXYyegZ4AaABAg	NaN	Ally	法國人執行居家避疫比起美國比較確實，外出還要有通行證	0	2020-04-06T19:18:24Z	2020-04-06T19:18:24Z
4	2_tXjqhbe5E	Ugyb-ogACbZWVewnU-94AaABAg	NaN	Kitty Wong	仆街鄭	0	2020-04-06T15:20:34Z	2020-04-06T15:20:34Z
...	...	...	...	...	...	...	...	...
7995	ySMAcMAL6rY	UgzU9oD5I6q1qHLzx_B4AaABAg	NaN	nova lee	自己不是會算嗎	1	2019-08-22T04:00:31Z	2019-08-22T04:00:31Z
7996	zHlhQoT9OF0	Ugz2ff-Be0yCoEUY-Rd4AaABAg	NaN	the world Rock	常德說的真好	3	2020-04-10T03:21:34Z	2020-04-10T03:21:34Z
7997	zQWzh4yj_g8	Ugwex9I2lZWa-DgeWrx4AaABAg	NaN	Zoe Su	自私的人多的是，水準就和中國人一樣	1	2020-02-12T22:13:24Z	2020-02-12T22:13:24Z
7998	zQWzh4yj_g8	UgxpokMr9hHK0Ugcvl94AaABAg	NaN	非也非也	人都自私自利的，適者生存，不適者自己想辦法，沒有人會幫的，靠自己最實在	0	2020-02-12T04:53:28Z	2020-02-12T04:53:28Z
7999	zQWzh4yj_g8	Ugwhiijo6yStMhOZqVB4AaABAg	NaN	你们是愚民	鑽石公主號我們的同胞22人以及留在大陸我們的同胞救回來了沒有，菜英文民進黨更無情，更自私吧！...	0	2020-02-12T04:25:58Z	2020-02-12T04:25:58Z

8000 rows × 8 columns

Feature selections#

Build the feature representation with tf-idf or word2vec.

Clustering#

Follow the K-means clustering steps to cluster the documents.
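
A minimal sketch of this step with scikit-learn's `KMeans`, run on toy tf-idf vectors. The comments, `n_clusters=2`, and the `random_state` are illustrative assumptions, not values taken from the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments, already tokenized and joined with spaces (an assumption).
tokenized_comments = [
    "警察 暴力 警察",
    "警察 暴力 抗議",
    "疫情 口罩 台灣",
    "疫情 口罩 防疫",
]
X = TfidfVectorizer(analyzer=str.split).fit_transform(tokenized_comments)

# n_init restarts the algorithm from several random initial centers and keeps
# the best run; random_state makes the result reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, comment in zip(labels, tokenized_comments):
    print(label, comment)
```

The cluster numbers themselves are arbitrary; only which comments share a label is meaningful.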

Evaluating#

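K-means requires choosing a good value of K in advance, that is, how many clusters to form, so follow the Clustering tutorial to find a suitable K first.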
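
Two common heuristics for picking K are the elbow of the inertia curve and the silhouette score. The sketch below computes both on synthetic points; the data, the range of K, and the use of the silhouette score as the selection rule are assumptions, and the tutorial may use a different criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D points around three centers, standing in for document vectors.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# Pick the K with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```

On real comment data the curves are usually much less clear-cut than on this toy example, which is part of what the assignment asks you to reflect on.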

Final clustering by k=?#

Run the clustering with the K value found above.

Visualization#

Visualizing Doc distribution#

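Following the Clustering tutorial, reduce all documents to two dimensions and draw a scatter plot to check whether the clustering result shows visually distinguishable clusters.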
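
One way to sketch this projection is PCA followed by a matplotlib scatter plot colored by cluster label. PCA is an assumption here (the tutorial may use another method such as t-SNE), and the synthetic vectors only stand in for real document features.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic 20-dimensional "document vectors" around three centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 20)) for c in centers])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project every document onto the first two principal components.
coords = PCA(n_components=2, random_state=0).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
ax.legend(*scatter.legend_elements(), title="cluster")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_title("Documents projected to 2-D")
```

For a sparse tf-idf matrix, `TruncatedSVD` from `sklearn.decomposition` may be a better fit than PCA because it does not require converting the matrix to a dense array.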

Visualizing Words distribution of each cluster#

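Following the Clustering tutorial, compute the keyword distribution of the documents in each cluster. After removing stop words and punctuation, find the important keywords of each cluster and show each cluster's top ten keywords in a horizontal bar chart, so that keyword differences between clusters can be observed.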
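
The per-cluster keyword chart could be sketched as follows. Raw term counts are used as the importance measure, which is an assumption (tf-idf weights would also work), and the toy documents, labels, drop list, and the helper name `top_keywords` are hypothetical. Note that matplotlib needs a font with CJK glyphs to render Chinese tick labels.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Toy tokenized comments and their cluster labels (both hypothetical).
docs = [
    ["警察", "暴力", "警察", "的"],
    ["警察", "抗議", "，", "暴力"],
    ["疫情", "口罩", "的", "疫情"],
    ["疫情", "防疫", "口罩", "疫情"],
]
labels = [0, 0, 1, 1]
drop = {"的", "，"}  # hypothetical stop words and punctuation

def top_keywords(docs, labels, drop, n=10):
    """Return the n most frequent kept tokens for each cluster label."""
    counts = {}
    for tokens, label in zip(docs, labels):
        kept = (t for t in tokens if t not in drop)
        counts.setdefault(label, Counter()).update(kept)
    return {label: counter.most_common(n) for label, counter in counts.items()}

top = top_keywords(docs, labels, drop)

# One horizontal bar chart per cluster, side by side.
fig, axes = plt.subplots(1, len(top), figsize=(4 * len(top), 3), squeeze=False)
for ax, (label, pairs) in zip(axes[0], sorted(top.items())):
    words, freqs = zip(*pairs)
    ax.barh(words, freqs)
    ax.invert_yaxis()  # most frequent keyword on top
    ax.set_title(f"cluster {label}")
```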

Explaining your results#

From the clustering results above, which comment clusters do you observe? Evaluate the keyword clusters you found.

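(Your Answer here) Under the filtering strategy described above, and based on the results in Evaluating, I chose k=? as the number of clusters. After visualizing the keywords of each cluster, I found the following major topics.

k=0: This cluster is …
k=1: The main keywords of this cluster are … and it belongs to the … topic
k=2:
k=3:
k=?…

I consider the valid topics to be n topics in total, including "XXXX (k=0, 1)", "XXXXX (k=2)", "XXXXX (k=3, 4)", and "". The k=5 cluster has a single keyword with a very high count while its other keywords are sparse, which may be due to ……. The comments containing the relevant keywords are extracted in the DataFrame below; they belong to the … topic.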
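
Extracting the comments that contain a given keyword can be done with a pandas string filter. This is a sketch under stated assumptions: the `cluster` column, the toy rows, and the helper name `comments_with_keyword` are hypothetical, and only `textOriginal` is a column of the dataset.

```python
import pandas as pd

# Toy data; the "cluster" column is assumed to hold the K-means labels.
df = pd.DataFrame({
    "textOriginal": ["支持香港人", "警察暴力", "香港加油", None],
    "cluster": [0, 1, 0, 1],
})

def comments_with_keyword(data, keyword, cluster=None):
    """Rows whose text contains the keyword, optionally within one cluster."""
    # regex=False treats the keyword literally; na=False skips missing text.
    mask = data["textOriginal"].str.contains(keyword, regex=False, na=False)
    if cluster is not None:
        mask &= data["cluster"] == cluster
    return data[mask]

print(comments_with_keyword(df, "香港"))
```

Reading a sample of the matching comments is a quick way to check whether a dominant keyword reflects a real topic or an artifact such as repeated spam.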

Final Modification (if you have)#

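(This part is not graded.) After examining the clustering results above, I removed the XXX keywords and re-ran the clustering. The results consist of the following topics: "XXXX (k=0, 1)", "XXXXX (k=2)", "XXXXX (k=3, 4)", and "".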