AS06 - Tokenization and keywords
Q1~Q5: Each question is worth 15 points
Q6: 25 points
import json
post_list = []

# Open the JSONL file, then read and parse it one line at a time
with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            # Parse the JSON object and append it to the list
            post = json.loads(line)
            post_list.append(post)
        except json.JSONDecodeError as e:
            # Handle parse errors
            print(f"Error parsing JSON: {str(e)}")
import pandas as pd
raw = pd.DataFrame(post_list)
post_df = raw[['canonical_url', 'title', 'author', 'published_at', 'text']]
display(post_df.shape)
post_df.head()
(118, 5)
| | canonical_url | title | author | published_at | text |
|---|---|---|---|---|---|
| 0 | https://www.ptt.cc/bbs/Gossiping/M.1577894589.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | KKB | 2020-01-02T00:03:06 | 這館長直播重播\nhttps://youtu.be/Lv3JIt5GKVA?t=838\n找... |
| 1 | https://www.ptt.cc/bbs/Gossiping/M.1577894641.... | [問卦] 台中和高雄的跨年晚會誰比較尷尬? | SuperIcon | 2020-01-02T00:03:59 | 台中后里跨年,\n\n吳宗憲演唱時居然有一個男大生騎車衝上舞台,\n\n當下也沒有任何工作人... |
| 2 | https://www.ptt.cc/bbs/Gossiping/M.1577896480.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | generally | 2020-01-02T00:34:38 | 說真的,高嘉瑜很愛舔柯,風向仔一個\n我本來就不喜歡她,柯糞要投挺韓仔李彥秀喔,\n\n那剛... |
| 3 | https://www.ptt.cc/bbs/Gossiping/M.1577896538.... | [問卦] 本周要請前面特休還是後面特休比較好的卦 | Emerson158 | 2020-01-02T00:35:34 | 全世界的人都知道,\n\n今年元旦是星期三,\n\n然後大部分的人都知道,元旦要放假.\n\... |
| 4 | https://www.ptt.cc/bbs/Gossiping/M.1577898259.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | mballen | 2020-01-02T01:04:17 | 唉,這種事也要炒\n\n柯糞=蟑螂=1450=五毛\n\n整天在ptt炒作這種無聊話題\n\... |
Q1. Replacing all English words and numbers
Write a function that removes all English words and numbers from the `text` column.
Use `.apply()` to apply the function to the `text` column, and save the result to a new column named `text_cleaned`.
# Your code should be here
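One possible starting point, offered as a sketch rather than the expected answer: a regular expression that deletes runs of ASCII letters and digits. The function name `remove_english_and_numbers` is an illustrative choice, and full-width letters or digits are assumed not to need handling.

```python
import re

# Runs of ASCII letters and digits (assumption: full-width forms are not targeted)
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")

def remove_english_and_numbers(text):
    """Return text with every run of ASCII letters and digits removed."""
    return ASCII_ALNUM.sub("", text)

print(remove_english_and_numbers("這是PTT的2020年文章"))
```

With the DataFrame from above, the function could be applied as `post_df["text_cleaned"] = post_df["text"].apply(remove_english_and_numbers)`. Punctuation left behind by URLs is not touched at this step.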
Q2. Replacing all punctuation and spaces
Write a function that removes all punctuation and spaces from the `text_cleaned` column.
Use `.apply()` to apply the function to the `text_cleaned` column.
# Your code should be here
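A minimal sketch of one way to do this, assuming that "punctuation and spaces" can be approximated as "anything that is not a Unicode letter or digit". In Python's `re` module, `\W` on `str` patterns is Unicode-aware, so it also covers full-width Chinese punctuation and newlines; the underscore is added explicitly because `\w` includes it. The function name is illustrative.

```python
import re

# \W matches non-word characters (punctuation, whitespace, symbols);
# "_" is listed separately because \w treats it as a word character
NON_WORD = re.compile(r"[\W_]+")

def remove_punctuation_and_spaces(text):
    """Return text with punctuation, whitespace, and underscores removed."""
    return NON_WORD.sub("", text)

print(remove_punctuation_and_spaces("台中后里跨年,\n\n吳宗憲 演唱時!"))
```

A possible usage is `post_df["text_cleaned"] = post_df["text_cleaned"].apply(remove_punctuation_and_spaces)`, which overwrites the column in place.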
Q3. Tokenization by jieba
Tokenize the `text_cleaned` column with jieba and store the result in a new column named `tokens` in `post_df`.
# Your code should be here
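The following is a hedged sketch of a tokenizer wrapper. By default it calls `jieba.lcut`, which returns a list of tokens in jieba's precise mode and requires the third-party `jieba` package. The `cut` parameter is a hypothetical addition, not part of the assignment, that lets a different segmenter be swapped in, for example to try the function without jieba installed.

```python
def tokenize(text, cut=None):
    """Split text into a list of word tokens.

    cut is any callable mapping str -> list of str. When omitted,
    jieba.lcut (precise mode) is used, which needs the jieba package.
    """
    if cut is None:
        import jieba  # third-party dependency: pip install jieba
        cut = jieba.lcut
    # Drop empty or whitespace-only tokens
    return [tok for tok in cut(text) if tok.strip()]

# Character-level split as a stand-in segmenter (illustration only)
print(tokenize("跨年晚會", cut=list))
```

With jieba available, `post_df["tokens"] = post_df["text_cleaned"].apply(tokenize)` would fill the new column with one list of strings per post.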
Q4. Remove all Chinese stop words
Write a function that removes all Chinese stop words from the `tokens` column.
Use `.apply()` to apply the function to the `tokens` column, and save the result to a new column named `cleaned_tokens`.
# Your code should be here
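A small sketch of the filtering step. The `STOP_WORDS` set below is a tiny hand-picked sample for illustration only. It is an assumption, not the list the assignment intends, and a real run would load a complete stop word list from a file.

```python
# Illustrative sample only; replace with a full stop word list in practice
STOP_WORDS = {"的", "了", "是", "在", "也", "就", "都", "我", "你", "他"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in stop_words, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

print(remove_stop_words(["我", "喜歡", "的", "跨年", "晚會"]))
```

Using a `set` keeps each membership check fast. A possible usage is `post_df["cleaned_tokens"] = post_df["tokens"].apply(remove_stop_words)`.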
Q5. Calculate word frequency and print the top 30 words
Select the `cleaned_tokens` column and calculate the word frequencies using the `Counter()` function.
Print the top 30 words and their frequencies.
# Your code should be here
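One way this could look, as a sketch: feed every token list into a single `collections.Counter` and ask for the most common entries. The helper name `top_words` and the sample token lists are illustrative.

```python
from collections import Counter

def top_words(token_lists, n=30):
    """Count tokens across all lists and return the n most frequent as (word, count) pairs."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return counter.most_common(n)

# Toy input standing in for the cleaned_tokens column
for word, freq in top_words([["跨年", "晚會"], ["跨年"]]):
    print(word, freq)
```

Because iterating over a pandas Series yields its cell values, `top_words(post_df["cleaned_tokens"])` should work directly on the column.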
Q6. Retrieve keywords from posts
Describe the question to ChatGPT and ask it how to retrieve keywords from the tokens of each post using the TF-IDF algorithm.
Add a new column called `keywords` whose values are extracted from the `cleaned_tokens` column using TF-IDF. Each cell in the `cleaned_tokens` column should already be a list of strings, and each cell in `keywords` should also be a list of strings.
# Your code should be here
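To make the idea concrete, here is a hand-rolled sketch of TF-IDF keyword extraction using only the standard library. It uses the plain textbook weighting, term frequency times `log(N / df)`, which is an assumption about the intended variant. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their scores will differ. The function name, the `top_k` parameter, and the tie-breaking rule are all illustrative choices.

```python
import math
from collections import Counter

def extract_keywords(token_lists, top_k=5):
    """Return, for each token list, its top_k tokens ranked by TF-IDF score."""
    docs = list(token_lists)
    n_docs = len(docs)

    # Document frequency: number of posts in which each token appears
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        # TF = count / post length, IDF = log(N / document frequency)
        scores = {
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Highest score first; ties broken by token for a stable order
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords.append(ranked[:top_k])
    return keywords

docs = [["柯", "館長", "直播"], ["柯", "跨年", "跨年"], ["柯", "特休"]]
print(extract_keywords(docs, top_k=1))
```

A token that appears in every post, such as "柯" in the toy data, gets an IDF of zero and sinks to the bottom of the ranking. The result is aligned with the input order, so it could be attached with `post_df["keywords"] = pd.Series(extract_keywords(post_df["cleaned_tokens"]), index=post_df.index)`.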
Print for verification
Print the first 5 rows of `post_df` to verify the result.
Select only the `text_cleaned`, `cleaned_tokens`, and `keywords` columns to print.
# Your code should be here
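A minimal sketch of the verification step, assuming pandas is installed. The toy DataFrame below stands in for `post_df`, and its rows are invented for illustration. The cleaned text column is assumed to be named `text_cleaned`, as defined in Q1.

```python
import pandas as pd

# Toy stand-in for post_df; the rows are invented for illustration only
post_df = pd.DataFrame({
    "title": ["post A", "post B"],
    "text_cleaned": ["跨年晚會", "館長直播"],
    "cleaned_tokens": [["跨年", "晚會"], ["館長", "直播"]],
    "keywords": [["跨年"], ["館長"]],
})

# Select only the three columns of interest, then take the first 5 rows
preview = post_df[["text_cleaned", "cleaned_tokens", "keywords"]].head(5)
print(preview)
```

Passing a list of column names inside the brackets returns a new DataFrame with just those columns, so the other columns of `post_df` are left untouched.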