AS06 - Tokenization and keywords#
Q1~Q5: Each question is worth 15 points
Q6: 25 points
import json

post_list = []

# Open the JSONL file, read it line by line, and parse each line as JSON
with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            # Parse the JSON object and append it to the list
            post = json.loads(line)
            post_list.append(post)
        except json.JSONDecodeError as e:
            # Handle parsing errors
            print(f"Error while parsing JSON: {str(e)}")
import pandas as pd

# Build a DataFrame from the parsed posts and keep only the columns we need
raw = pd.DataFrame(post_list)
post_df = raw[['canonical_url', 'title', 'author', 'published_at', 'text']]
display(post_df.shape)
post_df.head()
(118, 5)
|   | canonical_url | title | author | published_at | text |
|---|---|---|---|---|---|
| 0 | https://www.ptt.cc/bbs/Gossiping/M.1577894589.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | KKB | 2020-01-02T00:03:06 | 這館長直播重播\nhttps://youtu.be/Lv3JIt5GKVA?t=838\n找... |
| 1 | https://www.ptt.cc/bbs/Gossiping/M.1577894641.... | [問卦] 台中和高雄的跨年晚會誰比較尷尬? | SuperIcon | 2020-01-02T00:03:59 | 台中后里跨年,\n\n吳宗憲演唱時居然有一個男大生騎車衝上舞台,\n\n當下也沒有任何工作人... |
| 2 | https://www.ptt.cc/bbs/Gossiping/M.1577896480.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | generally | 2020-01-02T00:34:38 | 說真的,高嘉瑜很愛舔柯,風向仔一個\n我本來就不喜歡她,柯糞要投挺韓仔李彥秀喔,\n\n那剛... |
| 3 | https://www.ptt.cc/bbs/Gossiping/M.1577896538.... | [問卦] 本周要請前面特休還是後面特休比較好的卦 | Emerson158 | 2020-01-02T00:35:34 | 全世界的人都知道,\n\n今年元旦是星期三,\n\n然後大部分的人都知道,元旦要放假.\n\... |
| 4 | https://www.ptt.cc/bbs/Gossiping/M.1577898259.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | mballen | 2020-01-02T01:04:17 | 唉,這種事也要炒\n\n柯糞=蟑螂=1450=五毛\n\n整天在ptt炒作這種無聊話題\n\... |
Q1 Replacing all English words and numbers#
- Write a function to remove all English words and numbers in the `text` column
- Use `.apply()` to apply the function to the `text` column, and save the result to a new column named `text_cleaned`
# Your code should be here
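One possible sketch for Q1, assuming a regex-based approach; the function name `remove_en_and_num` is illustrative, not required by the assignment:

```python
import re

def remove_en_and_num(text):
    # Replace every run of English letters or digits with an empty string
    return re.sub(r"[A-Za-z0-9]+", "", text)

# Apply the function to the `text` column and keep the result in `text_cleaned`
post_df['text_cleaned'] = post_df['text'].apply(remove_en_and_num)
```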
Q2 Replacing all punctuation and spaces#

- Write a function to remove all punctuation marks and spaces in the `text_cleaned` column
- Use `.apply()` to apply the function to the `text_cleaned` column
# Your code should be here
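A sketch for Q2; here "punctuation and spaces" is interpreted as every non-word character, which covers ASCII punctuation, full-width Chinese punctuation, spaces, and newlines. This is one of several reasonable definitions:

```python
import re

def remove_punct_and_spaces(text):
    # \W matches anything that is not a word character; Chinese characters
    # count as word characters under Python's Unicode-aware regex, so they are kept
    return re.sub(r"\W+", "", text)

post_df['text_cleaned'] = post_df['text_cleaned'].apply(remove_punct_and_spaces)
```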
Q3. Tokenization by jieba#

- Tokenize the `text_cleaned` column with jieba and save the result to a new column named `tokens` in `post_df`
# Your code should be here
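One way to tokenize with jieba (the library needs to be installed first, e.g. `pip install jieba`); `jieba.lcut` returns the tokens of a string as a list:

```python
import jieba

# Cut each cleaned text into a list of tokens
post_df['tokens'] = post_df['text_cleaned'].apply(jieba.lcut)
```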
Q4. Remove all Chinese stop words#
Write a function to Remove all Chinese stop words in the
tokens
columnUsing
.apply()
to apply the function to thetokens
column, and save the result to a new column namedcleaned_tokens
# Your code should be here
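A sketch for Q4, assuming the stop word list is a plain-text file with one word per line; the filename `stopwords_zh.txt` is a placeholder for whatever list the course provides:

```python
# Load the stop word list into a set for fast lookup (path is an assumption)
with open("stopwords_zh.txt", "r", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f if line.strip())

def remove_stopwords(tokens):
    # Keep tokens that are non-empty and not in the stop word set
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]

post_df['cleaned_tokens'] = post_df['tokens'].apply(remove_stopwords)
```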
Q5. Calculate word frequency and print the top 30 words#
- Select the `cleaned_tokens` column and calculate the word frequency using the `Counter()` function
- Print the top 30 words and their frequencies
# Your code should be here
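A sketch for Q5 using `collections.Counter`:

```python
from collections import Counter

# Count every token across all posts, then show the 30 most frequent words
word_freq = Counter(tok for tokens in post_df['cleaned_tokens'] for tok in tokens)
for word, freq in word_freq.most_common(30):
    print(word, freq)
```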
Q6. Retrieve keywords from posts#
- Describe the task to ChatGPT and ask it how to retrieve keywords from each post's tokens with the TF-IDF algorithm
- Add a new column named `keywords` containing the keywords extracted from the `cleaned_tokens` column by the TF-IDF calculation. Each cell of `cleaned_tokens` is a list of strings, and each cell of `keywords` should also be a list of strings
# Your code should be here
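One possible sketch for Q6 using scikit-learn's `TfidfVectorizer`: the tokens are re-joined with spaces so the vectorizer can treat each post as a pre-tokenized document, and the cutoff of 10 keywords per post is an arbitrary illustrative choice, not something the assignment specifies:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Re-join tokens so the vectorizer sees one space-separated string per post
docs = post_df['cleaned_tokens'].apply(" ".join)

# Allow single-character tokens (the default token pattern drops them)
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

def top_keywords(row_index, k=10):
    # Return the k terms with the highest TF-IDF scores for this post
    row = tfidf[row_index].toarray().ravel()
    top_idx = row.argsort()[::-1][:k]
    return [terms[i] for i in top_idx if row[i] > 0]

post_df['keywords'] = [top_keywords(i) for i in range(tfidf.shape[0])]
```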
Print for verification#
- Print the first 5 rows of `post_df` to verify the result
- Select only the `text_cleaned`, `cleaned_tokens`, and `keywords` columns to print
# Your code should be here
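A sketch of the verification step:

```python
# Show only the requested columns for the first five posts
post_df[['text_cleaned', 'cleaned_tokens', 'keywords']].head()
```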