AS06 - Tokenization and keywords


Q1~Q5: Each question is worth 15 points
Q6: 25 points

import json
post_list = []

# Open the JSONL file and parse it line by line
with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            # Parse the JSON object and append it to the list
            post = json.loads(line)
            post_list.append(post)
        except json.JSONDecodeError as e:
            # Report lines that fail to parse
            print(f"Error parsing JSON: {e}")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [1], in <cell line: 5>()
      2 post_list = []
      4 # Open the JSONL file and parse it line by line
----> 5 with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
      6     for line in file:
      7         try:
      8             # Parse the JSON object and append it to the list

FileNotFoundError: [Errno 2] No such file or directory: '../_build/html/data/2020-01-02.jsonl'

import pandas as pd
raw = pd.DataFrame(post_list)
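# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
post_df = raw[['canonical_url', 'title', 'author', 'published_at', 'text']].copy()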
display(post_df.shape)
post_df.head()

(118, 5)

	canonical_url	title	author	published_at	text
0	https://www.ptt.cc/bbs/Gossiping/M.1577894589....	Re: [爆卦] 館長直播爆料柯師傅執行力強大	KKB	2020-01-02T00:03:06	這館長直播重播\nhttps://youtu.be/Lv3JIt5GKVA?t=838\n找...
1	https://www.ptt.cc/bbs/Gossiping/M.1577894641....	[問卦] 台中和高雄的跨年晚會誰比較尷尬?	SuperIcon	2020-01-02T00:03:59	台中后里跨年，\n\n吳宗憲演唱時居然有一個男大生騎車衝上舞台，\n\n當下也沒有任何工作人...
2	https://www.ptt.cc/bbs/Gossiping/M.1577896480....	Re: [爆卦] 館長直播爆料柯師傅執行力強大	generally	2020-01-02T00:34:38	說真的，高嘉瑜很愛舔柯，風向仔一個\n我本來就不喜歡她，柯糞要投挺韓仔李彥秀喔，\n\n那剛...
3	https://www.ptt.cc/bbs/Gossiping/M.1577896538....	[問卦] 本周要請前面特休還是後面特休比較好的卦	Emerson158	2020-01-02T00:35:34	全世界的人都知道,\n\n今年元旦是星期三,\n\n然後大部分的人都知道,元旦要放假.\n\...
4	https://www.ptt.cc/bbs/Gossiping/M.1577898259....	Re: [爆卦] 館長直播爆料柯師傅執行力強大	mballen	2020-01-02T01:04:17	唉，這種事也要炒\n\n柯糞=蟑螂=1450=五毛\n\n整天在ptt炒作這種無聊話題\n\...

Q1. Removing all English words and numbers

Write a function that removes all English words and numbers from the text column.
Use .apply() to apply the function to the text column, and save the result to a new column named text_cleaned.

# Your code should be here
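
One possible starting point for Q1 is sketched below. It assumes that "English words and numbers" means runs of ASCII letters and digits; full-width characters and other scripts are left alone. The function name remove_english_and_numbers is a hypothetical choice, not part of the assignment.

```python
import re

# Assumption: "English words and numbers" = runs of ASCII letters and digits.
_EN_NUM = re.compile(r"[A-Za-z0-9]+")


def remove_english_and_numbers(text):
    """Return text with every run of ASCII letters and digits removed."""
    if not isinstance(text, str):
        # Missing values (e.g. NaN) are mapped to an empty string.
        return ""
    return _EN_NUM.sub("", text)


# Hypothetical usage with the notebook's DataFrame:
# post_df["text_cleaned"] = post_df["text"].apply(remove_english_and_numbers)
```

Punctuation from URLs (such as "://") survives this step; it is handled in Q2.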

Q2. Removing all punctuation and spaces

Write a function that removes all punctuation and whitespace from the text_cleaned column.
Use .apply() to apply the function to the text_cleaned column, and store the result back in text_cleaned.

# Your code should be here
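
A minimal sketch for Q2, assuming that everything which is not a Unicode word character counts as punctuation or whitespace. In Python 3 the \W class covers both ASCII and full-width CJK punctuation as well as newlines; the underscore is added explicitly because \w treats it as a word character. The function name is hypothetical.

```python
import re

# \W matches any non-word character (punctuation, spaces, newlines);
# the underscore is listed separately because \w includes it.
_PUNCT_SPACE = re.compile(r"[\W_]+")


def remove_punctuation_and_spaces(text):
    """Return text with punctuation, underscores, and whitespace removed."""
    if not isinstance(text, str):
        return ""
    return _PUNCT_SPACE.sub("", text)


# Hypothetical usage with the notebook's DataFrame:
# post_df["text_cleaned"] = post_df["text_cleaned"].apply(remove_punctuation_and_spaces)
```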

Q3. Tokenization with jieba

Tokenize the text_cleaned column with jieba and add the result to post_df as a new column named tokens.

# Your code should be here
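
The sketch below shows one way to wrap the tokenizer for Q3. It assumes the third-party jieba package is installed and uses jieba.lcut, which returns a list of tokens. The cut parameter is a hypothetical addition that lets any other string-to-tokens callable be swapped in, for example when trying the helper without jieba.

```python
def tokenize(text, cut=None):
    """Split text into a list of non-empty tokens.

    cut is any callable that maps a string to an iterable of tokens.
    When it is omitted, jieba.lcut is used (requires `pip install jieba`).
    """
    if cut is None:
        import jieba  # imported lazily so the helper loads without jieba
        cut = jieba.lcut
    # Drop tokens that are empty or contain only whitespace.
    return [tok for tok in cut(text) if tok.strip()]


# Hypothetical usage with the notebook's DataFrame:
# post_df["tokens"] = post_df["text_cleaned"].apply(tokenize)
```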

Q4. Remove all Chinese stop words

Write a function that removes all Chinese stop words from the tokens column.
Use .apply() to apply the function to the tokens column, and save the result to a new column named cleaned_tokens.

# Your code should be here
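
A minimal sketch for Q4 follows. The assignment does not say which stop word list to use, so STOP_WORDS here is a tiny illustrative set, not a complete list; in practice a full Chinese stop word file would be loaded into the set instead.

```python
# Assumption: a small illustrative stop word set, not a complete list.
STOP_WORDS = {"的", "了", "是", "在", "也", "就", "都", "和"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in stop_words, keeping their order."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical usage with the notebook's DataFrame:
# post_df["cleaned_tokens"] = post_df["tokens"].apply(remove_stop_words)
```

Keeping the stop words in a set makes each membership check constant time, which matters once the list grows to a few thousand entries.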

Q5. Calculate word frequency and print the top 30 words

Select the cleaned_tokens column and calculate word frequencies with collections.Counter.
Print the top 30 words and their frequencies.

# Your code should be here
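
One way to count across all posts for Q5 is sketched below: a single Counter is updated with each post's token list, then most_common returns the highest counts. The helper name top_words is hypothetical.

```python
from collections import Counter


def top_words(token_lists, n=30):
    """Return the n most frequent tokens across all token lists as (token, count) pairs."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)  # adds one count per occurrence
    return counter.most_common(n)


# Hypothetical usage with the notebook's DataFrame:
# for word, freq in top_words(post_df["cleaned_tokens"]):
#     print(word, freq)
```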

Q6. Retrieve keywords from posts

Describe the task to ChatGPT and ask it how to retrieve keywords from the tokens of each post using the TF-IDF algorithm.
Add a new column named keywords, whose values are extracted from the cleaned_tokens column by TF-IDF. Each cell in the cleaned_tokens column should be a list of strings, and each cell in keywords should also be a list of strings.

# Your code should be here
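
For orientation, here is a standard-library sketch of TF-IDF keyword extraction for Q6. It assumes one common weighting, tf = count / document length and idf = ln(N / df) + 1; libraries such as scikit-learn use slightly different smoothing, so scores will not match theirs exactly. The function name and the top_k parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(token_lists, top_k=5):
    """Return, for each document, its top_k tokens ranked by TF-IDF score."""
    docs = [list(tokens) for tokens in token_lists]
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            tok: (cnt / total) * (math.log(n_docs / doc_freq[tok]) + 1.0)
            for tok, cnt in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords.append(ranked[:top_k])
    return keywords


# Hypothetical usage with the notebook's DataFrame:
# post_df["keywords"] = tfidf_keywords(post_df["cleaned_tokens"], top_k=5)
```

A token that appears in every post gets the minimum idf of 1, so frequent but uninformative words rank below words concentrated in a few posts.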

Print for verification

Print the first 5 rows of post_df to verify the result.
Select only the text_cleaned, cleaned_tokens, and keywords columns for printing.

# Your code should be here
