AS06 - Tokenization and keywords

  1. Q1~Q5: Each question is worth 15 points

  2. Q6: 25 points

import json
post_list = []

# Open the JSONL file, then read and parse it one line at a time
with open("../_build/html/data/2020-01-02.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        try:
            # Parse the line into a dict and append it to the list
            post = json.loads(line)
            post_list.append(post)
        except json.JSONDecodeError as e:
            # Report lines that are not valid JSON and keep going
            print(f"Error parsing JSON: {str(e)}")
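import pandas as pd
raw = pd.DataFrame(post_list)
# .copy() makes post_df an independent frame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
post_df = raw[['canonical_url', 'title', 'author', 'published_at', 'text']].copy()
display(post_df.shape)
post_df.head()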
(118, 5)
  | canonical_url | title | author | published_at | text
0 | https://www.ptt.cc/bbs/Gossiping/M.1577894589.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | KKB | 2020-01-02T00:03:06 | 這館長直播重播\nhttps://youtu.be/Lv3JIt5GKVA?t=838\n找...
1 | https://www.ptt.cc/bbs/Gossiping/M.1577894641.... | [問卦] 台中和高雄的跨年晚會誰比較尷尬? | SuperIcon | 2020-01-02T00:03:59 | 台中后里跨年,\n\n吳宗憲演唱時居然有一個男大生騎車衝上舞台,\n\n當下也沒有任何工作人...
2 | https://www.ptt.cc/bbs/Gossiping/M.1577896480.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | generally | 2020-01-02T00:34:38 | 說真的,高嘉瑜很愛舔柯,風向仔一個\n我本來就不喜歡她,柯糞要投挺韓仔李彥秀喔,\n\n那剛...
3 | https://www.ptt.cc/bbs/Gossiping/M.1577896538.... | [問卦] 本周要請前面特休還是後面特休比較好的卦 | Emerson158 | 2020-01-02T00:35:34 | 全世界的人都知道,\n\n今年元旦是星期三,\n\n然後大部分的人都知道,元旦要放假.\n\...
4 | https://www.ptt.cc/bbs/Gossiping/M.1577898259.... | Re: [爆卦] 館長直播爆料柯師傅執行力強大 | mballen | 2020-01-02T01:04:17 | 唉,這種事也要炒\n\n柯糞=蟑螂=1450=五毛\n\n整天在ptt炒作這種無聊話題\n\...

Q1. Remove all English words and numbers

  1. Write a function that removes all English words and numbers from the text column

  2. Use .apply() to apply the function to the text column, and save the result to a new column named text_cleaned

# Your code should be here
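
One possible starting point for Q1 is a regular expression that deletes runs of ASCII letters and digits. This is only a sketch: the function name remove_english_and_numbers and the sample string are made up for illustration, and full-width letters or digits are deliberately not handled.

```python
import re

# Runs of ASCII letters or digits. Full-width forms such as "123"
# are NOT matched; that is a simplifying assumption of this sketch.
ENGLISH_OR_DIGIT = re.compile(r"[A-Za-z0-9]+")

def remove_english_and_numbers(text):
    """Return text with every run of ASCII letters and digits removed."""
    return ENGLISH_OR_DIGIT.sub("", text)

# Hypothetical sample string, not taken from the dataset
sample = "今年元旦是星期三, 2020 happy new year!"
cleaned = remove_english_and_numbers(sample)
print(cleaned)
```

In the notebook this would likely be applied as post_df["text_cleaned"] = post_df["text"].apply(remove_english_and_numbers). URLs in the posts lose their letters and digits at this step, and the leftover punctuation such as "://" is dealt with in Q2.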

Q2. Remove all punctuation and spaces

  1. Write a function that removes all punctuation and whitespace from the text_cleaned column

  2. Use .apply() to apply the function to the text_cleaned column

# Your code should be here
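
A minimal sketch for Q2, assuming that "punctuation and spaces" means every character Python's regex engine does not classify as a word character, plus the underscore. The function name and sample string are hypothetical.

```python
import re

# \W matches any non-word character (ASCII and full-width punctuation,
# spaces, newlines); "_" is added because \w treats it as a word character.
NON_WORD = re.compile(r"[\W_]+")

def remove_punctuation_and_spaces(text):
    """Return text with punctuation, underscores and whitespace removed."""
    return NON_WORD.sub("", text)

# Hypothetical sample string, not taken from the dataset
sample = "今年元旦是星期三,\n\n然後 大部分的人都知道。"
compact = remove_punctuation_and_spaces(sample)
print(compact)
```

Assuming the result should overwrite the column so that Q3 tokenizes the fully cleaned text, the call could be post_df["text_cleaned"] = post_df["text_cleaned"].apply(remove_punctuation_and_spaces).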

Q3. Tokenization by jieba

  • Tokenize the text_cleaned column with jieba and save the result to a new column named tokens in post_df

# Your code should be here
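
The tokenization step might look like the sketch below, which uses jieba.lcut to get a list of tokens for each string. The wrapper name tokenize is hypothetical, and the character-level fallback exists only so the sketch still runs where jieba is not installed; it is not a substitute for word segmentation.

```python
try:
    import jieba  # third-party package: pip install jieba
    segment = jieba.lcut  # returns a list of token strings
except ImportError:
    # Fallback for this sketch only: split into single characters.
    segment = list

def tokenize(text):
    """Segment a cleaned Chinese string into a list of non-empty tokens."""
    return [tok for tok in segment(text) if tok.strip()]

# Hypothetical sample string, not taken from the dataset
tokens = tokenize("今年元旦是星期三")
print(tokens)
```

With the notebook's DataFrame, the new column could be created with post_df["tokens"] = post_df["text_cleaned"].apply(tokenize).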

Q4. Remove all Chinese stop words

  1. Write a function that removes all Chinese stop words from the tokens column

  2. Use .apply() to apply the function to the tokens column, and save the result to a new column named cleaned_tokens

# Your code should be here
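
Stop-word removal can be sketched as a set lookup inside a list comprehension. The tiny STOP_WORDS set below is a hypothetical stand-in; the assignment does not say which stop-word list to use, so a real run would presumably load a fuller list from a file.

```python
# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"的", "是", "了", "也", "在", "就", "都", "我", "然後"}

def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, keeping their order."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Hypothetical token list, as Q3 might produce for one post
sample_tokens = ["今年", "元旦", "是", "星期三"]
cleaned_tokens = remove_stop_words(sample_tokens)
print(cleaned_tokens)
```

A set is used so that each membership test takes constant time. In the notebook the call would be along the lines of post_df["cleaned_tokens"] = post_df["tokens"].apply(remove_stop_words).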

Q5. Calculate word frequency and print the top 30 words

  1. Select the cleaned_tokens column and calculate the word frequency using Counter from the collections module.

  2. Print the top 30 words and their frequency

# Your code should be here
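
One way to count words across all posts is to flatten the per-post token lists and pass them to Counter. The token_lists value below is made-up sample data standing in for the cleaned_tokens column.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for post_df["cleaned_tokens"]: one token list per post
token_lists = [
    ["元旦", "放假", "特休"],
    ["元旦", "跨年", "晚會"],
    ["跨年", "元旦"],
]

# Flatten all lists into one stream of tokens and count them
word_counts = Counter(chain.from_iterable(token_lists))

# most_common(30) returns at most 30 (word, frequency) pairs, highest first
top_words = word_counts.most_common(30)
for word, freq in top_words:
    print(word, freq)
```

With the real data, Counter(chain.from_iterable(post_df["cleaned_tokens"])) should give the same kind of result, since iterating over the column yields one list per post.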

Q6. Retrieve keywords from posts

  1. Describe the task to ChatGPT, and ask it how to retrieve keywords from the tokens of each post using the TF-IDF algorithm

  2. Add a new column named keywords that holds the keywords extracted from the cleaned_tokens column using TF-IDF. Each cell in the cleaned_tokens column should be a list of strings, and each cell in keywords should also be a list of strings.

# Your code should be here
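
To make the TF-IDF idea concrete, here is a hand-rolled sketch using only the standard library. It assumes the plain textbook weighting, tf = count / post length and idf = ln(N / document frequency), which differs from smoothed variants such as scikit-learn's default. The function name tfidf_keywords, the top_k parameter and the sample data are all hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(token_lists, top_k=5):
    """Return, for each token list, its top_k tokens ranked by TF-IDF."""
    n_docs = len(token_lists)

    # Document frequency: number of posts containing each token at least once
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        # Unsmoothed TF-IDF; tokens found in every post score 0
        scores = {
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Highest score first; ties are broken by the token itself
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords.append(ranked[:top_k])
    return keywords

# Hypothetical stand-in for post_df["cleaned_tokens"].tolist()
docs = [
    ["元旦", "放假", "特休", "特休"],
    ["元旦", "跨年", "晚會"],
    ["元旦", "跨年", "直播"],
]
keywords = tfidf_keywords(docs, top_k=2)
print(keywords)
```

In the notebook, post_df["keywords"] = tfidf_keywords(post_df["cleaned_tokens"].tolist()) would keep each cell as a list of strings, as the task requires. A library implementation may be preferable on larger corpora; this version is meant to show the calculation rather than to be fast.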