作業目的:

這份作業希望能夠讓你熟悉中文文字處理,並執行基本的文字相關分析,再將結果以圖表呈現。過程中會運用到過去幾週影片中的 document-level, word-level text analysis, regular expression, and text mining.

這次的作業使用維基文庫提供的「歷任中華民國總統國慶演說」。之前這堂課曾經以總統就職演說為題目,今年則替換為國慶演說。國內外媒體時常使用演說的內文當作素材,利用文字探勘的技巧寫出報導,以 2020 年的的總統就職演說為例,大家可以參考中央社的蔡總統關心什麼 文字會說話 以及 readr 的 少了「年輕人」多了「防疫」:臺灣歷屆民選總統就職演說字詞分析。國外的則可以參考 “I Have The Best Words.” Here’s How Trump’s First SOTU Compares To All The Others. by BuzzFeed, Word Aanalysis of 2016 Presidential debates - Clinton vs. Trump by Martin Krzywinski, and Trump used words like ‘invasion’ and ‘killer’ to discuss immigrants at rallies 500 times: USA TODAY analysis by USA today.

### 這邊不要動
library(tidyverse)
library(jiebaR)
library(tidytext)
library(lubridate)

df_speech_clean <- read_csv("data/AS07/df_speech_clean.csv")
### 給你看資料長這樣
df_speech_clean %>% glimpse()
#> Rows: 24
#> Columns: 5
#> $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ text      <chr> "大會主席、各位貴賓、各位親愛的父老兄弟姐妹:\n今天是中華民…
#> $ title     <chr> "總統蒞臨中華民國八十六年國慶大會致詞", "總統蒞臨中華民國八…
#> $ date      <date> 1997-10-15, 1998-10-14, 1999-10-13, 2000-10-18, 2001-10-17,…
#> $ president <chr> "李登輝", "李登輝", "李登輝", "陳水扁", "陳水扁", "陳水扁", …

0. 斷詞:

請利用 library(jiebaR) 斷詞,過程中也要保留詞性的欄位。

### your code
### segment
cutter <- worker("tag", stop_word = "data/segment/df_stopword.txt")
vector_word = c("中華民國", "蔡英文", "李登輝", "蔣中正", "蔣經國", "李登輝", "陳水扁", "馬英九")
new_user_word(cutter, words = "data/segment/dict_jieba.txt")
new_user_word(cutter, words = "data/segment/hand.txt")
new_user_word(cutter, words = "data/segment/news.txt")
new_user_word(cutter, words = vector_word)
reg_space <- "%E3%80%80" %>% curl::curl_escape()

### text part
df_speech_seg <-
  df_speech_clean %>% mutate(year = year(date)) %>%
  mutate(text = str_replace_all(text, "台灣|臺灣", "臺灣")) %>%
  mutate(text = str_remove_all(text, "\\n|\\r|\\t|:| | ")) %>%
  mutate(text = str_remove_all(text, reg_space)) %>%
  mutate(text = str_remove_all(text, "[a-zA-Z0-9]+")) %>%
  mutate(text_segment = purrr::map(text, function(x)segment(x, cutter))) %>%
  mutate(text_POS = purrr::map(text_segment, function(x)names(x)))

# df_speech_seg %>% write_rds("data/AS07/df_speech_seg.rds")
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE

1. 整體熱門詞彙:

請先找出所有總統演說當中出現次數最高的 10 個詞彙,接著計算每屆總統演說時,這些詞彙出現的次數,最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code
# df_speech_seg <- read_rds("data/AS07/df_speech_seg.rds")
df_speech_seg_unnest <- df_speech_seg %>%
  unnest(c(text_segment, text_POS))

df_term_seg_count <- df_speech_seg_unnest %>% 
  count(id, year, text_segment, text_POS) %>%
  filter(str_length(text_segment) > 1)

df_seg_count_top <- df_term_seg_count %>% 
  group_by(text_segment, text_POS) %>% summarise(n = sum(n)) %>% 
  arrange(desc(n)) %>% ungroup() %>% filter(! text_segment %in% c("一個", "就是", "今天")) %>%
  slice(1:10) %>% select(text_segment)

plot01 <- df_term_seg_count %>% 
  mutate(year2 = year) %>% 
  mutate(year = str_sub(year, 3, 4)) %>% 
  inner_join(df_seg_count_top) %>%
  filter(! text_segment %in% c("一個", "就是", "今天")) %>%
  mutate(year = fct_reorder(as_factor(str_c("'", year)), year2)) %>%
  ggplot(aes(x = year, y = text_segment, fill = n)) + geom_tile() +
  theme_bw() +
  scale_linetype(guide = "none") +
  scale_fill_gradient(low = "white", high = "red")+
  labs(x= "年份",y= "詞彙", title = "歷屆總統演說 - 總體使用熱詞", fill = "次數") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position="bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26), axis.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"), legend.title = element_text(family = "Noto Sans CJK TC Medium", size=16),
        text = element_text(family = "Noto Sans CJK TC Medium"), axis.text = element_text(family = "Noto Sans CJK TC Medium", size=16))

### your result should be
# 自己畫就好唷

2. 各自熱門詞彙:

請先找出各個總統演說中,出現次數最高的 10 個詞彙,並且將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code
df_president_seg_count <- df_speech_seg_unnest %>% 
  count(president, text_segment, text_POS) %>%
  filter(str_length(text_segment) > 1)

df_president_seg_count_top <- df_president_seg_count %>% group_by(president) %>%
  filter(! text_segment %in% c("一個", "就是", "今天")) %>%
  arrange(president, desc(n)) %>% mutate(rn = row_number()) %>%
  filter(rn <= 10) %>% ungroup() %>%
  group_by(president) %>% arrange(president, n) %>% ungroup() %>%
  mutate(president = fct_relevel(as.factor(president), "李登輝", "陳水扁", "馬英九", "蔡英文")) %>%
  mutate(text_order = str_c(text_segment, president, rn) %>% forcats::fct_inorder()) %>%
  select(president, text_segment, text_order, rn, n)

plot02 <- df_president_seg_count_top %>%
  ggplot(aes(x = text_order, y = n)) + geom_col() +
  facet_wrap(president ~ ., scales = "free") +
  coord_flip() +
  theme_bw() +
  scale_linetype(guide = "none") +
  scale_x_discrete(labels = setNames(as.character(df_president_seg_count_top$text_segment), df_president_seg_count_top$text_order)) +
  scale_fill_gradient(low = "white", high = "red")+
  labs(x= "年份",y= "次數", title = "歷屆總統演說 - 總統各自使用熱詞", fill = "次數") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position="bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26), axis.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"), legend.title = element_text(family = "Noto Sans CJK TC Medium", size=16),
        text = element_text(family = "Noto Sans CJK TC Medium"), axis.text = element_text(family = "Noto Sans CJK TC Medium", size=16))


### your result should be
# 自己畫就好唷

3. TF-IDF:

請先篩掉各個總統演說中出現次數小於 5 的詞彙,接著計算 TF-IDF (不知道這是什麼的話請看老師影片!),最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code
df_president_tfidf <- df_president_seg_count %>% filter(n > 3) %>%
  filter(! text_segment %in% c("一個", "就是", "今天")) %>%
  bind_tf_idf(text_segment, president, n) %>%
  group_by(president) %>% arrange(-tf_idf) %>% 
  slice(1:10) %>% ungroup() %>%
  mutate(president = fct_relevel(as.factor(president), "李登輝", "陳水扁", "馬英九", "蔡英文")) %>%
  mutate(text_segment = fct_reorder(text_segment, tf_idf))

plot03 <- df_president_tfidf %>%
  ggplot(aes(x = text_segment, y = tf_idf)) + geom_col() +
  facet_wrap(president ~ ., scales = "free") +
  coord_flip() +
  theme_bw() +
  labs(x= "年份",y= "tf-idf", title = "歷屆總統演說 - tfidf", fill = "次數") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position="bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26), axis.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"), legend.title = element_text(family = "Noto Sans CJK TC Medium", size=16),
        text = element_text(family = "Noto Sans CJK TC Medium"), axis.text = element_text(family = "Noto Sans CJK TC Medium", size=16))


### your result should be
# 自己畫就好唷

4. 捉對廝殺:

請先留下蔡英文和馬英九的用詞,接著計算兩者用詞數量差異最大各自前十名的詞彙,最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code
df_ying_seg_count <- df_speech_seg_unnest %>% 
  filter(president %in% c("馬英九", "蔡英文")) %>%
  count(president, text_segment) %>%
  filter(str_length(text_segment) > 1)

df_ying_seg_diff <- df_ying_seg_count %>% 
  pivot_wider(names_from = president, values_from = n, values_fill = list(n = 0)) %>%
  mutate(diff_tsai = `蔡英文` - `馬英九`, diff_ma = -diff_tsai)


plot04 <- df_ying_seg_diff %>% arrange(desc(diff_tsai)) %>% 
  filter(! text_segment %in% c("一個", "就是", "今天")) %>%
  slice(1:10) %>%
  select(text_segment, diff = diff_tsai) %>% mutate(president = "蔡英文") %>%
  bind_rows(
    df_ying_seg_diff %>% arrange(desc(diff_ma)) %>% 
      filter(! text_segment %in% c("一個", "就是", "今天")) %>%
      slice(1:10) %>%
      select(text_segment, diff = diff_ma) %>% mutate(president = "馬英九")  
  ) %>%
  mutate(diff2 = if_else(president == "馬英九", -diff, diff)) %>%
  mutate(text_segment = reorder(text_segment, diff2)) %>%
  ggplot(aes(x = diff2, y = text_segment, fill = president)) + geom_col() +
  theme_bw() +
  scale_x_continuous(limits = c(-80, 30)) +
  scale_fill_manual(values = c("#1B9431", "#000095")) +
  labs(x= "使用次數差異",y= "詞彙", title = "雙英對決:馬英九與蔡英文使用次數差異最大詞彙", fill = "總統") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position="bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26), axis.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 26),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"), legend.title = element_text(family = "Noto Sans CJK TC Medium", size=16),
        text = element_text(family = "Noto Sans CJK TC Medium"), axis.text = element_text(family = "Noto Sans CJK TC Medium", size=16))

### your result should be
# 自己畫就好唷

結語

https://collabin.netlify.app/don/tongxinglian-in-samesex-marriage-corpora-2/

一個文字探勘的實例,覺得結尾說得很好

“最近剛好在一堂課的讀本中讀到一句話: Computer-assisted text analysis [is] an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.

這是 Stephen Ramsay 在其文章 “Toward an algorithmic criticism” (2003. Literary and Linguistic Computing, 18(2): 167-174)中所討論的,究竟電腦與演算法在言談分析和文本分析中扮演著什麼樣的角色。他認為演算法的嚴密與正確並不是分析的終點,客觀性與實證主義式的真理也不是所追求的目標,更重要的在於演算法是否讓分析者看見了原本看不見的,從而能針對文本做出更深刻的討論。(後來發現他有寫成一本書:Reading Machines: Toward an Algorithmic Criticism)

在練習用不同的工具分析文本的過程中,好像真的看到了工具所看到的某種世界。”

「讓分析者看見了原本看不見的,從而能針對文本做出更深刻的討論」讚喔!