class: center, middle, inverse, title-slide # Lab04_Data-Manipulation-Joining ## Lab04_grouping-joining-tidyr ### 曾子軒 Dennis Tseng ### 台大新聞所 NTU Journalism ### 2022/03/31 --- <style type="text/css"> .remark-slide-content { padding: 1em 1em 1em 1em; font-size: 28px; } .my-one-page-font { padding: 1em 1em 1em 1em; font-size: 20px; /*xaringan::inf_mr()*/ } </style> # 今日重點 - `library(dplyr)` - `library(tidyr)` --- # 今天用的檔案 ```r library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ── ``` ``` ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.4 ✓ dplyr 1.0.7 ## ✓ tidyr 1.1.2 ✓ stringr 1.4.0.9000 ## ✓ readr 1.4.0 ✓ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` ```r df_main_clean <- read_csv("data/Lab04/df_main_clean.csv") %>% select(board, type, title, date, comments) ``` ``` ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## board = col_character(), ## type = col_character(), ## title = col_character(), ## date = col_date(format = ""), ## comments = col_double(), ## author = col_character(), ## text = col_character(), ## IP = col_character(), ## link = col_character(), ## time = col_datetime(format = "") ## ) ``` ```r df_main_clean2 <- read.csv("data/Lab04/df_main_clean.csv") %>% select(board, type, title, date, comments) ``` --- # data importing: 推薦 readr 套件的理由 1. 讀進來預設就是 tibble 2. 函數很聰明,會去猜測每個 column 是什麼 type 3. 預設編碼是 UTF8,可以避免許多問題 4. 讀檔效率較佳,當資料筆數超過十萬會有明顯差異 5. 提供較多元的參數可供挑選 --- # 作業檢討: 推薦 readr 套件的理由 - 遇上中文的時候 - 欄位型態不同 ```r df_main_clean %>% head(5) ``` ``` ## # A tibble: 5 × 5 ## board type title date comments ## <chr> <chr> <chr> <date> <dbl> ## 1 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請… 2021-01-18 20 ## 2 KoreaDrama [新聞] [新聞] 柳惠英有望演出《直到清晨》 2021-01-18 5 ## 3 KoreaDrama [閒聊] [閒聊] Penthouse 上流戰爭第一季(雷) 2021-01-18 101 ## 4 KoreaDrama [新聞] [新聞] 孫錫久有望演出APPLE TV+《Alpha R… 2021-01-18 9 ## 5 KoreaDrama [LIVE] [LIVE] JTBC 前輩,那支口紅不要塗 EP01-E… 2021-01-18 249 ``` ```r df_main_clean2 %>% head(5) ``` ``` ## board type title date ## 1 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ## 2 KoreaDrama [新聞] [新聞] 柳惠英有望演出《直到清晨》 2021-01-18 ## 3 KoreaDrama [閒聊] [閒聊] Penthouse 上流戰爭第一季(雷) 2021-01-18 ## 4 KoreaDrama [新聞] [新聞] 孫錫久有望演出APPLE TV+《Alpha Romeo》 2021-01-18 ## 5 KoreaDrama [LIVE] [LIVE] JTBC 前輩,那支口紅不要塗 EP01-EP02 2021-01-18 ## comments ## 1 20 ## 2 5 ## 3 101 ## 4 9 ## 5 249 ``` --- # dplyr: 修改欄位名稱 - `rename()` - `select()` - `mutate()` ```r df_main_clean %>% rename(ban = board) %>% head(5) df_main_clean %>% select(ban = board) %>% head(5) df_main_clean %>% mutate(ban = board) %>% head(5) ``` --- # dplyr: 選擇欄位 - 懶人的福音,幫助你快速選 column - operator: `:`, `!`, `&`, `|`, `c()` - selection helpers: - specific columns: `everything()`, `last_col()` - matching patterns: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()` --- # dplyr: 選擇欄位 - operator: `:`, `!`, `&`, `|`, `c()` ```r df_main_clean %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## board type title date comments ## <chr> <chr> <chr> <date> <dbl> ## 1 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 20 ``` ```r df_main_clean %>% select(type:date) %>% slice(1) ``` ``` ## # A tibble: 1 × 3 ## type title date ## <chr> <chr> <date> ## 1 [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` ```r df_main_clean %>% select(1:2, 4) %>% slice(1) ``` ``` ## # A tibble: 1 × 3 ## board type date ## <chr> <chr> <date> ## 1 KoreaDrama [公告] 2021-01-18 ``` ```r df_main_clean %>% select(!title) %>% slice(1) ``` ``` ## # A tibble: 1 × 4 ## board type date comments ## <chr> <chr> <date> <dbl> ## 1 KoreaDrama [公告] 2021-01-18 20 ``` --- # dplyr: 選擇欄位 - specific columns: `everything()`, `last_col()` ```r df_main_clean %>% select(comments, everything()) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## comments board type title date ## <dbl> <chr> <chr> <chr> <date> ## 1 20 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` ```r df_main_clean %>% select(-board, everything(), board) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## type title date comments board ## <chr> <chr> <date> <dbl> <chr> ## 1 [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 20 KoreaDr… ``` --- # dplyr: 排序欄位 - specific columns: `everything()`, `last_col()` ```r df_main_clean %>% relocate(comments) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## comments board type title date ## <dbl> <chr> <chr> <chr> <date> ## 1 20 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` ```r df_main_clean %>% relocate(comments, .after = board) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## board comments type title date ## <chr> <dbl> <chr> <chr> <date> ## 1 KoreaDrama 20 [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` ```r df_main_clean %>% relocate(comments, .before = title) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## board type comments title date ## <chr> <chr> <dbl> <chr> <date> ## 1 KoreaDrama [公告] 20 Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` --- # dplyr: 選擇欄位 - specific columns: `everything()`, `last_col()` ```r df_main_clean %>% select(last_col()) %>% slice(1) ``` ``` ## # A tibble: 1 × 1 ## comments ## <dbl> ## 1 20 ``` ```r df_main_clean %>% select(1:last_col()) %>% slice(1) ``` ``` ## # A tibble: 1 × 5 ## board type title date comments ## <chr> <chr> <chr> <date> <dbl> ## 1 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 20 ``` ```r df_main_clean %>% select(1:last_col(1)) %>% slice(1) ``` ``` ## # A tibble: 1 × 4 ## board type title date ## <chr> <chr> <chr> <date> ## 1 KoreaDrama [公告] Fw: [公告] 您有一封未讀取的重要信件 請查閱 2021-01-18 ``` --- # dplyr: 選擇欄位 - matching patterns: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()` ```r df_main_clean %>% select(starts_with("marry")) %>% slice(1) ``` ``` ## # A tibble: 1 × 0 ``` ```r df_main_clean %>% select(starts_with(c("marry_cp", "county"))) %>% slice(1) ``` ``` ## # A tibble: 1 × 0 ``` --- # dplyr: 選擇欄位 - matching patterns: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()` - `contains()` 放字串 ```r df_main_clean %>% select(contains("comm")) %>% slice(1) ``` ``` ## # A tibble: 1 × 1 ## comments ## <dbl> ## 1 20 ``` --- # dplyr: 選擇欄位 - matching patterns: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()` - `matches()` 放正規表示式 ```r df_main_clean %>% select(matches("com.*nt")) %>% slice(1) ``` ``` ## # A tibble: 1 × 1 ## comments ## <dbl> ## 1 20 ``` --- # dplyr: 判斷區間 - `if_else()` and `case_when()` - 常常與 `mutate()` 搭配判斷條件用,`NA` 無法判斷要特別處理 - `if_else(條件, TRUE 的值, FALSE 的值)` - `case_when(條件 ~ 滿足條件的值, 最後有一個類似 else 的東西)` ```r df_main_clean %>% mutate(comments_interval = if_else(comments < 60, "<60", ">=60")) %>% mutate(comments_interval = if_else(comments >= 70 & comments_interval == ">=60", ">=70", comments_interval)) %>% count(comments_interval) ``` --- # dplyr: 判斷區間 - `if_else()` and `case_when()` - 常常與 `mutate()` 搭配判斷條件用,`NA` 無法判斷要特別處理 - `if_else(條件, TRUE 的值, FALSE 的值)` - `case_when(條件 ~ 滿足條件的值, 最後有一個類似 else 的東西)` ```r df_main_clean %>% mutate(comments_interval = case_when( # 條件寫左邊,中間用 ~ 連接,右邊放數值 comments < 60 ~ "<60", comments >= 60 & comments < 70 ~ ">=60", comments >= 70 ~ ">=70", # 最後是 TRUE ~ 值 作結,放 else 的內容 TRUE ~ "others" )) %>% count(comments_interval) ``` --- # dplyr: 取 subset - 取出特定的 row - `slice()` - `row_number()` ```r df_main_clean %>% slice(11:12) df_main_clean %>% filter(row_number() == 10) ``` --- # dplyr: group and summarize - `summarize()` - 產出一個總結後的 dataframe - 若有先 `group_by()`,會產出各組的總結 - `group_by()` - 把 dataframe 變成 grouped dataframe,長相相同 - group 之後使用動詞 - 比較:有 `group_by()` 和沒 `group_by()` --- # dplyr: group and summarize ```r df_main_clean %>% summarise(comments = max(comments, na.rm = T)) ``` ``` ## # A tibble: 1 × 1 ## comments ## <dbl> ## 1 1499 ``` ```r df_main_clean %>% group_by(board) %>% summarise(comments = max(comments, na.rm = T)) ``` ``` ## # A tibble: 4 × 2 ## board comments ## <chr> <dbl> ## 1 China-Drama 825 ## 2 EAseries 1499 ## 3 KoreaDrama 1498 ## 4 TaiwanDrama 1413 ``` --- # dplyr: group and summarize - `ungroup()` - 把 grouped 的狀態消除變回原本的 - 有時候會需要先 `group_by()` 再 `ungroup()` - 底下來看 `summarize()`, `group_by()` and `ungroup()` 的應用 --- # dplyr: group and summarize - 舉例 - 計算各子板、各類型文章數量的**佔比** - 取出**各子板**當中文章佔該板最高的類型 - 取出**不分子板**當中文章佔該板最高的類型 --- # dplyr: group and summarize - 舉例 - 計算各子板、各類型文章數量的**佔比** ```r df_main_clean %>% group_by(board, type) %>% summarise(n = n()) %>% mutate(per = n/sum(n)) ``` ``` ## # A tibble: 60 × 4 ## # Groups: board [4] ## board type n per ## <chr> <chr> <int> <dbl> ## 1 China-Drama [Live] 45 0.125 ## 2 China-Drama [公告] 4 0.0111 ## 3 China-Drama [創作] 9 0.0249 ## 4 China-Drama [問題] 33 0.0914 ## 5 China-Drama [心得] 128 0.355 ## 6 China-Drama [情報] 12 0.0332 ## 7 China-Drama [新聞] 9 0.0249 ## 8 China-Drama [板務] 1 0.00277 ## 9 China-Drama [求薦] 7 0.0194 ## 10 China-Drama [討論] 1 0.00277 ## # … with 50 more rows ``` --- # group and summarize - 舉例 - 取出**各子板**當中文章佔該板最高的類型 ```r df_main_clean %>% group_by(board, type) %>% summarise(n = n()) %>% mutate(per = n/sum(n)) %>% filter(per == max(per)) ``` ``` ## # A tibble: 4 × 4 ## # Groups: board [4] ## board type n per ## <chr> <chr> <int> <dbl> ## 1 China-Drama [心得] 128 0.355 ## 2 EAseries [新聞] 123 0.321 ## 3 KoreaDrama [心得] 98 0.266 ## 4 TaiwanDrama [LIVE] 105 0.305 ``` --- # group and summarize - 舉例 - 取出**不分子板**當中文章佔該板最高的類型 ```r df_main_clean %>% group_by(board, type) %>% summarise(n = n()) %>% ungroup() %>% mutate(per = n/sum(n)) %>% filter(per == max(per)) ``` ``` ## # A tibble: 1 × 4 ## board type n per ## <chr> <chr> <int> <dbl> ## 1 China-Drama [心得] 128 0.0879 ``` --- # group and summarize - 舉例 - 取出**各子板**當中文章佔該板**次高**的類型 ```r df_main_clean %>% group_by(board, type) %>% summarise(n = n()) %>% mutate(per = n/sum(n)) %>% arrange(board, desc(per)) %>% filter(row_number() == 2) ``` ``` ## # A tibble: 4 × 4 ## # Groups: board [4] ## board type n per ## <chr> <chr> <int> <dbl> ## 1 China-Drama [閒聊] 110 0.305 ## 2 EAseries [心得] 109 0.285 ## 3 KoreaDrama [LIVE] 87 0.236 ## 4 TaiwanDrama [心得] 57 0.166 ``` --- # group and summarize - 舉例 - 取出**不分子板**當中文章佔該板**次高**的類型 ```r df_main_clean %>% group_by(board, type) %>% summarise(n = n()) %>% ungroup() %>% mutate(per = n/sum(n)) %>% arrange(desc(per)) %>% filter(row_number() == 2) ``` ``` ## # A tibble: 1 × 4 ## board type n per ## <chr> <chr> <int> <dbl> ## 1 EAseries [新聞] 123 0.0844 ``` --- # dplyr: group and summarize - 算不重複的數量 - `distinct()` & `n_distinct()` - 算出**各子板**有多少不重複發文天數 ```r df_main_clean %>% group_by(board) %>% summarise(n = n_distinct(date)) ``` ``` ## # A tibble: 4 × 2 ## board n ## <chr> <int> ## 1 China-Drama 81 ## 2 EAseries 48 ## 3 KoreaDrama 47 ## 4 TaiwanDrama 67 ``` --- # dplyr: group and summarize - 算不重複的數量 - `distinct()` & `n_distinct()` - 算出**各子板**有多少不重複發文天數 ```r df_main_clean %>% distinct(board, date) %>% count(board) ``` ``` ## # A tibble: 4 × 2 ## board n ## <chr> <int> ## 1 China-Drama 81 ## 2 EAseries 48 ## 3 KoreaDrama 47 ## 4 TaiwanDrama 67 ``` --- # tidy data - Pivoting - `pivot_longer()` 把資料變成長表格 - `cols` = 放進去的欄位, `names_to` = 名稱欄位叫做什麼, `values_to` = 值欄位叫做什麼 - 多注意一個參數 `values_drop_na` 預設為 FALSE - `pivot_wider()` 把資料變成寬表格 - `id_cols` = 不要動的欄位, `names_from` = 名稱來自哪, `values_from` = 值來自哪 - 多注意一個參數 `values_fill` --- # tidy data - df_main_agg 是一個 wide data ```r df_main_agg <- df_main_clean %>% group_by(board) %>% summarise(article = n(), comments = sum(comments)) df_main_agg ``` ``` ## # A tibble: 4 × 3 ## board article comments ## <chr> <int> <dbl> ## 1 China-Drama 361 38491 ## 2 EAseries 383 12397 ## 3 KoreaDrama 369 61846 ## 4 TaiwanDrama 344 24864 ``` --- # tidy data - 把它變長 ```r df_main_agg_long <- df_main_agg %>% pivot_longer(cols = -board, names_to = "type", values_to = "n") df_main_agg_long ``` ``` ## # A tibble: 8 × 3 ## board type n ## <chr> <chr> <dbl> ## 1 China-Drama article 361 ## 2 China-Drama comments 38491 ## 3 EAseries article 383 ## 4 EAseries comments 12397 ## 5 KoreaDrama article 369 ## 6 KoreaDrama comments 61846 ## 7 TaiwanDrama article 344 ## 8 TaiwanDrama comments 24864 ``` --- # tidy data - 把它變寬 ```r df_main_agg_wide <- df_main_agg_long %>% pivot_wider(id = board, names_from = type, values_from = n, values_fill = list(n = 0)) df_main_agg_wide ``` ``` ## # A tibble: 4 × 3 ## board article comments ## <chr> <dbl> <dbl> ## 1 China-Drama 361 38491 ## 2 EAseries 383 12397 ## 3 KoreaDrama 369 61846 ## 4 TaiwanDrama 344 24864 ``` --- # tidy data - Splitting and Combining - `separate()` 把一個欄位切開(split)成多個欄位 - `unite()` 把多個欄位合併(combine)成多個欄位 - `col` = 要動的欄位, `into` = 要變成什麼名字, `sep` = 切分的符號, `remove` = 是否要保留原本的欄位 --- # tidy data - 把它合併 ```r df_main_sep <- df_main_agg %>% unite(col = "metric", 2:3, sep = "-") df_main_sep ``` ``` ## # A tibble: 4 × 2 ## board metric ## <chr> <chr> ## 1 China-Drama 361-38491 ## 2 EAseries 383-12397 ## 3 KoreaDrama 369-61846 ## 4 TaiwanDrama 344-24864 ``` --- # tidy data - 把它切開 ```r df_main_uni <- df_main_sep %>% separate(col = metric, into = c("article", "comments")) df_main_uni ``` ``` ## # A tibble: 4 × 3 ## board article comments ## <chr> <chr> <chr> ## 1 China-Drama 361 38491 ## 2 EAseries 383 12397 ## 3 KoreaDrama 369 61846 ## 4 TaiwanDrama 344 24864 ```