class: center, middle, inverse, title-slide # Lab09_purrr-basic ## Lab09_map-for-scrape ### 曾子軒 Dennis Tseng ### 台大新聞所 NTU Journalism ### 2021/05/25 --- <style type="text/css"> .remark-slide-content { padding: 1em 1em 1em 1em; font-size: 28px; } .my-one-page-font { padding: 1em 1em 1em 1em; font-size: 20px; /*xaringan::inf_mr()*/ } </style> # 今日重點 - AS08 Review - purrr - Lab09 Review - Lab10 Practice --- # Web Scraping - HTML - 基本架構 - 試錯: 先拿單一網址練習,確定可以再寫回圈 - 迴圈外: 迴圈外放空的 tibble、index(現在在哪)、length(迴圈長度) - 迴圈內: 切分讀 html, 讀每個 nodes, 把讀到的 nodes 合併, index 和訊息, **以及休息** --- # Web Scraping - HTML ```r # test library(tidyverse) library(httr) library(rvest) url_test <- "https://tw.feature.appledaily.com/charity/projlist/1" html <- url %>% curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>% read_html() link <- html %>% html_nodes(".artcatdetails") %>% html_attr("href") link_detail <- html %>% html_nodes(".details") %>% html_attr("href") ``` --- # Web Scraping - HTML ```r # loop p_read_html <- possibly(read_html, otherwise = NULL) df_case_list <- tibble() for(i in 1:10){ url <- str_c("https://tw.feature.appledaily.com/charity/projlist/", i) html <- url %>% curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>% p_read_html() link <- html %>% html_nodes(".artcatdetails") %>% html_attr("href") link_detail <- html %>% html_nodes(".details") %>% html_attr("href") df_case_list_tmp <- dtibble(link = link, link_detail = link_detail) df_case_list <- df_case_list %>% bind_rows(df_case_list_tmp) print(i) Sys.sleep(20) } ``` --- # Web Scraping - HTML - 考慮 - 如果網址掛了怎麼辦? - 缺 nodes 產生空值?長度不一樣? - 如果跑到一半網路斷了怎麼辦? - 怎樣才比較不容易被發現是爬蟲? - 怎樣才比較不會出事被告? - 爬很久怎麼辦? --- # Web Scraping - HTML - 對策 - 寫 `ifelse` 處理網址掛掉 - 不管結果如何都要補空值 - 可以考慮把錯誤存起來 - 列印 index 並儲存當下進度 - 對人家有禮貌一點 - 用 map 爬快點 --- ```r library(polite) session <- bow("https://tw.feature.appledaily.com/charity/projlist/1", force = TRUE) result <- scrape(session) %>% html_node(".artcatdetails") %>% html_text() ``` --- class: inverse, center, middle # [AS08](https://p4css.github.io/R4CSS_TA/AS08_Web-Scraping-HTML_ref.html) --- # purrr - `map(.x, .f, ...)` - Apply a function to each element of a list or atomic vector - 前面放對象,後面放函數 e.g. <img src="photo/Lab09_map_description.png" width="65%" height="65%" /> ref: [數據科學中的 R 語言](https://bookdown.org/wangminjie/R4DS/purrr.html#purrr-1) --- # purrr - 問:底下的 list 要怎麼取每個同學的平均分數? ```r exams <- list( student1 = c(100,80,70), student2 = c(90,60,50), student3 = c(20,90,55) ) exams ``` ``` ## $student1 ## [1] 100 80 70 ## ## $student2 ## [1] 90 60 50 ## ## $student3 ## [1] 20 90 55 ``` --- # purrr ```r # base mean(exams$student1) mean(exams$student2) mean(exams$student3) # purrr exams %>% map(mean) exams %>% map_dbl(mean) exams %>% map_df(mean) ``` --- # purrr - 圖解 <img src="photo/Lab09_map_description02.png" width="65%" height="65%" /> --- # purrr - 換一張圖解 <img src="photo/Lab09_map_01.jpeg" width="45%" height="45%" /> ref: [@_ColinFay](https://twitter.com/_ColinFay/status/1045257504446443520) --- # purrr - 前面放對象,後面放函數 e.g. `map(.x, .f, ...)` - 前面的對象可以是 vector,也可以是 list - 函數可以正規表達,也可以用匿名函數 ```r # .x 放 vector url <- str_c("https://tw.feature.appledaily.com/charity/projdetail/", c("A5135", "A5134", "A5133")) url %>% map(read_html) # .x 放 list exams %>% map(mean) ``` --- # purrr - 前面放對象,後面放函數 e.g. `map(.x, .f, ...)` - 前面的對象可以是 vector,也可以是 list - 函數可以正規表達,也可以用匿名函數 ```r # .f exams %>% map(function(x){(x + 5)^2}) exams %>% map(~(. + 5)^2) exams %>% map(. %>% mean() %>% sqrt()) exams %>% map(. %>% mean() %>% `^`(3)) ``` --- # Anonymous Function 匿名函數 - function - 平常寫函數 - 但為了方便也可以不要寫完整,一次性使用 - `.` 點點代表前面的變數/資料 - 匿名函數的形式 - `~ function(x){sqrt(mean(x))}` - `~ (sqrt(mean(.)))` - `~ . %>% mean() %>% sqrt()` --- # purrr - 有 `map(.x)`, `map2(.x, .y)`, `pmap(.l)` <img src="photo/Lab09_map_03.jpeg" width="45%" height="45%" /><img src="photo/Lab09_map2_02.jpeg" width="45%" height="45%" /> --- # purrr - 有 `map(.x)`, `map2(.x, .y)`, `pmap(.l)` ```r weight <- c(1,1,2) weight_diff <- list(c(1, 1, 2), c(1, 2, 1), c(2, 1, 1)) mean_weight <- function(x, y){ if(length(x)==length(y)){sum(x*y)/sum(y)} else{print("length - bad")} } map(exams, ~mean_weight(x=., y=weight)) map2(exams, weight, mean_weight) ``` --- # purrr - 運用在 web scraping 的好時機 - 一次爬**多個網址**的時候 - 舉例來說,爬文章列表不需要 `map`,但是爬每篇文章的時候就可以用 ```r df_zec_main[j:k,] %>% pull(title_link) %>% str_c("https://www.zeczec.com", .) %>% map(function(x){ x %>% curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>% read_html() }) ``` --- class: inverse, center, middle # [Lab09](https://p4css.github.io/R4CSS_TA/Lab09_Homework_Web-Scraping-HTML_ref.html) --- class: inverse, center, middle # [Lab10](https://p4css.github.io/R4CSS_TA/Lab10_Homework_more-dplyr.html)