這份作業希望能夠讓你熟悉於資料處理(data manipulation),利用這兩週所學的 filter, mutate, select, group by, summarize, join, pivot 等動詞完成任務。請一律印出 tibble,你可以利用 as_tibble()
將 dataframe 轉換成 tibble,不過按照預設的 code 應該都會是 tibble,另外印出結果不用特別打 print()
喔。作業的滿分為 100 分。
老師在影片中示範怎麼處理各村里的婚姻狀況,清理後再與公投資料串接,這次的作業則是要處理教育資料,資料中包含全台各村里 15歲以上現住人口 的性別、年齡、教育程度統計。請一定要先看影片再來寫!
### 這邊不要動
library(tidyverse)
<- read_csv("data/AS04/opendata107Y020.csv")
df_education_raw ### 給你看資料長這樣
%>% glimpse() df_education_raw
#> Rows: 7,761
#> Columns: 51
#> $ statistic_yyy <chr> "統計年", "107", "107", "107"…
#> $ district_code <chr> "區域別代碼", "65000010001", "6…
#> $ site_id <chr> "區域別", "新北市板橋區", "新北市板橋區",…
#> $ village <chr> "村里名稱", "留侯里", "流芳里", "赤松里…
#> $ edu_age_15up_total <chr> "總計", "1431", "1337", "707…
#> $ edu_doctor_graduated_m <chr> "博畢_男", "4", "7", "5", "2"…
#> $ edu_doctor_graduated_f <chr> "博畢_女", "0", "3", "1", "2"…
#> $ edu_doctor_ungraduated_m <chr> "博肄_男", "1", "2", "0", "3"…
#> $ edu_doctor_ungraduated_f <chr> "博肄_女", "2", "2", "0", "0"…
#> $ edu_master_graduated_m <chr> "碩畢_男", "52", "48", "31", …
#> $ edu_master_graduated_f <chr> "碩畢_女", "31", "39", "32", …
#> $ edu_master_ungraduated_m <chr> "碩肄_男", "12", "16", "6", "…
#> $ edu_master_ungraduated_f <chr> "碩肄_女", "7", "14", "2", "3…
#> $ edu_university_graduated_m <chr> "大畢_男", "150", "151", "98"…
#> $ edu_university_graduated_f <chr> "大畢_女", "167", "198", "99"…
#> $ edu_university_ungraduated_m <chr> "大肄_男", "54", "41", "36", …
#> $ edu_university_ungraduated_f <chr> "大肄_女", "48", "38", "19", …
#> $ edu_juniorcollege_2ys_graduated_m <chr> "二畢_男", "38", "36", "8", "…
#> $ edu_juniorcollege_2ys_graduated_f <chr> "二畢_女", "34", "46", "20", …
#> $ edu_juniorcollege_2ys_ungraduated_m <chr> "二肄_男", "8", "3", "3", "3"…
#> $ edu_juniorcollege_2ys_ungraduated_f <chr> "二肄_女", "4", "3", "2", "4"…
#> $ edu_juniorcollege_5ys_final2y_graduated_m <chr> "後二畢_男", "51", "46", "25",…
#> $ edu_juniorcollege_5ys_final2y_graduated_f <chr> "後二畢_女", "31", "43", "26",…
#> $ edu_juniorcollege_5ys_final2y_ungraduated_m <chr> "後二肄_男", "0", "2", "1", "1…
#> $ edu_juniorcollege_5ys_final2y_ungraduated_f <chr> "後二肄_女", "0", "2", "1", "0…
#> $ edu_senior_graduated_m <chr> "高畢_男", "73", "60", "27", …
#> $ edu_senior_graduated_f <chr> "高畢_女", "58", "49", "24", …
#> $ edu_senior_ungraduated_m <chr> "高肄_男", "15", "21", "14", …
#> $ edu_senior_ungraduated_f <chr> "高肄_女", "18", "19", "11", …
#> $ edu_seniorvocational_graduated_m <chr> "職畢_男", "106", "73", "37",…
#> $ edu_seniorvocational_graduated_f <chr> "職畢_女", "145", "109", "51"…
#> $ edu_seniorvocational_ungraduated_m <chr> "職肄_男", "19", "15", "8", "…
#> $ edu_seniorvocational_ungraduated_f <chr> "職肄_女", "15", "9", "6", "1…
#> $ edu_juniorcollege_5ys_first3y_ungraduated_m <chr> "前三肄_男", "2", "7", "1", "4…
#> $ edu_juniorcollege_5ys_first3y_ungraduated_f <chr> "前三肄_女", "4", "8", "0", "3…
#> $ edu_junior_graduated_m <chr> "國畢_男", "40", "29", "22", …
#> $ edu_junior_graduated_f <chr> "國畢_女", "74", "42", "19", …
#> $ edu_junior_ungraduated_m <chr> "國肄_男", "2", "5", "2", "5"…
#> $ edu_junior_ungraduated_f <chr> "國肄_女", "5", "10", "4", "4…
#> $ edu_juniorvocational_graduated_m <chr> "初畢_男", "12", "20", "0", "…
#> $ edu_juniorvocational_graduated_f <chr> "初畢_女", "37", "22", "2", "…
#> $ edu_juniorvocational_ungraduated_m <chr> "初肄_男", "0", "0", "0", "0"…
#> $ edu_juniorvocational_ungraduated_f <chr> "初肄_女", "0", "0", "0", "0"…
#> $ edu_primary_graduated_m <chr> "小畢_男", "26", "31", "18", …
#> $ edu_primary_graduated_f <chr> "小畢_女", "70", "59", "33", …
#> $ edu_primary_ungraduated_m <chr> "小肄_男", "1", "0", "1", "0"…
#> $ edu_primary_ungraduated_f <chr> "小肄_女", "7", "5", "5", "3"…
#> $ edu_selftaughtl_m <chr> "自修_男", "1", "0", "0", "0"…
#> $ edu_selftaughtl_f <chr> "自修_女", "2", "0", "0", "1"…
#> $ edu_illiterate_m <chr> "不識_男", "0", "0", "1", "1"…
#> $ edu_illiterate_f <chr> "不識_女", "5", "4", "6", "4"…
將資料讀進來之後可以看到這份資料的長相,前幾列是村里的名稱與代碼,後面以 edu_
開頭的欄位則是代表該村里教育相關的狀態。舉例來說,edu_doctor_graduated_m
代表的是 博畢_男
,edu_juniorcollege_2ys_ungraduated_f
代表的是 二肄_女
,edu_selftaughtl_m
代表的是 自修_男
,edu_juniorcollege_5ys_first3y_ungraduated_f
代表的是 前三肄_女
。
「2畢」、「2肄」、「後3畢」、「後3肄」、「前3肄」於教育程度統計上分別指專科2、3年制、5專、5專後2年、5專前3年之畢業、肄業狀況。另外,肄業包含中途輟學、休學者,或正在學校求學之學生,所以在學生也算在裡面喔!
### 這邊不要動
<- df_education_raw %>%
df_education slice(-1) %>%
select(-edu_age_15up_total, -statistic_yyy) %>%
mutate(across(contains("edu_"), ~as.integer(.)))
我把第一列刪掉後,再將用不到的欄位刪除,並且把教育相關欄位轉換成整數,接下來請用 df_education
作答!
pivot_longer()
:現在的資料是一個寬表格(wide),請利用 pivot_longer()
或是 gather()
將它轉換成長表格(long),儲存為 df_education_long
,並將 df_education_long
印出。
### your code
<- df_education %>%
df_education_long pivot_longer(cols = -c(1:3), names_to = "key", values_to = "value")
df_education_long### your result should be
#> # A tibble: 356,960 x 5
#> district_code site_id village key value
#> <chr> <chr> <chr> <chr> <int>
#> 1 65000010001 新北市板橋區 留侯里 edu_doctor_graduated_m 4
#> 2 65000010001 新北市板橋區 留侯里 edu_doctor_graduated_f 0
#> 3 65000010001 新北市板橋區 留侯里 edu_doctor_ungraduated_m 1
#> 4 65000010001 新北市板橋區 留侯里 edu_doctor_ungraduated_f 2
#> 5 65000010001 新北市板橋區 留侯里 edu_master_graduated_m 52
#> 6 65000010001 新北市板橋區 留侯里 edu_master_graduated_f 31
#> 7 65000010001 新北市板橋區 留侯里 edu_master_ungraduated_m 12
#> 8 65000010001 新北市板橋區 留侯里 edu_master_ungraduated_f 7
#> 9 65000010001 新北市板橋區 留侯里 edu_university_graduated_m 150
#> 10 65000010001 新北市板橋區 留侯里 edu_university_graduated_f 167
#> # … with 356,950 more rows
separate()
:請觀察 df_education_long
當中 key
欄位的值,將 key 利用 separate()
切開成四個欄位,分別是edu
, level
, edu_status
, gender
,各自代表教育前綴詞、學位、教育狀態、性別。最後將結果儲存在 df_education_long_cut
當中,並將 df_education_long_cut
印出。如果你不懂這邊在說什麼,請參考老師本週第一支影片(R04_1_1 Loading MOI demographic data),基本上就是按照裡面的流程進行。底下會遇到好切跟不好切的情況,遇到不好切的情況時可以使用 str_replace()
函數。
提示:
count()
觀察 key
的結構。舉例來說, edu_doctor_graduated_m
應該切成 edu
= “edu”, level
= “doctor”, edu_status
= “graduated”, gender
= “m”。edu_illiterate_f
代表不識字,沒辦法切成四塊,請想辦法將這個值切成 edu
= “edu”, level
= “illiterate”, edu_status
= “all”, gender
= “f”。edu_juniorcollege_2ys_graduated_m
裡面多了一個 _
,同樣沒辦法切成四塊,請想辦法將這個值切成 edu
= “edu”, level
= “juniorcollege2ys”, edu_status
= “graduated”, gender
= “m”。str_replace(string = "小軒不喜歡吃肉", pattern = "不喜歡", replacement = "超討厭")
的意思是把 string
參數當中滿足 pattern
參數的字串替換成 replacement
參數的值,所以結果會是 "小軒超討厭吃肉"
。### 這邊讓你看 key 的長相
%>% count(key)
df_education_long
### your result should be
### your code
<- df_education_long %>%
df_education_long_cut mutate(key = str_replace(key, "juniorcollege_2ys", "juniorcollege2ys")) %>%
mutate(key = str_replace(key, "juniorcollege_5ys_final2y", "juniorcollege5ysfinal2y")) %>%
mutate(key = str_replace(key, "juniorcollege_5ys_final3y", "juniorcollege5ysfinal3y")) %>%
mutate(key = str_replace(key, "juniorcollege_5ys_first3y", "juniorcollege5ysfirst3y")) %>%
mutate(key = str_replace(key, "_illiterate", "_illiterate_all")) %>%
mutate(key = str_replace(key, "_selftaughtl", "_selftaughtl_all")) %>%
separate(key, c("edu", "level", "graduate", "gender"), sep = "_")
df_education_long_cut
### your result should be
#> # A tibble: 46 x 2
#> key n
#> * <chr> <int>
#> 1 edu_doctor_graduated_f 7760
#> 2 edu_doctor_graduated_m 7760
#> 3 edu_doctor_ungraduated_f 7760
#> 4 edu_doctor_ungraduated_m 7760
#> 5 edu_illiterate_f 7760
#> 6 edu_illiterate_m 7760
#> 7 edu_junior_graduated_f 7760
#> 8 edu_junior_graduated_m 7760
#> 9 edu_junior_ungraduated_f 7760
#> 10 edu_junior_ungraduated_m 7760
#> # … with 36 more rows
#> # A tibble: 356,960 x 8
#> district_code site_id village edu level graduate gender value
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 65000010001 新北市板橋區 留侯里 edu doctor graduated m 4
#> 2 65000010001 新北市板橋區 留侯里 edu doctor graduated f 0
#> 3 65000010001 新北市板橋區 留侯里 edu doctor ungraduated m 1
#> 4 65000010001 新北市板橋區 留侯里 edu doctor ungraduated f 2
#> 5 65000010001 新北市板橋區 留侯里 edu master graduated m 52
#> 6 65000010001 新北市板橋區 留侯里 edu master graduated f 31
#> 7 65000010001 新北市板橋區 留侯里 edu master ungraduated m 12
#> 8 65000010001 新北市板橋區 留侯里 edu master ungraduated f 7
#> 9 65000010001 新北市板橋區 留侯里 edu university graduated m 150
#> 10 65000010001 新北市板橋區 留侯里 edu university graduated f 167
#> # … with 356,950 more rows
group_by()
and summarize()
:請計算以區域別(site_id) 為單位的教育統計,產出的欄位包含 n_sum(15歲以上現住人口總人數), n_female(女性15歲以上現住人口總人數), n_higher(高等教育總人數, 包含博士、碩士、大學,畢業與肄業皆計入), n_higher_female(女性高等教育總人數), n_self(自學總人數), n_illiterate(不識字總人數),把資料儲存為 df_education_agg
,並且將 df_education_agg
印出。
### your code
<- df_education_long_cut %>%
df_education_agg group_by(site_id) %>%
summarise(n_sum = sum(value),
n_female = sum(value[gender == "f"]),
n_higher = sum(value[level %in% c("doctor", "master", "university")]),
n_higher_female = sum(value[level %in% c("doctor", "master", "university") & gender == "f"]),
n_self = sum(value[level == "selftaughtl"]),
n_illiterate = sum(value[level == "illiterate"]))
df_education_agg
### your result should be
#> # A tibble: 370 x 7
#> site_id n_sum n_female n_higher n_higher_female n_self n_illiterate
#> * <chr> <int> <int> <int> <int> <int> <int>
#> 1 南投縣中寮鄉 13571 6187 2339 1141 38 380
#> 2 南投縣仁愛鄉 13158 6386 2029 1088 31 51
#> 3 南投縣信義鄉 13847 6418 2482 1210 64 76
#> 4 南投縣南投市 87696 44119 28352 13650 88 791
#> 5 南投縣名間鄉 34622 16354 7706 3726 108 306
#> 6 南投縣國姓鄉 17132 7860 2937 1386 34 410
#> 7 南投縣埔里鎮 71529 36009 17679 8941 99 749
#> 8 南投縣水里鄉 15960 7561 3131 1489 38 266
#> 9 南投縣竹山鎮 48789 23780 11115 5336 785 0
#> 10 南投縣草屯鎮 86210 42768 25906 12396 135 1040
#> # … with 360 more rows
請以教育資料為基底,串接第十案公投的資料。串接後會發現兩邊資料大部分可以完成串接,但有少部分會有問題,請檢查到底問題在哪裡,將問題排除後,將串好的資料儲存成 df_ref_edu_join
,再將 df_ref_edu_join
印出。
提示:
filter(is.na(填你要檢查的欄位))
檢查是哪些列有問題。site_id
的問題之後,你會回頭重新利用 group_by()
和 summarize()
計算教育統計,再去 join 公投案資料。### 這裡不要動
<- read_csv("data/AS04/ref10_town.csv") %>%
ref10 select(county = 縣市, town = 鄉鎮市區,
agree = 同意票數, disagree = 不同意票數,
legalVote = 有效票數, illegalVote = 無效票數,
vote = 投票數, legalPopulation = 投票權人數) %>%
mutate(site_id = str_c(county, town)) %>%
select(-c(county, town)) %>%
drop_na(site_id) %>% select(site_id, everything())
### your code
%>% left_join(ref10) %>%
df_education_agg filter(is.na(legalPopulation))
<-
df_ref_edu_join %>%
df_education_agg mutate(site_id = str_remove_all(site_id, " | ")) %>%
mutate(site_id = case_when(
str_detect(site_id, "高雄市鳳山") ~ "高雄市鳳山區",
str_detect(site_id, "高雄市三民") ~ "高雄市三民區",
TRUE ~ site_id
%>%
)) group_by(site_id) %>%
summarise(across(contains("n_"), ~sum(.))) %>%
left_join(ref10)
df_ref_edu_join
### your result should be
#> # A tibble: 16 x 13
#> site_id n_sum n_female n_higher n_higher_female n_self n_illiterate agree
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 嘉義市東 區… 105182 55143 40397 19894 195 917 NA
#> 2 嘉義市西 區… 126516 65445 42813 21093 164 1674 NA
#> 3 新竹市北 區… 128573 66399 51583 24554 184 883 NA
#> 4 新竹市東 區… 172265 88853 87811 41733 212 872 NA
#> 5 臺中市中 區… 15264 7759 5130 2499 23 58 NA
#> 6 臺中市北 區… 129168 68180 53046 26120 135 513 NA
#> 7 臺中市南 區… 108248 57055 46749 23674 176 427 NA
#> 8 臺中市東 區… 66792 33924 23149 11328 92 458 NA
#> 9 臺中市西 區… 99036 52767 45198 22567 145 351 NA
#> 10 臺南市北 區… 115147 59709 46231 22363 197 846 NA
#> 11 臺南市南 區… 112141 56919 37032 17795 477 1378 NA
#> 12 臺南市東 區… 159900 84356 74752 36548 121 737 NA
#> 13 高雄市三民一… 71098 36225 25434 11944 129 566 NA
#> 14 高雄市三民二… 228702 120298 87996 43237 250 1423 NA
#> 15 高雄市鳳山一… 157809 81806 57083 27256 231 1415 NA
#> 16 高雄市鳳山二… 154832 79126 49108 23833 204 1448 NA
#> # … with 5 more variables: disagree <dbl>, legalVote <dbl>, illegalVote <dbl>,
#> # vote <dbl>, legalPopulation <dbl>
#> # A tibble: 368 x 13
#> site_id n_sum n_female n_higher n_higher_female n_self n_illiterate agree
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 南投縣中寮鄉… 13571 6187 2339 1141 38 380 5748
#> 2 南投縣仁愛鄉… 13158 6386 2029 1088 31 51 5702
#> 3 南投縣信義鄉… 13847 6418 2482 1210 64 76 6891
#> 4 南投縣南投市… 87696 44119 28352 13650 88 791 37547
#> 5 南投縣名間鄉… 34622 16354 7706 3726 108 306 14533
#> 6 南投縣國姓鄉… 17132 7860 2937 1386 34 410 7089
#> 7 南投縣埔里鎮… 71529 36009 17679 8941 99 749 29571
#> 8 南投縣水里鄉… 15960 7561 3131 1489 38 266 6392
#> 9 南投縣竹山鎮… 48789 23780 11115 5336 785 0 19254
#> 10 南投縣草屯鎮… 86210 42768 25906 12396 135 1040 35215
#> # … with 358 more rows, and 5 more variables: disagree <dbl>, legalVote <dbl>,
#> # illegalVote <dbl>, vote <dbl>, legalPopulation <dbl>
請利用 df_ref_edu_join
,計算 agreeRate
(分母是 legalVote
,分子是 agree
) 以及 per_higher
(分母是 n_sum
,分子是 n_higher
) ,接著利用 ggplot 和 geom_jitter() 畫出 x 軸 = per_higher
, y 軸 = agreeRate
的散點圖,並用三行以內的文字描述你的詮釋。若不知道怎麼畫,請參考老師本週第四支影片(R04_1_4 Joining demographic and referendum data) 最後面。
### your code
%>%
df_ref_edu_join mutate(agreeRate = agree / legalVote,
per_higher = n_higher / n_sum) %>%
ggplot() +
aes(per_higher, agreeRate) +
geom_jitter(alpha = 0.5, color = "royalblue") +
theme_light()
### your result should be
# 自己畫就好唷
### your text
# "高等教育程度人口比例和贊成公投第10案的投票人口比例呈現負相關,高等教育程度人口比例越高,贊成公投第10案的投票人口比例越低。"