R for Data Journalism

Chapter 28 TIME & TRENDS

28.1 Highlighting: Unemployed Population

This example is adapted from DataCamp’s Introduction to Data Visualization with ggplot2.

28.1.1 The economics data

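This dataset contains US economic time series data from https://fred.stlouisfed.org/. economics stores the data in "wide" format, while the economics_long data frame stores it in "long" format. The date in each row is the month in which the data were collected.

  • pce: personal consumption expenditures, in billions of dollars. Source: https://fred.stlouisfed.org/series/PCE
  • pop: total population, in thousands. Source: https://fred.stlouisfed.org/series/POP
  • psavert: personal savings rate. Source: https://fred.stlouisfed.org/series/PSAVERT/
  • uempmed: median duration of unemployment, in weeks. Source: https://fred.stlouisfed.org/series/UEMPMED
  • unemploy: number of unemployed, in thousands. Source: https://fred.stlouisfed.org/series/UNEMPLOY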
library(tidyverse)  # loads ggplot2 (which provides the economics data), dplyr, and the %>% pipe
economics %>% head()
## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018
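
To make the wide/long distinction concrete, here is a minimal sketch that reshapes a small piece of the wide table by hand. It uses base R only so that it runs without extra packages; the object names `wide`, `long`, and `series` are illustrative and not from the original, and the values are copied from the first two rows of the preview above. In practice, `tidyr::pivot_longer()` performs the same reshaping in one call.

```r
# Two months and two series copied from the economics preview (illustrative subset)
wide <- data.frame(
  date    = as.Date(c("1967-07-01", "1967-08-01")),
  psavert = c(12.6, 12.6),
  uempmed = c(4.5, 4.7)
)

# Long format: one row per (date, variable) pair instead of one column per variable
series <- c("psavert", "uempmed")
long <- data.frame(
  date     = rep(wide$date, times = length(series)),
  variable = rep(series, each = nrow(wide)),
  value    = unlist(wide[series], use.names = FALSE)
)

long
```

The long layout is the one ggplot2 expects when several series share one plot, because the series name becomes a single column that can be mapped to color or facets.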

28.1.2 Setting marking area

# Recession periods to shade: start date, end date, label text, and label height (y) on the plot
recess <- data.frame(
  begin = c("1969-12-01","1973-11-01","1980-01-01","1981-07-01","1990-07-01","2001-03-01", "2007-12-01"), 
  end = c("1970-11-01","1975-03-01","1980-07-01","1982-11-01","1991-03-01","2001-11-01", "2009-07-30"),
  event = c("Fiscal & Monetary\ntightening", "1973 Oil crisis", "Double dip I","Double dip II", "Oil price shock", "Dot-com bubble", "Sub-prime\nmortgage crisis"),
  y =  c(.01415981, 0.02067402, 0.02951190,  0.03419201,  0.02767339, 0.02159662, 0.02520715)
  )

library(lubridate)
recess <- recess %>%
  mutate(begin = ymd(begin), 
         end = ymd(end))

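# Share of the population that is unemployed over time, with recessions shaded in red
economics %>% 
  ggplot() + 
  aes(x = date, y = unemploy/pop) + 
  ggtitle("The percentage of unemployed Americans \n increases sharply during recessions") +
  geom_line() +
  # One translucent rectangle per recession, spanning the full height of the panel
  geom_rect(data = recess, 
            aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf, fill = "Recession"), 
            inherit.aes = FALSE, alpha = 0.2) +
  # Label each recession at its end date, at the hand-picked height y
  geom_label(data = recess, aes(x = end, y = y, label = event), size = 3) + 
  scale_fill_manual(name = "", values = "red", labels = "Recessions")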
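
The begin/end pairs that drive geom_rect() are ordinary date intervals, so they can be inspected without drawing anything. The sketch below uses base R only: it rebuilds the two date columns with as.Date(), computes the width of each rectangle in days, and flags the monthly observations that fall inside any interval. The names `month_seq` and `in_recession` are illustrative, and the monthly range 1967-07-01 to 2015-04-01 is an assumption based on the copy of economics shipped with ggplot2.

```r
# Recession intervals from the text, parsed with base R instead of lubridate::ymd()
recess <- data.frame(
  begin = as.Date(c("1969-12-01", "1973-11-01", "1980-01-01", "1981-07-01",
                    "1990-07-01", "2001-03-01", "2007-12-01")),
  end   = as.Date(c("1970-11-01", "1975-03-01", "1980-07-01", "1982-11-01",
                    "1991-03-01", "2001-11-01", "2009-07-30"))
)

# Width of each shaded rectangle along the x axis, in days
recess$days <- as.numeric(recess$end - recess$begin)

# Monthly dates covering the same span as economics (assumed range)
month_seq <- seq(as.Date("1967-07-01"), as.Date("2015-04-01"), by = "month")

# TRUE when a month lies inside at least one [begin, end] interval
in_recession <- vapply(
  seq_along(month_seq),
  function(i) any(month_seq[i] >= recess$begin & month_seq[i] <= recess$end),
  logical(1)
)

# First few months that would sit under a shaded rectangle
head(month_seq[in_recession])
```

A check like this is a quick way to catch a mistyped date before it shows up as a misplaced or inverted rectangle in the chart.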

28.2 Smoothing: Unemployed

  • Bin smoothing: each smoothed value is the unweighted mean of the observations inside a fixed-width window (box kernel)
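# Box-kernel smoother with a 210-day window; x.points = date evaluates the fit
# at the observed months so that fit$y lines up with the rows of economics
fit <- with(economics,
            ksmooth(date, unemploy, kernel = "box", bandwidth = 210, x.points = date))

economics %>%
  mutate(smooth = fit$y) %>%
  ggplot() + aes(date, unemploy) + 
  geom_point(alpha = 0.5, color = "skyblue") +   # alpha must lie between 0 and 1
  geom_line(aes(date, smooth), color = "red") + theme_minimal()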
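
What ksmooth() does with kernel = "box" can be reproduced by hand: for each evaluation point x0, it averages the y values whose x lies within bandwidth / 2 of x0. The sketch below checks this on a small made-up series; the vectors `x` and `y` and the helper `window_mean` are illustrative, not from the original. It passes x.points = x because, when x.points is omitted, ksmooth() evaluates the fit on an evenly spaced grid over the range of x rather than at the observed values.

```r
# Toy series: ten equally spaced points (made-up values)
x <- as.numeric(1:10)
y <- c(1, 3, 2, 5, 4, 6, 8, 7, 9, 10)

# Box kernel with bandwidth 3: the window runs from x0 - 1.5 to x0 + 1.5,
# so it covers each point and its immediate neighbours
fit <- ksmooth(x, y, kernel = "box", bandwidth = 3, x.points = x)

# Hypothetical helper: plain mean of y inside the same window
window_mean <- function(x0, half_width) mean(y[abs(x - x0) <= half_width])
manual <- vapply(x, window_mean, numeric(1), half_width = 1.5)

# The two columns should agree
data.frame(x = x, ksmooth = fit$y, manual = manual)
```

At the two ends of the series the window contains fewer points, so the smoothed values there are averages of two observations instead of three.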

28.2.1 Polls_2008

The second example comes from Rafael Irizarry’s online book, Introduction to Data Science.

library(dslabs)
span <- 7  # width of the smoothing window, in days
polls_2008
## # A tibble: 131 × 2
##      day margin
##    <dbl>  <dbl>
##  1  -155 0.0200
##  2  -153 0.0300
##  3  -152 0.065 
##  4  -151 0.06  
##  5  -150 0.07  
##  6  -149 0.05  
##  7  -147 0.035 
##  8  -146 0.06  
##  9  -145 0.0267
## 10  -144 0.0300
## # ℹ 121 more rows
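# Box-kernel smoother with a one-week window; x.points = day evaluates the fit
# at the observed days so that fit$y lines up with the rows of polls_2008
fit <- with(polls_2008, 
            ksmooth(day, margin, kernel = "box", bandwidth = span, x.points = day))

polls_2008 %>% 
    mutate(smooth = fit$y) %>%
    ggplot(aes(day, margin)) +
    geom_point(size = 3, alpha = .5, color = "grey") + 
    geom_line(aes(day, smooth), color = "red") + theme_minimal()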
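
The span controls how much detail survives smoothing: a wider window averages over more points and gives a flatter curve. Below is a rough sketch on simulated data, where the series, the two bandwidths, and the `roughness` helper are all assumptions chosen for illustration rather than anything taken from the original example.

```r
set.seed(1)
x <- as.numeric(1:100)
y <- sin(x / 10) + rnorm(100, sd = 0.3)   # smooth signal plus noise (simulated)

# Same box-kernel smoother, evaluated at the observed x, with two window widths
narrow <- ksmooth(x, y, kernel = "box", bandwidth = 3,  x.points = x)$y
wide   <- ksmooth(x, y, kernel = "box", bandwidth = 15, x.points = x)$y

# Hypothetical roughness measure: sum of squared point-to-point changes
roughness <- function(v) sum(diff(v)^2)

c(narrow = roughness(narrow), wide = roughness(wide))
```

A span that is too small leaves the curve chasing noise, while one that is too large can flatten real movements, so it is worth trying a few values against the plotted points.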