Bokeh & Seaborn(Vaccinating)#

Main Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

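The example dataset is country-level COVID-19 vaccination data from the pandemic period. It contains the figures each country reported on each date. Note that the data has a great many missing values: some are visible as empty fields (items that were not filled in), and others are whole days with no record at all. The date each country started reporting, the dates it skipped, and the date it stopped tracking all differ, so aligning the dates and deciding which time window the data can actually answer questions about takes considerable effort.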

Load vaccination data#

import pandas as pd
raw = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
raw
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index population excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 AFG Asia Afghanistan 2020-01-05 0.0 0.0 NaN 0.0 0.0 NaN ... NaN 37.75 0.5 64.83 0.51 41128772 NaN NaN NaN NaN
1 AFG Asia Afghanistan 2020-01-06 0.0 0.0 NaN 0.0 0.0 NaN ... NaN 37.75 0.5 64.83 0.51 41128772 NaN NaN NaN NaN
2 AFG Asia Afghanistan 2020-01-07 0.0 0.0 NaN 0.0 0.0 NaN ... NaN 37.75 0.5 64.83 0.51 41128772 NaN NaN NaN NaN
3 AFG Asia Afghanistan 2020-01-08 0.0 0.0 NaN 0.0 0.0 NaN ... NaN 37.75 0.5 64.83 0.51 41128772 NaN NaN NaN NaN
4 AFG Asia Afghanistan 2020-01-09 0.0 0.0 NaN 0.0 0.0 NaN ... NaN 37.75 0.5 64.83 0.51 41128772 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
429430 ZWE Africa Zimbabwe 2024-07-31 266386.0 0.0 0.0 5740.0 0.0 0.0 ... 30.7 36.79 1.7 61.49 0.57 16320539 NaN NaN NaN NaN
429431 ZWE Africa Zimbabwe 2024-08-01 266386.0 0.0 0.0 5740.0 0.0 0.0 ... 30.7 36.79 1.7 61.49 0.57 16320539 NaN NaN NaN NaN
429432 ZWE Africa Zimbabwe 2024-08-02 266386.0 0.0 0.0 5740.0 0.0 0.0 ... 30.7 36.79 1.7 61.49 0.57 16320539 NaN NaN NaN NaN
429433 ZWE Africa Zimbabwe 2024-08-03 266386.0 0.0 0.0 5740.0 0.0 0.0 ... 30.7 36.79 1.7 61.49 0.57 16320539 NaN NaN NaN NaN
429434 ZWE Africa Zimbabwe 2024-08-04 266386.0 0.0 0.0 5740.0 0.0 0.0 ... 30.7 36.79 1.7 61.49 0.57 16320539 NaN NaN NaN NaN

429435 rows × 67 columns

Observing data#

raw.columns
Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
       'new_vaccinations_smoothed_per_million',
       'new_people_vaccinated_smoothed',
       'new_people_vaccinated_smoothed_per_hundred', 'stringency_index',
       'population_density', 'median_age', 'aged_65_older', 'aged_70_older',
       'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate',
       'diabetes_prevalence', 'female_smokers', 'male_smokers',
       'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy', 'human_development_index', 'population',
       'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative',
       'excess_mortality', 'excess_mortality_cumulative_per_million'],
      dtype='object')

Count how many rows each continent has. Each continent has tens of thousands of rows because each row is one country on one day.

print(set(raw.continent))
raw.continent.value_counts()
{'Africa', 'Europe', 'Asia', 'South America', 'Oceania', nan, 'North America'}
continent
Africa           95419
Europe           91031
Asia             84199
North America    68638
Oceania          40183
South America    23440
Name: count, dtype: int64
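
The set printed above contains nan, yet the counts do not, because value_counts() leaves missing values out by default. A minimal sketch on a toy frame (the rows are invented for illustration and are not taken from the dataset) shows how dropna=False would surface them:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", np.nan],
    "location": ["Japan", "Taiwan", "France", "Unlabelled"],
})

# Default behaviour: rows whose continent is NaN are not counted
counts_default = toy["continent"].value_counts()

# dropna=False gives NaN its own entry in the result
counts_with_na = toy["continent"].value_counts(dropna=False)

print(counts_default)
print(counts_with_na)
```

Comparing the two totals is a quick way to see how many rows carry no continent label before filtering on that column.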

Filtering data#

Because the goal is to compare Taiwan with other countries, the rest of this section uses only the Asian data, which includes South Korea, Japan, and other countries whose epidemic situation was similar to Taiwan's.

df_asia = raw.loc[raw['continent']=="Asia"]
set(df_asia.location)
{'Afghanistan',
 'Armenia',
 'Azerbaijan',
 'Bahrain',
 'Bangladesh',
 'Bhutan',
 'Brunei',
 'Cambodia',
 'China',
 'East Timor',
 'Georgia',
 'Hong Kong',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Israel',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Lebanon',
 'Macao',
 'Malaysia',
 'Maldives',
 'Mongolia',
 'Myanmar',
 'Nepal',
 'North Korea',
 'Northern Cyprus',
 'Oman',
 'Pakistan',
 'Palestine',
 'Philippines',
 'Qatar',
 'Saudi Arabia',
 'Singapore',
 'South Korea',
 'Sri Lanka',
 'Syria',
 'Taiwan',
 'Tajikistan',
 'Thailand',
 'Turkey',
 'Turkmenistan',
 'United Arab Emirates',
 'Uzbekistan',
 'Vietnam',
 'Yemen'}
# Option 1: a boolean mask with .loc[] keeps rows where location == "Taiwan"
# df_tw = df_asia.loc[df_asia['location'] == "Taiwan"]

# Option 2: the pandas.DataFrame.query() method
df_tw = df_asia.query('location == "Taiwan"')
df_tw
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index population excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
374304 TWN Asia Taiwan 2020-01-16 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374305 TWN Asia Taiwan 2020-01-17 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374306 TWN Asia Taiwan 2020-01-18 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374307 TWN Asia Taiwan 2020-01-19 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374308 TWN Asia Taiwan 2020-01-20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
375647 TWN Asia Taiwan 2023-09-20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375648 TWN Asia Taiwan 2023-09-21 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375649 TWN Asia Taiwan 2023-09-22 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375650 TWN Asia Taiwan 2023-09-23 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375651 TWN Asia Taiwan 2023-09-24 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN

1348 rows × 67 columns
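
The two filtering styles in the cell above select the same rows. A small sketch on toy data (values invented for illustration) checks this, and also takes an explicit .copy() so that later column assignments act on an independent frame:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "location": ["Japan", "Taiwan", "Taiwan", "South Korea"],
    "new_cases": [5.0, 1.0, 2.0, 3.0],
})

# Boolean mask with .loc[]
by_loc = toy.loc[toy["location"] == "Taiwan"].copy()

# Equivalent expression string with .query()
by_query = toy.query('location == "Taiwan"').copy()

# Both approaches return the same rows, index, and dtypes
print(by_loc.equals(by_query))
```

Which style to use is mostly a matter of taste; query() tends to read more easily when several conditions are combined.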

df_tw.dtypes
iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
population                                   int64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object

Line plot of time series#

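Because the plot uses time (the date) on the x-axis, first check the current type of the date variable. The data was loaded from a CSV file, so it is most likely a string and occasionally an integer. The date strings therefore need to be converted to datetime objects.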

print(type(df_tw.date))
# <class 'pandas.core.series.Series'>

print(df_tw.date.dtype)
# object (str)

# Convert the column to datetime. Copy first: df_tw was filtered from another
# frame, and assigning to it directly raises SettingWithCopyWarning.
df_tw = df_tw.copy()
df_tw['date'] = pd.to_datetime(df_tw['date'], format="%Y-%m-%d")

print(df_tw.date.dtype)
# datetime64[ns]
<class 'pandas.core.series.Series'>
object
datetime64[ns]
df_tw
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index population excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
374304 TWN Asia Taiwan 2020-01-16 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374305 TWN Asia Taiwan 2020-01-17 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374306 TWN Asia Taiwan 2020-01-18 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374307 TWN Asia Taiwan 2020-01-19 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
374308 TWN Asia Taiwan 2020-01-20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
375647 TWN Asia Taiwan 2023-09-20 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375648 TWN Asia Taiwan 2023-09-21 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375649 TWN Asia Taiwan 2023-09-22 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375650 TWN Asia Taiwan 2023-09-23 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN
375651 TWN Asia Taiwan 2023-09-24 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 80.46 NaN 23893396 NaN NaN NaN NaN

1348 rows × 67 columns
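
The conversion above assumes every date string matches the format. If a column might contain malformed entries, one option is errors="coerce", which turns unparseable values into NaT instead of raising. A minimal sketch on toy strings (invented for illustration):

```python
import pandas as pd

# Toy strings, invented for illustration; the last one is deliberately malformed
dates = pd.Series(["2020-01-16", "2020-01-17", "not a date"])

# errors="coerce" replaces values that do not match the format with NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())  # number of entries that failed to parse
```

Counting the NaT values afterwards shows how many rows were dropped silently, which is worth checking before plotting.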

Plot 1 line by Pandas#

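Arguments: x and y name the columns plotted on each axis; figsize sets the figure width and height in inches.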

df_tw.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>
[Figure: line plot of new_cases against date for Taiwan]

Plot multiple lines#

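Plotting a line chart for a single variable (one country) is easy: the x-axis is the date and the y-axis is the number of cases. But how do we plot several countries, with one line per country? The following uses the data for Japan and Taiwan as an example.

The location column records whether each row belongs to Japan or Taiwan. Visualization tools generally take one of two approaches. The first requires spreading Japan and Taiwan out across columns (with df.pivot()), so that each country becomes its own variable; matplotlib, Python's most basic plotting library, works this way. The second is available in seaborn, often described as a higher-level layer over matplotlib: you name location as the grouping variable, and each group is drawn as its own line.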

# .copy() makes df1 an independent frame, so assigning a column raises no SettingWithCopyWarning
df1 = df_asia.loc[df_asia['location'].isin(["Taiwan", "Japan"])].copy()
df1['date'] = pd.to_datetime(df1['date'], format="%Y-%m-%d")
set(df1.location)
{'Japan', 'Taiwan'}
df1[['location', 'date', 'new_cases']]
location date new_cases
188626 Japan 2020-01-05 0.0
188627 Japan 2020-01-06 0.0
188628 Japan 2020-01-07 0.0
188629 Japan 2020-01-08 0.0
188630 Japan 2020-01-09 0.0
... ... ... ...
375647 Taiwan 2023-09-20 NaN
375648 Taiwan 2023-09-21 NaN
375649 Taiwan 2023-09-22 NaN
375650 Taiwan 2023-09-23 NaN
375651 Taiwan 2023-09-24 NaN

3022 rows × 3 columns

# df1 contains more than one location, so a single plot() call draws every row as one line

df1.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>
[Figure: line plot of new_cases against date for df1, with the Japan and Taiwan rows drawn together]
df_wide = df1.pivot(index="date", columns="location", 
                    values=["new_cases", "total_cases", "total_vaccinations_per_hundred"])
df_wide
new_cases total_cases total_vaccinations_per_hundred
location Japan Taiwan Japan Taiwan Japan Taiwan
date
2020-01-05 0.0 NaN 0.0 NaN NaN NaN
2020-01-06 0.0 NaN 0.0 NaN NaN NaN
2020-01-07 0.0 NaN 0.0 NaN NaN NaN
2020-01-08 0.0 NaN 0.0 NaN NaN NaN
2020-01-09 0.0 NaN 0.0 NaN NaN NaN
... ... ... ... ... ... ...
2024-07-31 0.0 NaN 33803572.0 NaN NaN NaN
2024-08-01 0.0 NaN 33803572.0 NaN NaN NaN
2024-08-02 0.0 NaN 33803572.0 NaN NaN NaN
2024-08-03 0.0 NaN 33803572.0 NaN NaN NaN
2024-08-04 0.0 NaN 33803572.0 NaN NaN NaN

1674 rows × 6 columns
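
To make the reshaping above easier to follow, here is a minimal sketch of the same long-to-wide pivot on a toy frame (values invented for illustration). Each distinct location becomes a column, and any date a country did not report becomes NaN:

```python
import pandas as pd

# Toy long-format data: one row per (date, location); values are invented
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "location": ["Japan", "Japan", "Taiwan"],
    "new_cases": [10.0, 12.0, 1.0],
})

# One row per date, one column per location
wide = long_df.pivot(index="date", columns="location", values="new_cases")

print(wide)
```

Taiwan has no row for 2021-01-02 in the toy data, so that cell is NaN in the wide table. This is the same reason the real pivot shows NaN wherever the two countries' reporting dates do not line up.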

fillna()#

# Fill missing values with 0 (this treats "not reported" as zero)
df_wide = df_wide.fillna(0)
df_wide
new_cases total_cases total_vaccinations_per_hundred
location Japan Taiwan Japan Taiwan Japan Taiwan
date
2020-01-05 0.0 0.0 0.0 0.0 0.0 0.0
2020-01-06 0.0 0.0 0.0 0.0 0.0 0.0
2020-01-07 0.0 0.0 0.0 0.0 0.0 0.0
2020-01-08 0.0 0.0 0.0 0.0 0.0 0.0
2020-01-09 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
2024-07-31 0.0 0.0 33803572.0 0.0 0.0 0.0
2024-08-01 0.0 0.0 33803572.0 0.0 0.0 0.0
2024-08-02 0.0 0.0 33803572.0 0.0 0.0 0.0
2024-08-03 0.0 0.0 33803572.0 0.0 0.0 0.0
2024-08-04 0.0 0.0 33803572.0 0.0 0.0 0.0

1674 rows × 6 columns
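
Filling with 0 is simple, but for cumulative columns such as total_cases it makes a running total appear to fall back to zero on unreported days. One possible alternative, sketched here on toy values (invented for illustration), is to fill daily counts with 0 and carry the last known value forward for running totals. Whether either choice is appropriate depends on why a value is missing.

```python
import numpy as np
import pandas as pd

# Toy data with a one-day reporting gap; values are invented
toy = pd.DataFrame({
    "new_cases": [3.0, np.nan, 2.0],
    "total_cases": [3.0, np.nan, 5.0],
})

filled = toy.copy()
# Daily count: treat the gap as zero new cases
filled["new_cases"] = filled["new_cases"].fillna(0)
# Running total: carry the last reported value forward
filled["total_cases"] = filled["total_cases"].ffill()

print(filled)
```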

reset_index()#

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html

After the pivot, date becomes the row index. To turn date back into a regular column, use reset_index().

df_wide.reset_index(inplace=True)
df_wide
date new_cases total_cases total_vaccinations_per_hundred
location Japan Taiwan Japan Taiwan Japan Taiwan
0 2020-01-05 0.0 0.0 0.0 0.0 0.0 0.0
1 2020-01-06 0.0 0.0 0.0 0.0 0.0 0.0
2 2020-01-07 0.0 0.0 0.0 0.0 0.0 0.0
3 2020-01-08 0.0 0.0 0.0 0.0 0.0 0.0
4 2020-01-09 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ...
1669 2024-07-31 0.0 0.0 33803572.0 0.0 0.0 0.0
1670 2024-08-01 0.0 0.0 33803572.0 0.0 0.0 0.0
1671 2024-08-02 0.0 0.0 33803572.0 0.0 0.0 0.0
1672 2024-08-03 0.0 0.0 33803572.0 0.0 0.0 0.0
1673 2024-08-04 0.0 0.0 33803572.0 0.0 0.0 0.0

1674 rows × 7 columns

Visualized by matplotlib with pandas#

Adding the figsize parameter adjusts the figure's width and height. The parameters available to pandas.DataFrame.plot are listed at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html.

df_wide.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>
[Figure: line plot of new_cases against date, one line each for Japan and Taiwan]

More params#

For example, logy=True puts the y-axis on a log scale. The parameters available to pandas.DataFrame.plot are listed at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html.

df_wide.plot(x="date", y="new_cases", figsize=(10, 5), logy=True)
<Axes: xlabel='date'>
[Figure: line plot of new_cases against date for Japan and Taiwan, log-scaled y-axis]

Visualized by seaborn#

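seaborn can use location as the grouping variable, drawing each group as a separate line.

The following first selects location, date, and new_cases, then fills NA values with 0.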

# .copy() makes df1 an independent frame, so assigning a column raises no SettingWithCopyWarning
df1 = df_asia.loc[df_asia['location'].isin(["Taiwan", "Japan", "South Korea"])].copy()
df1['date'] = pd.to_datetime(df1['date'], format="%Y-%m-%d")
df_sns = df1[["location", 'date', 'new_cases']].fillna(0)
df_sns
location date new_cases
188626 Japan 2020-01-05 0.0
188627 Japan 2020-01-06 0.0
188628 Japan 2020-01-07 0.0
188629 Japan 2020-01-08 0.0
188630 Japan 2020-01-09 0.0
... ... ... ...
375647 Taiwan 2023-09-20 0.0
375648 Taiwan 2023-09-21 0.0
375649 Taiwan 2023-09-22 0.0
375650 Taiwan 2023-09-23 0.0
375651 Taiwan 2023-09-24 0.0

4696 rows × 3 columns

Seaborn's plotting is still built on matplotlib, but its lineplot() accepts an additional hue parameter. Passing location to hue draws a separate line for each location.

import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(11, 6))
sns.lineplot(data=df_sns, x='date', y='new_cases', hue='location', ax=ax)
<Axes: xlabel='date', ylabel='new_cases'>
[Figure: seaborn line plot of new_cases against date, one line per location]

Visualized by bokeh: plot_bokeh()#

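Bokeh provides interactive visualization. However, it does not accept a pandas MultiIndex, so the hierarchical columns have to be flattened first. The following is one way to do that. Once the columns are flat, the bokeh plotting function can be used.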

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
# pandas_bokeh is a separate package that adds plot_bokeh() to DataFrames; install it first if needed
# !pip install pandas_bokeh
import pandas_bokeh
pandas_bokeh.output_notebook()
df_wide2 = df_wide.copy()
df_wide2.columns = df_wide.columns.map('_'.join)
df_wide2
date_ new_cases_Japan new_cases_Taiwan total_cases_Japan total_cases_Taiwan total_vaccinations_per_hundred_Japan total_vaccinations_per_hundred_Taiwan
0 2020-01-16 0.0 0.0 0.0 0.0 0.00 0.00
1 2020-01-17 0.0 0.0 0.0 0.0 0.00 0.00
2 2020-01-18 0.0 0.0 0.0 0.0 0.00 0.00
3 2020-01-19 0.0 0.0 0.0 0.0 0.00 0.00
4 2020-01-20 0.0 0.0 0.0 0.0 0.00 0.00
... ... ... ... ... ... ... ...
1021 2022-11-02 70396.0 33156.0 22460268.0 7780125.0 268.25 264.69
1022 2022-11-03 67473.0 29952.0 22527741.0 7810077.0 268.34 264.83
1023 2022-11-04 34064.0 27581.0 22561805.0 7837658.0 268.56 265.00
1024 2022-11-05 74170.0 25535.0 22635975.0 7863193.0 268.81 0.00
1025 2022-11-06 66397.0 24345.0 22702372.0 7887538.0 268.90 265.15

1026 rows × 7 columns
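
Joining the levels with '_'.join leaves a trailing underscore on columns whose second level is empty, which is why the date column above is named date_. A variant that skips empty levels is sketched below on a toy frame (invented for illustration):

```python
import pandas as pd

# Toy two-level columns shaped like the output of pivot() followed by reset_index()
cols = pd.MultiIndex.from_tuples(
    [("date", ""), ("new_cases", "Japan"), ("new_cases", "Taiwan")]
)
toy = pd.DataFrame([["2020-01-16", 0.0, 0.0]], columns=cols)

flat = toy.copy()
# Join only the non-empty levels, so ("date", "") becomes "date" rather than "date_"
flat.columns = ["_".join(level for level in col if level) for col in toy.columns]

print(list(flat.columns))
```

With this variant the x column would be named date rather than date_, so any later call that refers to date_ would need the same change.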

df_wide2.plot_bokeh(
    kind='line',
    x='date_',
    y=['new_cases_Japan', 'new_cases_Taiwan']
)
Figure(
id = '1003', …)

Bar chart: vaccinating rate#

df_asia.dtypes
iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
population                                 float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object
# Copy first so that assigning the column does not raise SettingWithCopyWarning on the filtered frame
df_asia = df_asia.copy()
df_asia['date'] = pd.to_datetime(df_asia.date)
print(df_asia.date.dtype)
datetime64[ns]
max(df_asia.date)
Timestamp('2022-11-07 00:00:00')
import datetime
df_recent = df_asia.loc[df_asia['date'] == datetime.datetime(2021, 10, 28)]
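
Picking one fixed date only works for countries that reported on that exact day. As a possible alternative, the sketch below keeps each country's most recent non-missing value instead. It runs on a toy frame with invented values, and it is an illustration rather than the approach used in the rest of this section.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: Taiwan has no value on the last day
toy = pd.DataFrame({
    "location": ["Japan", "Japan", "Taiwan", "Taiwan"],
    "date": pd.to_datetime(["2021-10-27", "2021-10-28", "2021-10-27", "2021-10-28"]),
    "total_vaccinations_per_hundred": [150.0, 151.0, 105.0, np.nan],
})

latest = (
    toy.dropna(subset=["total_vaccinations_per_hundred"])  # keep reported values only
       .sort_values("date")                                # oldest to newest
       .groupby("location")
       .tail(1)                                            # newest reported row per country
)

print(latest)
```

The trade-off is that the resulting rows no longer share a single date, so the comparison mixes slightly different points in time.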

by pure pandas#

# df_recent.columns
df_recent.plot.barh(x="location", y="total_vaccinations_per_hundred")
<AxesSubplot:ylabel='location'>
[Figure: horizontal bar chart of total_vaccinations_per_hundred by location on 2021-10-28]
df_recent.plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>
[Figure: the same horizontal bar chart drawn with figsize=(10, 10)]
df_recent.sort_values('total_vaccinations_per_hundred', ascending=True).plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>
[Figure: horizontal bar chart of total_vaccinations_per_hundred by location, sorted in ascending order]
df_recent.fillna(0).sort_values('total_vaccinations_per_hundred', ascending=True).plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>
[Figure: horizontal bar chart of total_vaccinations_per_hundred by location, NaN filled with 0 and sorted in ascending order]

by plot_bokeh#

toplot = df_recent.fillna(0).sort_values('total_vaccinations_per_hundred', ascending=True)
toplot.plot_bokeh(kind="barh", x="location", y="total_vaccinations_per_hundred")
Figure(
id = '1282', …)

Bokeh Settings#

Displaying output in jupyter notebook#

from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...

Adjust figure size along with window size#

plot_df = pd.DataFrame({"x":[1, 2, 3, 4, 5],
                        "y":[1, 2, 3, 4, 5],
                        "freq":[10, 20, 13, 40, 35],
                        "label":["10", "20", "13", "40", "35"]})
plot_df
x y freq label
0 1 1 10 10
1 2 2 20 20
2 3 3 13 13
3 4 4 40 40
4 5 5 35 35
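p = figure(title="TEST")
# scatter() draws circle markers by default; newer Bokeh releases deprecate circle() with size
p.scatter(plot_df["x"], plot_df["y"], fill_alpha=0.2, size=plot_df["freq"])
# scale_width lets the plot resize with the width of its container
p.sizing_mode = 'scale_width'
show(p)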

Color mapper#

Categorical color transforming Manually#

# from bokeh.palettes import Magma, Inferno, Plasma, Viridis, Cividis, d3

# cluster_label = list(Counter(df2plot.cluster).keys())
# color_mapper = CategoricalColorMapper(palette=d3['Category20'][len(cluster_label)], factors=cluster_label)
# p = figure(title = "doc clustering")
# p.sizing_mode = 'scale_width'
# p.circle(x = "x", y = "y", 
#          color={'field': 'cluster', 'transform': color_mapper},
#          source = df2plot, 
#          fill_alpha=0.5, size=5, line_color=None)
# show(p)

Continuous color transforming#

from bokeh.models import LinearColorMapper, ColumnDataSource

p = figure(title="ColorMapper Tester")
# Map freq linearly onto the Plasma256 palette, from its minimum to its maximum
color_mapper = LinearColorMapper(palette="Plasma256",
                                 low=plot_df["freq"].min(),
                                 high=plot_df["freq"].max())

source = ColumnDataSource(plot_df)
# Marker size and fill color are both driven by the freq column
p.scatter("x", "y", fill_alpha=0.5,
          size="freq",
          line_color=None,
          source=source,
          fill_color={'field': 'freq', 'transform': color_mapper})

p.sizing_mode = 'scale_width'

show(p)