Bokeh & Seaborn (Vaccination data)#
Main Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
The example data are country-level COVID vaccination records, uploaded by different countries on different dates. Note that the data contain a great many missing values: some are visibly empty (certain fields left blank), and some days were simply never reported. Each country's first reporting date, skipped dates, and the date it stopped reporting all differ, so aligning the dates across countries and deciding over which interval the data can actually answer a question takes considerable work.
Load vaccination data#
import pandas as pd
raw = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
raw
iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 2020-01-05 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.75 | 0.5 | 64.83 | 0.51 | 41128772 | NaN | NaN | NaN | NaN |
1 | AFG | Asia | Afghanistan | 2020-01-06 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.75 | 0.5 | 64.83 | 0.51 | 41128772 | NaN | NaN | NaN | NaN |
2 | AFG | Asia | Afghanistan | 2020-01-07 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.75 | 0.5 | 64.83 | 0.51 | 41128772 | NaN | NaN | NaN | NaN |
3 | AFG | Asia | Afghanistan | 2020-01-08 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.75 | 0.5 | 64.83 | 0.51 | 41128772 | NaN | NaN | NaN | NaN |
4 | AFG | Asia | Afghanistan | 2020-01-09 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.75 | 0.5 | 64.83 | 0.51 | 41128772 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
429430 | ZWE | Africa | Zimbabwe | 2024-07-31 | 266386.0 | 0.0 | 0.0 | 5740.0 | 0.0 | 0.0 | ... | 30.7 | 36.79 | 1.7 | 61.49 | 0.57 | 16320539 | NaN | NaN | NaN | NaN |
429431 | ZWE | Africa | Zimbabwe | 2024-08-01 | 266386.0 | 0.0 | 0.0 | 5740.0 | 0.0 | 0.0 | ... | 30.7 | 36.79 | 1.7 | 61.49 | 0.57 | 16320539 | NaN | NaN | NaN | NaN |
429432 | ZWE | Africa | Zimbabwe | 2024-08-02 | 266386.0 | 0.0 | 0.0 | 5740.0 | 0.0 | 0.0 | ... | 30.7 | 36.79 | 1.7 | 61.49 | 0.57 | 16320539 | NaN | NaN | NaN | NaN |
429433 | ZWE | Africa | Zimbabwe | 2024-08-03 | 266386.0 | 0.0 | 0.0 | 5740.0 | 0.0 | 0.0 | ... | 30.7 | 36.79 | 1.7 | 61.49 | 0.57 | 16320539 | NaN | NaN | NaN | NaN |
429434 | ZWE | Africa | Zimbabwe | 2024-08-04 | 266386.0 | 0.0 | 0.0 | 5740.0 | 0.0 | 0.0 | ... | 30.7 | 36.79 | 1.7 | 61.49 | 0.57 | 16320539 | NaN | NaN | NaN | NaN |
429435 rows × 67 columns
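Before aligning anything, it helps to audit the missing values first. A minimal sketch on a toy frame (hypothetical values, not the real OWID data) showing two useful checks: the per-column missing rate, and each country's reporting window.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the OWID table (hypothetical values): two countries
# with the kinds of gaps described above -- blank fields and missing days.
df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-04",   # A skips Jan 3
             "2021-01-02", "2021-01-03"],
    "new_cases": [1.0, np.nan, 3.0, np.nan, 5.0],
})

# Share of missing values per column
print(df.isna().mean())

# Each country's reporting window (first and last date present)
print(df.groupby("location")["date"].agg(["min", "max"]))
```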
Observing data#
raw.columns
Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
'new_cases_smoothed', 'total_deaths', 'new_deaths',
'new_deaths_smoothed', 'total_cases_per_million',
'new_cases_per_million', 'new_cases_smoothed_per_million',
'total_deaths_per_million', 'new_deaths_per_million',
'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
'icu_patients_per_million', 'hosp_patients',
'hosp_patients_per_million', 'weekly_icu_admissions',
'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
'total_tests_per_thousand', 'new_tests_per_thousand',
'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
'new_vaccinations', 'new_vaccinations_smoothed',
'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
'new_vaccinations_smoothed_per_million',
'new_people_vaccinated_smoothed',
'new_people_vaccinated_smoothed_per_hundred', 'stringency_index',
'population_density', 'median_age', 'aged_65_older', 'aged_70_older',
'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate',
'diabetes_prevalence', 'female_smokers', 'male_smokers',
'handwashing_facilities', 'hospital_beds_per_thousand',
'life_expectancy', 'human_development_index', 'population',
'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative',
'excess_mortality', 'excess_mortality_cumulative_per_million'],
dtype='object')
Count how many rows each continent has. Every continent runs to tens of thousands of rows, because each row is one country on one day.
print(set(raw.continent))
raw.continent.value_counts()
{'Africa', 'Europe', 'Asia', 'South America', 'Oceania', nan, 'North America'}
continent
Africa 95419
Europe 91031
Asia 84199
North America 68638
Oceania 40183
South America 23440
Name: count, dtype: int64
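Note that value_counts() drops NaN by default, which is why the counts above sum to less than 429,435 even though the set() output shows nan as one of the continent values. A sketch on a toy Series (values are hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series(["Asia", "Asia", "Europe", np.nan])

print(s.value_counts())               # NaN rows are excluded by default
print(s.value_counts(dropna=False))   # pass dropna=False to count NaN too
```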
Filtering data#
Since the purpose is to understand the similarities and differences between Taiwan and other countries, the following works only with the Asian data, including South Korea, Japan, and other countries whose handling of the epidemic resembles Taiwan's.
df_asia = raw.loc[raw['continent']=="Asia"]
set(df_asia.location)
{'Afghanistan',
'Armenia',
'Azerbaijan',
'Bahrain',
'Bangladesh',
'Bhutan',
'Brunei',
'Cambodia',
'China',
'East Timor',
'Georgia',
'Hong Kong',
'India',
'Indonesia',
'Iran',
'Iraq',
'Israel',
'Japan',
'Jordan',
'Kazakhstan',
'Kuwait',
'Kyrgyzstan',
'Laos',
'Lebanon',
'Macao',
'Malaysia',
'Maldives',
'Mongolia',
'Myanmar',
'Nepal',
'North Korea',
'Northern Cyprus',
'Oman',
'Pakistan',
'Palestine',
'Philippines',
'Qatar',
'Saudi Arabia',
'Singapore',
'South Korea',
'Sri Lanka',
'Syria',
'Taiwan',
'Tajikistan',
'Thailand',
'Turkey',
'Turkmenistan',
'United Arab Emirates',
'Uzbekistan',
'Vietnam',
'Yemen'}
# Using .loc[] to filter location == Taiwan
# df_tw = df_asia.loc[df_asia['location'] == "Taiwan"]
# Using the pandas.DataFrame.query() method
df_tw = df_asia.query('location == "Taiwan"')
df_tw
iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
374304 | TWN | Asia | Taiwan | 2020-01-16 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374305 | TWN | Asia | Taiwan | 2020-01-17 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374306 | TWN | Asia | Taiwan | 2020-01-18 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374307 | TWN | Asia | Taiwan | 2020-01-19 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374308 | TWN | Asia | Taiwan | 2020-01-20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375647 | TWN | Asia | Taiwan | 2023-09-20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375648 | TWN | Asia | Taiwan | 2023-09-21 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375649 | TWN | Asia | Taiwan | 2023-09-22 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375650 | TWN | Asia | Taiwan | 2023-09-23 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375651 | TWN | Asia | Taiwan | 2023-09-24 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
1348 rows × 67 columns
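The commented-out .loc[] version and the query() call above are equivalent; a quick check on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"location": ["Taiwan", "Japan", "Taiwan"],
                   "v": [1, 2, 3]})

by_loc = df.loc[df["location"] == "Taiwan"]    # boolean-mask indexing
by_query = df.query('location == "Taiwan"')    # query-string equivalent

print(by_loc.equals(by_query))  # True
```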
df_tw.dtypes
iso_code object
continent object
location object
date object
total_cases float64
...
population int64
excess_mortality_cumulative_absolute float64
excess_mortality_cumulative float64
excess_mortality float64
excess_mortality_cumulative_per_million float64
Length: 67, dtype: object
Line plot of time series#
Since the date will serve as the X axis for plotting, first check the current type of the date column. Because the data were loaded from CSV, dates are almost certainly strings (occasionally integers), so the date strings need to be converted to Python datetime objects.
print(type(df_tw.date))
# <class 'pandas.core.series.Series'>
print(df_tw.date.dtype)
# object (str)
# Convert the date strings to datetime; copy first so the assignment
# does not trigger SettingWithCopyWarning on a sliced DataFrame
df_tw = df_tw.copy()
df_tw['date'] = pd.to_datetime(df_tw['date'], format="%Y-%m-%d")
print(df_tw.date.dtype)
# datetime64[ns]
<class 'pandas.core.series.Series'>
object
datetime64[ns]
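Assigning into a DataFrame that is a slice of another one can trigger pandas' SettingWithCopyWarning, because pandas cannot tell whether the assignment should propagate back to the original. Taking an explicit .copy() of the slice first makes the intent unambiguous. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "v": [1, 2]})

sub = df[df["v"] > 0].copy()   # explicit copy: safe to assign into
sub["date"] = pd.to_datetime(sub["date"], format="%Y-%m-%d")

print(sub["date"].dtype)  # datetime64[ns]
```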
df_tw
iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
374304 | TWN | Asia | Taiwan | 2020-01-16 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374305 | TWN | Asia | Taiwan | 2020-01-17 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374306 | TWN | Asia | Taiwan | 2020-01-18 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374307 | TWN | Asia | Taiwan | 2020-01-19 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
374308 | TWN | Asia | Taiwan | 2020-01-20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375647 | TWN | Asia | Taiwan | 2023-09-20 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375648 | TWN | Asia | Taiwan | 2023-09-21 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375649 | TWN | Asia | Taiwan | 2023-09-22 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375650 | TWN | Asia | Taiwan | 2023-09-23 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
375651 | TWN | Asia | Taiwan | 2023-09-24 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 80.46 | NaN | 23893396 | NaN | NaN | NaN | NaN |
1348 rows × 67 columns
Plot 1 line by Pandas#
args
figsize=(10, 5): the figure size is given in inches as (width, height). See https://stackoverflow.com/questions/51174691/how-to-increase-image-size-of-pandas-dataframe-plot
df_tw.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>

Plot multiple lines#
Plotting a single variable (one country) as a line chart is easy: the X axis is the date and the Y axis is the case count. But how do we draw several countries, one line per country? Below, Japan and Taiwan serve as the example. The location column records whether a row belongs to Japan or Taiwan. Visualization software generally supports two approaches. One is to spread Japan and Taiwan out along the column direction (with df.pivot()) so that each country becomes its own variable; matplotlib, Python's most basic plotting library, requires this. But seaborn, often billed as a higher-level layer over matplotlib, lets you pass location as grouping information: in short, it uses location as a grouping variable and draws a separate line for each group.
df1 = df_asia.loc[df_asia['location'].isin(["Taiwan", "Japan"])].copy()
df1['date'] = pd.to_datetime(df1['date'], format="%Y-%m-%d")
set(df1.location)
{'Japan', 'Taiwan'}
df1[['location', 'date', 'new_cases']]
location | date | new_cases | |
---|---|---|---|
188626 | Japan | 2020-01-05 | 0.0 |
188627 | Japan | 2020-01-06 | 0.0 |
188628 | Japan | 2020-01-07 | 0.0 |
188629 | Japan | 2020-01-08 | 0.0 |
188630 | Japan | 2020-01-09 | 0.0 |
... | ... | ... | ... |
375647 | Taiwan | 2023-09-20 | NaN |
375648 | Taiwan | 2023-09-21 | NaN |
375649 | Taiwan | 2023-09-22 | NaN |
375650 | Taiwan | 2023-09-23 | NaN |
375651 | Taiwan | 2023-09-24 | NaN |
3022 rows × 3 columns
# df1 contains more than 1 location, so plotting it directly draws both countries as one jagged line
df1.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>

df_wide = df1.pivot(index="date", columns="location",
values=["new_cases", "total_cases", "total_vaccinations_per_hundred"])
df_wide
new_cases | total_cases | total_vaccinations_per_hundred | ||||
---|---|---|---|---|---|---|
location | Japan | Taiwan | Japan | Taiwan | Japan | Taiwan |
date | ||||||
2020-01-05 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
2020-01-06 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
2020-01-07 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
2020-01-08 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
2020-01-09 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... |
2024-07-31 | 0.0 | NaN | 33803572.0 | NaN | NaN | NaN |
2024-08-01 | 0.0 | NaN | 33803572.0 | NaN | NaN | NaN |
2024-08-02 | 0.0 | NaN | 33803572.0 | NaN | NaN | NaN |
2024-08-03 | 0.0 | NaN | 33803572.0 | NaN | NaN | NaN |
2024-08-04 | 0.0 | NaN | 33803572.0 | NaN | NaN | NaN |
1674 rows × 6 columns
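The long-to-wide reshape that df.pivot() performs above can be sketched in isolation (toy values): one row per date, one column per location.

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "location": ["Japan", "Taiwan", "Japan", "Taiwan"],
    "new_cases": [10, 1, 20, 2],
})

# index= becomes the row labels, columns= the new column labels,
# values= the cell contents
wide = long.pivot(index="date", columns="location", values="new_cases")
print(wide)
```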
fillna()#
df_wide.fillna(0, inplace=True)   # replace every NaN with 0, in place
df_wide.new_cases.Taiwan          # MultiIndex columns support attribute chaining
df_wide
new_cases | total_cases | total_vaccinations_per_hundred | ||||
---|---|---|---|---|---|---|
location | Japan | Taiwan | Japan | Taiwan | Japan | Taiwan |
date | ||||||
2020-01-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-01-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-01-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-01-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2020-01-09 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
2024-07-31 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
2024-08-01 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
2024-08-02 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
2024-08-03 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
2024-08-04 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1674 rows × 6 columns
reset_index()#
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
After the pivot, the rows are indexed by date. To turn date back into an ordinary column, use reset_index().
df_wide.reset_index(inplace=True)
df_wide
date | new_cases | total_cases | total_vaccinations_per_hundred | ||||
---|---|---|---|---|---|---|---|
location | Japan | Taiwan | Japan | Taiwan | Japan | Taiwan | |
0 | 2020-01-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2020-01-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2020-01-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2020-01-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 2020-01-09 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1669 | 2024-07-31 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1670 | 2024-08-01 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1671 | 2024-08-02 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1672 | 2024-08-03 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1673 | 2024-08-04 | 0.0 | 0.0 | 33803572.0 | 0.0 | 0.0 | 0.0 |
1674 rows × 7 columns
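What reset_index() does can be seen on a tiny frame:

```python
import pandas as pd

wide = pd.DataFrame({"v": [1, 2]},
                    index=pd.Index(["d1", "d2"], name="date"))

flat = wide.reset_index()   # 'date' moves from the index back to a column
print(list(flat.columns))   # ['date', 'v']
```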
Visualized by matplotlib with pandas#
Appending the figsize parameter adjusts the aspect ratio. The available parameters of pandas.DataFrame.plot are listed at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html.
df_wide.plot(x="date", y="new_cases", figsize=(10, 5))
<Axes: xlabel='date'>

More params#
For example, plot the Y axis on a log scale. The available parameters of pandas.DataFrame.plot are listed at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html.
df_wide.plot(x="date", y="new_cases", figsize=(10, 5), logy=True)
<Axes: xlabel='date'>

Visualized by seaborn#
seaborn can use location as a grouping variable, drawing each group as its own line. Below, the location, date, and new_cases columns are extracted and NA values are filled with 0.
df1 = df_asia.loc[df_asia['location'].isin(["Taiwan", "Japan", "South Korea"])].copy()
df1['date'] = pd.to_datetime(df1['date'], format="%Y-%m-%d")
df_sns = df1[["location", 'date', 'new_cases']].fillna(0)
df_sns
location | date | new_cases | |
---|---|---|---|
188626 | Japan | 2020-01-05 | 0.0 |
188627 | Japan | 2020-01-06 | 0.0 |
188628 | Japan | 2020-01-07 | 0.0 |
188629 | Japan | 2020-01-08 | 0.0 |
188630 | Japan | 2020-01-09 | 0.0 |
... | ... | ... | ... |
375647 | Taiwan | 2023-09-20 | 0.0 |
375648 | Taiwan | 2023-09-21 | 0.0 |
375649 | Taiwan | 2023-09-22 | 0.0 |
375650 | Taiwan | 2023-09-23 | 0.0 |
375651 | Taiwan | 2023-09-24 | 0.0 |
4696 rows × 3 columns
Seaborn plotting is still built on the matplotlib package, but its lineplot() takes an extra parameter, hue. Assigning location to hue makes seaborn draw a separate line for each location.
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(11, 6))
sns.lineplot(data=df_sns, x='date', y='new_cases', hue='location', ax=ax)
/Users/jirlong/anaconda3/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/Users/jirlong/anaconda3/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
data_subset = grouped_data.get_group(pd_key)
<Axes: xlabel='date', ylabel='new_cases'>

Visualized by bokeh: plot_bokeh()#
https://towardsdatascience.com/beautiful-and-easy-plotting-in-python-pandas-bokeh-afa92d792167
https://patrikhlobil.github.io/Pandas-Bokeh/ (Document of Pandas-Bokeh)
Bokeh adds interactive visualization. It does not accept a pandas MultiIndex, however, so the hierarchical columns have to be flattened first; one way to do that is shown below. Once flattened, the bokeh plotting functions can be used.
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
# !pip install pandas_bokeh
import pandas_bokeh
pandas_bokeh.output_notebook()
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[22], line 2
1 # !pip install pandas_bokeh
----> 2 import pandas_bokeh
3 pandas_bokeh.output_notebook()
ModuleNotFoundError: No module named 'pandas_bokeh'
df_wide2 = df_wide.copy()
df_wide2.columns = df_wide.columns.map('_'.join)
df_wide2
date_ | new_cases_Japan | new_cases_Taiwan | total_cases_Japan | total_cases_Taiwan | total_vaccinations_per_hundred_Japan | total_vaccinations_per_hundred_Taiwan | |
---|---|---|---|---|---|---|---|
0 | 2020-01-16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 |
1 | 2020-01-17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 |
2 | 2020-01-18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 |
3 | 2020-01-19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 |
4 | 2020-01-20 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... |
1021 | 2022-11-02 | 70396.0 | 33156.0 | 22460268.0 | 7780125.0 | 268.25 | 264.69 |
1022 | 2022-11-03 | 67473.0 | 29952.0 | 22527741.0 | 7810077.0 | 268.34 | 264.83 |
1023 | 2022-11-04 | 34064.0 | 27581.0 | 22561805.0 | 7837658.0 | 268.56 | 265.00 |
1024 | 2022-11-05 | 74170.0 | 25535.0 | 22635975.0 | 7863193.0 | 268.81 | 0.00 |
1025 | 2022-11-06 | 66397.0 | 24345.0 | 22702372.0 | 7887538.0 | 268.90 | 265.15 |
1026 rows × 7 columns
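The columns.map('_'.join) trick above joins each (variable, location) pair into a single string, which is also why the date column comes out as date_ with a trailing underscore: its location level is empty. In isolation:

```python
import pandas as pd

# Hierarchical columns like the ones df.pivot() produces
cols = pd.MultiIndex.from_tuples(
    [("date", ""), ("new_cases", "Japan"), ("new_cases", "Taiwan")])
df = pd.DataFrame([[0, 1, 2]], columns=cols)

# Flatten each tuple into one underscore-joined name
df.columns = df.columns.map("_".join)
print(list(df.columns))  # ['date_', 'new_cases_Japan', 'new_cases_Taiwan']
```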
df_wide2.plot_bokeh(
kind='line',
x='date_',
y=['new_cases_Japan', 'new_cases_Taiwan']
)
Bar chart: vaccinating rate#
df_asia.dtypes
iso_code object
continent object
location object
date object
total_cases float64
...
population float64
excess_mortality_cumulative_absolute float64
excess_mortality_cumulative float64
excess_mortality float64
excess_mortality_cumulative_per_million float64
Length: 67, dtype: object
df_asia = df_asia.copy()   # copy first so the assignment below is unambiguous
df_asia['date'] = pd.to_datetime(df_asia.date)
print(df_asia.date.dtype)
datetime64[ns]
max(df_asia.date)
Timestamp('2022-11-07 00:00:00')
import datetime
df_recent = df_asia.loc[df_asia['date'] == datetime.datetime(2021, 10, 28)]
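Filtering on one hard-coded date only works if every country reported that day. An alternative worth knowing (a sketch on a hypothetical toy frame, not what the cell above does) is to take each location's most recent row that actually has a vaccination figure:

```python
import pandas as pd

# Hypothetical toy frame: country A stopped reporting after Oct 1
df = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-10-01", "2021-10-02",
                            "2021-10-01", "2021-10-02"]),
    "total_vaccinations_per_hundred": [50.0, None, 60.0, 65.0],
})

# Drop rows with no figure, then keep the latest remaining row per location
latest = (df.dropna(subset=["total_vaccinations_per_hundred"])
            .sort_values("date")
            .groupby("location")
            .tail(1))
print(latest[["location", "total_vaccinations_per_hundred"]])
```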
by pure pandas#
# df_recent.columns
df_recent.plot.barh(x="location", y="total_vaccinations_per_hundred")
<AxesSubplot:ylabel='location'>

df_recent.plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>

df_recent.sort_values('total_vaccinations_per_hundred', ascending=True).plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>

df_recent.fillna(0).sort_values('total_vaccinations_per_hundred', ascending=True).plot.barh(x="location", y="total_vaccinations_per_hundred", figsize=(10, 10))
<AxesSubplot:ylabel='location'>

by plot_bokeh#
toplot = df_recent.fillna(0).sort_values('total_vaccinations_per_hundred', ascending=True)
toplot.plot_bokeh(kind="barh", x="location", y="total_vaccinations_per_hundred")
Bokeh Settings#
Displaying output in jupyter notebook#
Adjust figure size along with window size#
plot_df = pd.DataFrame({"x":[1, 2, 3, 4, 5],
"y":[1, 2, 3, 4, 5],
"freq":[10, 20, 13, 40, 35],
"label":["10", "20", "13", "40", "35"]})
plot_df
x | y | freq | label | |
---|---|---|---|---|
0 | 1 | 1 | 10 | 10 |
1 | 2 | 2 | 20 | 20 |
2 | 3 | 3 | 13 | 13 |
3 | 4 | 4 | 40 | 40 |
4 | 5 | 5 | 35 | 35 |
p = figure(title = "TEST")
p.circle(plot_df["x"], plot_df["y"], fill_alpha=0.2, size=plot_df["freq"])
p.sizing_mode = 'scale_width'
show(p)
Color mapper#
Categorical color transforming manually#
# from bokeh.palettes import Magma, Inferno, Plasma, Viridis, Cividis, d3
# cluster_label = list(Counter(df2plot.cluster).keys())
# color_mapper = CategoricalColorMapper(palette=d3['Category20'][len(cluster_label)], factors=cluster_label)
# p = figure(title = "doc clustering")
# p.sizing_mode = 'scale_width'
# p.circle(x = "x", y = "y",
# color={'field': 'cluster', 'transform': color_mapper},
# source = df2plot,
# fill_alpha=0.5, size=5, line_color=None)
# show(p)
Continuous color transforming#
from bokeh.palettes import Magma, Inferno, Plasma, Viridis, Cividis, d3
from bokeh.models import LogColorMapper, LinearColorMapper, LabelSet, ColumnDataSource
p = figure(title = "ColorMapper Tester")
color_mapper = LinearColorMapper(palette="Plasma256",
low = min(plot_df["freq"]),
high = max(plot_df["freq"]))
source = ColumnDataSource(plot_df)
p.circle("x", "y", fill_alpha = 0.5,
size = "freq",
line_color=None,
source = source,
fill_color = {'field': 'freq', 'transform': color_mapper}
)
p.sizing_mode = 'scale_width'
show(p)