AS04 Twitter API and Pandas Visualization

import pandas as pd
import requests
import os
import json

bearer_token = "YOUR_BEARER_TOKEN"

search_url = "https://api.twitter.com/2/tweets/search/recent"

def bearer_oauth(r):
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2RecentSearchPython"
    return r

def connect_to_endpoint(url, params):
    response = requests.get(url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

Cleaning twitter data#

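The code below is the main snippet used to collect the tweets. Look at how query_params is specified and how the returned data is stored; this will help with Q1 to Q3.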

query_params = {'query': '(taiwan -is:retweet) OR #taiwan',
                'tweet.fields': 'author_id,conversation_id,created_at,id,in_reply_to_user_id,lang,public_metrics,source',
                'expansions':'author_id',
                'user.fields': 'created_at,description,id,location,name,public_metrics,username,verified,withheld'
               }

json_response = connect_to_endpoint(search_url, query_params)
tweets_all = json_response['data']
users_all = json_response['includes']['users']

for i in range(10):
    # Stop early when the response carries no token for a further page
    next_token = json_response["meta"].get("next_token")
    if next_token is None:
        break
    query_params['next_token'] = next_token
    json_response = connect_to_endpoint(search_url, query_params)
    tweets_all.extend(json_response['data'])
    users_all.extend(json_response['includes']['users'])
    print(len(tweets_all), len(users_all))

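The data below is the result of one run of the code above. Users and tweets are stored separately, but both are saved in the file sample_tweets.json. After loading it with the code below, you will find that users_all has 109 records, but they come from only 103 distinct users.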

!wget https://raw.githubusercontent.com/p4css/py4css/main/data/sample_tweets.json -O sample_tweets.json
with open('sample_tweets.json', 'r') as f:
    data = json.load(f)
tweets_all = data['tweets']
users_all = data['users']

print("The data contains", len(tweets_all), "tweets")
print("The data contains", len(set([t['id'] for t in tweets_all])), "distinct tweets")
print("The data contains", len(users_all), "users")
print("The data contains", len(set([t['id'] for t in users_all])), "distinct users")

The data contains 109 tweets
The data contains 109 distinct tweets
The data contains 109 users
The data contains 103 distinct users

Q1. Keep unique user#

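When the for-loop above collects tweets, different iterations may return the same users, because some of the tweets were posted by the same person. This is why there are 109 user records above but only 103 distinct users.

Task: remove the duplicated user records from users_all so that every user appears exactly once (uniqueness).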

# YOUR CODE SHOULD BE HERE
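
One possible way to think about the task, shown on made-up records rather than on the real users_all: key every record by its id in a dict, so a repeated id can only occupy one slot. The names toy_users and unique_users are hypothetical and only serve the illustration.

```python
# Toy records standing in for users_all (hypothetical data, not from sample_tweets.json)
toy_users = [
    {"id": "1", "username": "alice"},
    {"id": "2", "username": "bob"},
    {"id": "1", "username": "alice"},
]

# Key each record by its id: a repeated id overwrites the earlier value,
# while the key keeps the position of its first insertion (Python 3.7+).
unique_users = list({u["id"]: u for u in toy_users}.values())

print(len(toy_users), len(unique_users))
```

A DataFrame-based route (for example drop_duplicates on the id column) would also work; the dict version is used here only because users_all is a plain list of dicts.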

Q1. Verification#

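Run the verification code below. The result should be:

The data contains 109 tweets
The data contains 109 distinct tweets
The data contains 103 users
The data contains 103 distinct users

print("The data contains", len(tweets_all), "tweets")
print("The data contains", len(set([t['id'] for t in tweets_all])), "distinct tweets")
print("The data contains", len(users_all), "users")
print("The data contains", len(set([t['id'] for t in users_all])), "distinct users")

The data contains 109 tweets
The data contains 109 distinct tweets
The data contains 109 users
The data contains 103 distinct users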

Q2. flatten multi-layered data#

If you print a single record of users_all, or inspect users_all as a pandas.DataFrame, you will see that public_metrics is a nested field. How can followers_count, following_count, tweet_count and listed_count each be pulled out into its own variable, so that all four become columns of the DataFrame?

import pandas as pd
print(json.dumps(users_all[0], indent=4))
pd.DataFrame(users_all).head()

{
    "description": "",
    "verified": false,
    "username": "darleneclarke5",
    "id": "564257648",
    "created_at": "2012-04-27T03:27:25.000Z",
    "public_metrics": {
        "followers_count": 701,
        "following_count": 1205,
        "tweet_count": 214782,
        "listed_count": 1
    },
    "name": "Darlene Clarke"
}

	description	verified	username	id	created_at	public_metrics	name	location
0		False	darleneclarke5	564257648	2012-04-27T03:27:25.000Z	{'followers_count': 701, 'following_count': 12...	Darlene Clarke	NaN
1	158/38D有腰身的棉花糖～我願做林場的管理者也不願吊死在樹上，有趣的靈魂可遇不可求，遇...	False	tokoyuki0527	1196170120940310529	2019-11-17T20:56:08.000Z	{'followers_count': 7110, 'following_count': 2...	阿希	Taichung City, Taiwan
2	しんこうちゅう勉強二ホンゴ～ワだシは台湾にです，Twitter 執筆オオユソ星占い。\n\n...	False	NicolasCheng6	1292734971975618560	2020-08-10T08:10:23.000Z	{'followers_count': 0, 'following_count': 0, '...	Nicolas Cheng	台灣
3	Freedom of Thought, of Expresson and from Viol...	False	Spectrumofreas1	1306881559463636994	2020-09-18T09:03:51.000Z	{'followers_count': 471, 'following_count': 35...	Spectre of Reason	Lithosphere
4	Sociólogo, humanista, sobresalido, acelerado, ...	False	ACruzCoutino	839588194433773568	2017-03-08T21:26:39.000Z	{'followers_count': 1370, 'following_count': 1...	Antonio Cruz Coutiño. Dr.	Tuxtla Gutiérrez, Chiapas, Méx

# YOUR CODE SHOULD BE HERE
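
A minimal sketch of one flattening approach, using made-up records in place of users_all: copy every top-level key except public_metrics, then merge the nested dict into the same level. The names toy_users, flat_users and flat_df are hypothetical.

```python
import pandas as pd

# Toy records shaped like the user objects above (hypothetical values)
toy_users = [
    {"id": "1", "username": "alice",
     "public_metrics": {"followers_count": 10, "following_count": 5,
                        "tweet_count": 100, "listed_count": 0}},
    {"id": "2", "username": "bob",
     "public_metrics": {"followers_count": 3, "following_count": 8,
                        "tweet_count": 42, "listed_count": 1}},
]

flat_users = []
for u in toy_users:
    # Keep all top-level fields except the nested one
    flat = {k: v for k, v in u.items() if k != "public_metrics"}
    # Lift the nested counters to the top level
    flat.update(u.get("public_metrics", {}))
    flat_users.append(flat)

flat_df = pd.DataFrame(flat_users)
print(list(flat_df.columns))
```

Building new dicts instead of modifying the records in place means the cell can be re-run without failing on an already-removed public_metrics key.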

Q2. Verification#

Printing the columns with the code below should give the following result: the public_metrics column is gone, and the four columns it contained (followers_count, following_count, tweet_count, listed_count) have been added.

Index(['description', 'verified', 'username', 'id', 'created_at', 'name',
       'followers_count', 'following_count', 'tweet_count', 'listed_count',
       'location'],
      dtype='object')

pd.DataFrame(users_all).columns

Index(['description', 'verified', 'username', 'id', 'created_at',
       'public_metrics', 'name', 'location'],
      dtype='object')

Repeat on tweets data#

print(json.dumps(tweets_all[0], indent=4))
pd.DataFrame(tweets_all).head()

{
    "lang": "en",
    "created_at": "2022-10-17T16:14:34.000Z",
    "conversation_id": "1582042394651873280",
    "edit_history_tweet_ids": [
        "1582042394651873280"
    ],
    "text": "RT @GordonGChang: #America\u2019s military analysts think we will have lots of warning of a #Chinese attack on #Taiwan. In fact, we won\u2019t. See:\u2026",
    "id": "1582042394651873280",
    "source": "Twitter Web App",
    "author_id": "564257648",
    "public_metrics": {
        "retweet_count": 21,
        "reply_count": 0,
        "like_count": 0,
        "quote_count": 0
    }
}

	lang	created_at	conversation_id	edit_history_tweet_ids	text	id	source	author_id	public_metrics	in_reply_to_user_id
0	en	2022-10-17T16:14:34.000Z	1582042394651873280	[1582042394651873280]	RT @GordonGChang: #America’s military analysts...	1582042394651873280	Twitter Web App	564257648	{'retweet_count': 21, 'reply_count': 0, 'like_...	NaN
1	zh	2022-10-17T16:14:29.000Z	1582000361283284998	[1582042375706116097]	@taiwan_davidwu 我來了 https://t.co/IbJEPtbJ2t	1582042375706116097	Twitter for iPhone	1196170120940310529	{'retweet_count': 0, 'reply_count': 0, 'like_c...	1530566857123434496
2	zh	2022-10-17T16:14:14.000Z	1582042309721751553	[1582042309721751553]	Taiwan Constellation \n2022／10／18的星相顯示：\n低頻率發生...	1582042309721751553	Twitter for Android	1292734971975618560	{'retweet_count': 0, 'reply_count': 0, 'like_c...	NaN
3	en	2022-10-17T16:14:04.000Z	1581986552246603776	[1582042269963517952]	@elonmusk @TsicsafPelosi @starrski71 @RenataKo...	1582042269963517952	Twitter for Android	1306881559463636994	{'retweet_count': 1, 'reply_count': 0, 'like_c...	44196397
4	es	2022-10-17T16:14:00.000Z	1582042182168346627	[1582042252716933120]	Subrayó que las acciones de Pekín jamás irán d...	1582042252716933120	Twitter Web App	839588194433773568	{'retweet_count': 0, 'reply_count': 0, 'like_c...	839588194433773568

# YOUR CODE SHOULD BE HERE
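
As an alternative to flattening by hand, pandas.json_normalize can expand nested dicts directly. The sketch below runs on made-up tweet records; toy_tweets and flat_df are hypothetical names, and dropping edit_history_tweet_ids is assumed only because the expected column list for this step does not contain it.

```python
import pandas as pd

# Toy records shaped like the tweet objects above (hypothetical values)
toy_tweets = [
    {"id": "100", "text": "hello", "edit_history_tweet_ids": ["100"],
     "public_metrics": {"retweet_count": 2, "reply_count": 0,
                        "like_count": 5, "quote_count": 0}},
    {"id": "101", "text": "world", "edit_history_tweet_ids": ["101"],
     "public_metrics": {"retweet_count": 0, "reply_count": 1,
                        "like_count": 0, "quote_count": 0}},
]

# Nested dicts become columns named "public_metrics.<key>"
flat_df = pd.json_normalize(toy_tweets)
# Strip the prefix so the counters get plain column names
flat_df.columns = [c.replace("public_metrics.", "") for c in flat_df.columns]
# The list-valued column is not part of the expected result
flat_df = flat_df.drop(columns=["edit_history_tweet_ids"])

print(list(flat_df.columns))
```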

Q2. Verification#

The output should be as follows. Note that edit_history_tweet_ids is no longer among the columns either:

Index(['lang', 'created_at', 'conversation_id', 'text', 'id', 'source',
       'author_id', 'retweet_count', 'reply_count', 'like_count',
       'quote_count', 'in_reply_to_user_id'],
      dtype='object')

pd.DataFrame(tweets_all).columns

Index(['lang', 'created_at', 'conversation_id', 'edit_history_tweet_ids',
       'text', 'id', 'source', 'author_id', 'public_metrics',
       'in_reply_to_user_id'],
      dtype='object')

Q3. Join tweets and users data#

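The next task is to combine the tweets and users data, so that each user's data is attached to the tweets data according to the author_id of each tweet, producing one large dataset. Note that every tweet is unique at this point, but different tweets may share the same author. There are two ways to combine such data:

The traditional approach: scan every tweet, look up its author_id, pull the record with that id from the users data, and add it to the tweet.
The DataFrame approach: convert both datasets to DataFrames and merge them using the idea of joining tables in a database.

Whichever method you use, the final result must be a DataFrame stored in a variable named merged_df. HINT: this question is meant for you to try the second method. In database terms the operation is a LEFT JOIN; see https://towardsdatascience.com/left-join-with-pandas-data-frames-in-python-c29c85089ba4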

tweets_df = pd.DataFrame(tweets_all)
users_df = pd.DataFrame(users_all)

# YOUR CODE SHOULD BE HERE
# merged_df = 
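
To illustrate the left-join idea on small made-up tables: the users table is renamed first so that its id lines up with the tweets' author_id and its created_at does not collide with the tweets' created_at. The frames toy_tweets_df and toy_users_df and the result toy_merged are hypothetical; the rename to account_created_at is assumed from the expected column list for this question.

```python
import pandas as pd

# Toy tables: three tweets, two of them by the same author (hypothetical values)
toy_tweets_df = pd.DataFrame({
    "id": ["100", "101", "102"],
    "author_id": ["1", "2", "1"],
    "created_at": ["2022-10-17", "2022-10-17", "2022-10-18"],
})
toy_users_df = pd.DataFrame({
    "id": ["1", "2"],
    "username": ["alice", "bob"],
    "created_at": ["2012-04-27", "2019-11-17"],
})

# Align the join key and avoid a clash between the two created_at columns
users_renamed = toy_users_df.rename(
    columns={"id": "author_id", "created_at": "account_created_at"})

# how="left" keeps every tweet, repeating the user row for repeated authors
toy_merged = toy_tweets_df.merge(users_renamed, on="author_id", how="left")

print(list(toy_merged.columns))
print(len(toy_merged))
```

A left join only preserves the number of tweets if the users table has one row per user, which is why the deduplication in Q1 matters before merging.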

Q3. Verification#

The output of the code below should be:

Index(['lang', 'created_at', 'conversation_id', 'text', 'id', 'source',
       'author_id', 'retweet_count', 'reply_count', 'like_count',
       'quote_count', 'in_reply_to_user_id', 'description', 'verified',
       'username', 'account_created_at', 'name', 'followers_count',
       'following_count', 'tweet_count', 'listed_count', 'location'],
      dtype='object')

merged_df.columns

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 merged_df.columns

NameError: name 'merged_df' is not defined

Explore twitter information operation data#

The following exercises practice filtering data (Filter) and simple data visualization.

Read READr's reports on information operations:

READr (2019a). Twitter 大戰中國網軍！？解密被刪帳號資料集 - READr 讀+. READr. https://www.readr.tw/post/2013
READr (2019b). 【Twitter 大戰中國網軍】剖析網軍互動！長期低度使用、蹭熱點、小夥伴支援 - READr 讀+. READr. https://www.readr.tw/post/2028
READr (2019c). 【Twitter 大戰中國網軍】Twitter 如何辨認政治網軍？ - READr 讀+. READr. https://www.readr.tw/post/2029

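In the datasets imported below, tweets contains tweets from the problematic accounts that the Twitter Transparency Center released in 2019-08, restricted to tweets posted after 2019-03-15. users contains the first set of problematic accounts from that release (two datasets were released; see https://www.readr.tw/post/2029 for details).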

import pandas as pd
tweets = pd.read_csv("https://raw.githubusercontent.com/p4css/py4css/main/data/tweets20190315.csv")
users = pd.read_csv("https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv")

pat = "港警|逃犯條例|反修例|遊行|修例|反送中|anti-extradition|hongkong|hkpolicebrutality|soshk|hongkongprotesters|HongKongPolice|hkpoliceforce|freedomHK|antiELAB|HongKongProtests|antiextraditionlaw|HongKongProtest|七一|游行|民阵|HongKong|逃犯条例|民陣|撐警|香港眾志|HongKongProterst|林鄭|警队|力撑|HK|香港|港"
print(tweets.shape)
tweets.info()

(15579, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15579 entries, 0 to 15578
Data columns (total 31 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   tweetid                   15579 non-null  float64
 1   userid                    15579 non-null  object 
 2   user_display_name         15579 non-null  object 
 3   user_screen_name          15579 non-null  object 
 4   user_reported_location    12406 non-null  object 
 5   user_profile_description  12585 non-null  object 
 6   user_profile_url          658 non-null    object 
 7   follower_count            15579 non-null  int64  
 8   following_count           15579 non-null  int64  
 9   account_creation_date     15579 non-null  object 
 10  account_language          15579 non-null  object 
 11  tweet_language            15579 non-null  object 
 12  tweet_text                15579 non-null  object 
 13  tweet_time                15579 non-null  object 
 14  tweet_client_name         15579 non-null  object 
 15  in_reply_to_userid        1096 non-null   object 
 16  in_reply_to_tweetid       1071 non-null   float64
 17  quoted_tweet_tweetid      1138 non-null   float64
 18  is_retweet                15579 non-null  bool   
 19  retweet_userid            5184 non-null   object 
 20  retweet_tweetid           5184 non-null   float64
 21  latitude                  15579 non-null  object 
 22  longitude                 15579 non-null  object 
 23  quote_count               15536 non-null  float64
 24  reply_count               15536 non-null  float64
 25  like_count                15536 non-null  float64
 26  retweet_count             15536 non-null  float64
 27  hashtags                  15579 non-null  object 
 28  urls                      15579 non-null  object 
 29  user_mentions             15579 non-null  object 
 30  poll_choices              0 non-null      float64
dtypes: bool(1), float64(9), int64(2), object(19)
memory usage: 3.6+ MB
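
The cell above defines pat without using it. A minimal sketch of how such an alternation pattern might be used to filter rows by tweet_text, run on a made-up frame: toy_tweets, toy_pat and toy_hk_tweets are hypothetical, and the shortened pattern is only a stand-in for pat.

```python
import pandas as pd

# Toy frame with the tweet_text column listed in the info() output (hypothetical rows)
toy_tweets = pd.DataFrame({
    "tweetid": [1, 2, 3],
    "tweet_text": ["反送中 遊行", "good morning", "#HongKongProtests today"],
})

# Shortened stand-in for pat: alternatives separated by "|"
toy_pat = "反送中|HongKong|香港"

# True where the text matches any alternative; na=False treats missing text as no match
mask = toy_tweets["tweet_text"].str.contains(toy_pat, regex=True, na=False)
toy_hk_tweets = toy_tweets[mask]

print(toy_hk_tweets["tweetid"].tolist())
```

str.contains is case-sensitive by default, which may be why the original pattern lists several capitalisations of the same keyword.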

Q6 Account languages of questionable accounts#

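Following the chart「被刪除的帳號註冊語言」(registration languages of the deleted accounts) at https://www.readr.tw/post/2013, draw a bar chart to visualize the account registration languages of hk_users.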

# YOUR CODE SHOULD BE HERE
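
A rough sketch of the counting-then-plotting pattern on made-up data. It assumes the users table has an account_language column like the one in the tweets table, which the cells above do not confirm; toy_hk_users and lang_counts are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd

# Toy stand-in for hk_users (hypothetical values and column name)
toy_hk_users = pd.DataFrame(
    {"account_language": ["en", "zh-cn", "en", "zh-tw", "en", "zh-cn"]})

# Count accounts per language, most frequent first
lang_counts = toy_hk_users["account_language"].value_counts()

# One bar per language
ax = lang_counts.plot.bar()
ax.set_xlabel("account_language")
ax.set_ylabel("number of accounts")

print(lang_counts.to_dict())
```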

Q7 Activities of questionable accounts#

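Following the chart「反送中帳號過去推文時間分佈」(distribution of past tweet times of the anti-extradition accounts) at https://www.readr.tw/post/2013, draw a line chart to show the posting activity of these accounts.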

# YOUR CODE SHOULD BE HERE
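
One way to sketch the time-series step on made-up data: parse tweet_time into datetimes, count tweets per calendar month, and draw the counts as a line. The monthly granularity is an assumption, as are the names toy_tweets and monthly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd

# Toy stand-in with the tweet_time column from the tweets table (hypothetical rows)
toy_tweets = pd.DataFrame({"tweet_time": [
    "2019-03-15 08:00", "2019-03-20 12:30", "2019-04-02 09:15",
    "2019-06-09 14:00", "2019-06-12 18:45", "2019-06-16 20:10",
]})

# tweet_time is read as text (dtype object), so convert it first
toy_tweets["tweet_time"] = pd.to_datetime(toy_tweets["tweet_time"])

# Number of tweets per calendar month, in chronological order
monthly = toy_tweets["tweet_time"].dt.to_period("M").value_counts().sort_index()
monthly.index = monthly.index.astype(str)

ax = monthly.plot.line()
ax.set_xlabel("month")
ax.set_ylabel("number of tweets")

print(monthly.to_dict())
```

Months without any tweet (2019-05 in the toy data) are simply absent from the counts here; reindexing over a full month range would be needed to show them as zero.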
