P04 Practice03 Twitter Users
Read Twitter users data
# Colab or Jupyterlab
!wget https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv -O twitter_user1_hashed.csv
fpath = "twitter_user1_hashed.csv"
import pandas as pd
df = pd.read_csv(fpath, on_bad_lines='skip')
users = df.to_dict('records')
type(users)
--2022-09-27 18:20:11-- https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98747 (96K) [text/plain]
Saving to: 'twitter_user1_hashed.csv'
twitter_user1_hashe 100%[===================>] 96.43K --.-KB/s in 0.1s
2022-09-27 18:20:12 (976 KB/s) - 'twitter_user1_hashed.csv' saved [98747/98747]
list
Alternatively, download the file from https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv, save it as twitter_user1_hashed.csv, and set fpath to wherever you put it.
# Jupyterlab
fpath = "../_build/html/data/twitter_user1_hashed.csv"
import pandas as pd
df = pd.read_csv(fpath, on_bad_lines='skip')
users = df.to_dict('records')
type(users)
list
print(len(users))
print(type(users[0]))
print(users[0].keys())
users[0]
744
<class 'dict'>
dict_keys(['userid', 'user_display_name', 'user_screen_name', 'user_reported_location', 'user_profile_description', 'user_profile_url', 'follower_count', 'following_count', 'account_creation_date', 'account_language'])
{'userid': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
'user_display_name': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
'user_screen_name': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
'user_reported_location': nan,
'user_profile_description': nan,
'user_profile_url': nan,
'follower_count': 1,
'following_count': 52,
'account_creation_date': '2017-08-30',
'account_language': 'zh-cn'}
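The `nan` entries in the record above are floating-point NaN values, which pandas uses for empty CSV cells. Below is a minimal sketch of how such values might be detected in a plain dict using only the standard library. The `record` dict and the `is_missing` helper are hypothetical illustrations and are not part of the notebook's code.

```python
import math

# Hypothetical record shaped like users[0]; float("nan") stands in for pandas' NaN.
record = {
    "user_reported_location": float("nan"),
    "follower_count": 1,
    "account_language": "zh-cn",
}

def is_missing(value):
    """Return True when value is a float NaN (how empty cells appear here)."""
    return isinstance(value, float) and math.isnan(value)

# NaN never compares equal to itself, so an `==` test cannot find it.
nan_equals_itself = record["user_reported_location"] == record["user_reported_location"]

# Collect the names of the fields whose value is missing.
missing_fields = [key for key, value in record.items() if is_missing(value)]

print(nan_equals_itself)
print(missing_fields)
```

The `isinstance` check comes first because `math.isnan` raises a `TypeError` when given a string such as a screen name.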
Counting
Count by for-each
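from collections import Counter

# Count how many accounts use each language
lang_dict = Counter()
for user in users:
    lang_dict[user['account_language']] += 1

# Print the top 10 languages in four formatting variants
for k, v in lang_dict.most_common(10):
    print("{}\t{:<3} {}".format(k, v, k))  # str.format, count left-aligned in width 3
    print("%s\t%-3s %s" % (k, v, k))       # %-formatting, count left-aligned in width 3
    print("{}\t{:>3} {}".format(k, v, k))  # str.format, count right-aligned in width 3
    print("%s\t%3s %s" % (k, v, k))        # %-formatting, count right-aligned in width 3
    # print("{:3}\t{:-3d}\t{:3}".format(k, v, k))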
zh-cn 569 zh-cn
zh-cn 569 zh-cn
zh-cn 569 zh-cn
zh-cn 569 zh-cn
en 104 en
en 104 en
en 104 en
en 104 en
ru 36 ru
ru 36 ru
ru 36 ru
ru 36 ru
zh-CN 13 zh-CN
zh-CN 13 zh-CN
zh-CN 13 zh-CN
zh-CN 13 zh-CN
zh-tw 10 zh-tw
zh-tw 10 zh-tw
zh-tw 10 zh-tw
zh-tw 10 zh-tw
es 8 es
es 8 es
es 8 es
es 8 es
en-gb 3 en-gb
en-gb 3 en-gb
en-gb 3 en-gb
en-gb 3 en-gb
ja 1 ja
ja 1 ja
ja 1 ja
ja 1 ja
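The output above lists `zh-cn` and `zh-CN` as separate languages because `Counter` keys are case-sensitive. If the two should count as one language, one possible approach is to lowercase each code before counting. The sketch below uses a hypothetical `sample_users` list as stand-in data, and whether merging the two codes is appropriate is an assumption about the analysis goal.

```python
from collections import Counter

# Hypothetical stand-in for the `users` list of dicts.
sample_users = [
    {"account_language": "zh-cn"},
    {"account_language": "zh-CN"},
    {"account_language": "en"},
    {"account_language": "zh-cn"},
]

# Lowercase each language code so that case variants share one key.
normalized_counts = Counter(
    user["account_language"].lower() for user in sample_users
)

for lang, count in normalized_counts.most_common():
    print("{}\t{}".format(lang, count))
```

If a language field could be missing (NaN), it would need to be filtered out first, since a float has no `.lower()` method.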
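Sampling
Traverse list by index and range()
Suppose you want to sample the data and analyze only an evenly spaced half of it. How would you do that? One way is to loop over the indices with range() and keep only the even ones.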
lang_dict = Counter()
for i in range(len(users)):
    if i % 2 == 0:  # keep only even indices: 0, 2, 4, ...
        lang_dict[users[i]['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
zh-cn 275
en 63
ru 18
zh-CN 6
es 5
zh-tw 4
en-gb 1
Traverse list by enumerate()
To analyze only an evenly spaced half of the data, you can also use enumerate(), which yields each index together with its item.
lang_dict = Counter()
for i, user in enumerate(users):
    if i % 2 == 0:  # i is the position of user in the list
        lang_dict[user['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
zh-cn 275
en 63
ru 18
zh-CN 6
es 5
zh-tw 4
en-gb 1
Traverse list by list slicing
To analyze only an evenly spaced half of the data, you can also slice the list with a step of 2, which needs no index test.
lang_dict = Counter()
for user in users[::2]:  # step of 2 selects positions 0, 2, 4, ...
    lang_dict[user['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
zh-cn 275
en 63
ru 18
zh-CN 6
es 5
zh-tw 4
en-gb 1
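The three traversal styles in this section print identical counts, which suggests they select the same records. The sketch below checks that on a small hypothetical `sample_users` list (stand-in data, not the real dataset) by building one `Counter` per style and comparing them.

```python
from collections import Counter

# Hypothetical stand-in for `users`; only the field used here is included.
sample_users = [
    {"account_language": lang}
    for lang in ["zh-cn", "en", "ru", "en", "zh-cn"]
]

# 1. Index with range() and keep the even positions.
by_range = Counter()
for i in range(len(sample_users)):
    if i % 2 == 0:
        by_range[sample_users[i]["account_language"]] += 1

# 2. enumerate() yields the index and the item together.
by_enumerate = Counter()
for i, user in enumerate(sample_users):
    if i % 2 == 0:
        by_enumerate[user["account_language"]] += 1

# 3. Slicing with a step of 2 selects positions 0, 2, 4, ...
by_slice = Counter(user["account_language"] for user in sample_users[::2])

print(by_range == by_enumerate == by_slice)
print(by_slice)
```

Note that this is systematic sampling (every second record), so the result depends on the order of the list.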
Sorting a list of dict by dict value
Trick 1
# Sort by follower_count, largest first; the lambda picks the value to sort on
sorted_users = sorted(users, key=lambda d: d['follower_count'], reverse=True)
for user in sorted_users[:20]:
    print("{:>20}\t{}".format(user["user_screen_name"], user["follower_count"]))
benjaminkudla39 170155
nurlanyc5sr 130077
motor0529 105754
zhangbide9600 105215
nessniven 100847
Trina31Owens 93490
_srk_Ciell 84772
Rodrigu14132402 76750
homerbros7780 76718
Rodrigu_beauty 69841
mH6OwMaYxK33tpT 69659
tattazueva2 67522
wangduoyu121 66675
oDrjiwIOmq09XaZ 65063
CarriexWalker 64519
ISfVQ1b1vwAK9IP 64398
belousovasofiy2 63602
bobylevamaina 63428
jiajiashijiajia 61815
3xrVKytdmflyeox 58880
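When only the top few records are needed, the standard library's `heapq.nlargest` is a possible alternative to sorting the whole list and slicing it. It is not the approach used in this notebook. The sketch below compares the two on a hypothetical `sample_users` list (stand-in data).

```python
import heapq

# Hypothetical stand-in for `users`.
sample_users = [
    {"user_screen_name": "a", "follower_count": 5},
    {"user_screen_name": "b", "follower_count": 20},
    {"user_screen_name": "c", "follower_count": 1},
    {"user_screen_name": "d", "follower_count": 12},
]

# Take the 2 records with the largest follower_count.
top2 = heapq.nlargest(2, sample_users, key=lambda d: d["follower_count"])

# Reference result from the sort-then-slice approach.
top2_sorted = sorted(
    sample_users, key=lambda d: d["follower_count"], reverse=True
)[:2]

top2_names = [user["user_screen_name"] for user in top2]
print(top2_names)
print(top2 == top2_sorted)
```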
Trick 2
from operator import itemgetter

# itemgetter('follower_count') builds the same key function as the lambda in Trick 1
sorted_users2 = sorted(users, key=itemgetter('follower_count'), reverse=True)
for user in sorted_users2[:20]:
    print("{:>20}\t{}".format(user["user_screen_name"], user["follower_count"]))
benjaminkudla39 170155
nurlanyc5sr 130077
motor0529 105754
zhangbide9600 105215
nessniven 100847
Trina31Owens 93490
_srk_Ciell 84772
Rodrigu14132402 76750
homerbros7780 76718
Rodrigu_beauty 69841
mH6OwMaYxK33tpT 69659
tattazueva2 67522
wangduoyu121 66675
oDrjiwIOmq09XaZ 65063
CarriexWalker 64519
ISfVQ1b1vwAK9IP 64398
belousovasofiy2 63602
bobylevamaina 63428
jiajiashijiajia 61815
3xrVKytdmflyeox 58880
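Trick 1 and Trick 2 print the same ranking, since `lambda d: d['follower_count']` and `itemgetter('follower_count')` extract the same value from each dict. The sketch below checks this on a hypothetical `sample_users` list (stand-in data) and also shows that `sorted()` returns a new list and leaves the original order untouched.

```python
from operator import itemgetter

# Hypothetical stand-in for `users`.
sample_users = [
    {"user_screen_name": "a", "follower_count": 5},
    {"user_screen_name": "b", "follower_count": 20},
    {"user_screen_name": "c", "follower_count": 1},
]

# The same sort, written with each of the two key-function styles.
by_lambda = sorted(sample_users, key=lambda d: d["follower_count"], reverse=True)
by_itemgetter = sorted(sample_users, key=itemgetter("follower_count"), reverse=True)

ranked_names = [user["user_screen_name"] for user in by_itemgetter]
first_original_name = sample_users[0]["user_screen_name"]

print(by_lambda == by_itemgetter)
print(ranked_names)
print(first_original_name)
```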