P04 Practice03 Twitter Users

Read Twitter users data

# Colab or Jupyterlab
!wget https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv -O twitter_user1_hashed.csv
fpath = "twitter_user1_hashed.csv"
import pandas as pd
df = pd.read_csv(fpath, on_bad_lines='skip') 
users = df.to_dict('records')
type(users)
--2022-09-27 18:20:11--  https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98747 (96K) [text/plain]
Saving to: 'twitter_user1_hashed.csv'

twitter_user1_hashe 100%[===================>]  96.43K  --.-KB/s    in 0.1s    

2022-09-27 18:20:12 (976 KB/s) - 'twitter_user1_hashed.csv' saved [98747/98747]
list

If you are not using wget, download the file from https://raw.githubusercontent.com/p4css/py4css/main/data/twitter_user1_hashed.csv, save it as twitter_user1_hashed.csv, and set fpath to wherever you put it.

# Jupyterlab
fpath = "../_build/html/data/twitter_user1_hashed.csv"
import pandas as pd
df = pd.read_csv(fpath, on_bad_lines='skip') 
users = df.to_dict('records')
type(users)
list
print(len(users))
print(type(users[0]))
print(users[0].keys())
users[0]
744
<class 'dict'>
dict_keys(['userid', 'user_display_name', 'user_screen_name', 'user_reported_location', 'user_profile_description', 'user_profile_url', 'follower_count', 'following_count', 'account_creation_date', 'account_language'])
{'userid': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
 'user_display_name': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
 'user_screen_name': 'vMm2zemFOF7kmXoDyX24Bo+TorqhNutpZlATYyxsE=',
 'user_reported_location': nan,
 'user_profile_description': nan,
 'user_profile_url': nan,
 'follower_count': 1,
 'following_count': 52,
 'account_creation_date': '2017-08-30',
 'account_language': 'zh-cn'}
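The record above shows `nan` for several text fields. pandas represents missing CSV cells as the float `nan`, and `nan == nan` is `False`, so an ordinary equality test cannot detect them. The following is a minimal sketch of one way to filter such values using only the standard library. `sample_users` and `is_missing` are hypothetical names introduced for illustration, not part of the original notebook.

```python
import math

# Hypothetical stand-in for the `users` list built above
sample_users = [
    {"user_screen_name": "a", "user_reported_location": float("nan")},
    {"user_screen_name": "b", "user_reported_location": "Taipei"},
    {"user_screen_name": "c", "user_reported_location": float("nan")},
]

def is_missing(value):
    """Return True if value is a float NaN, which is how pandas marks an empty cell."""
    return isinstance(value, float) and math.isnan(value)

# Keep only the users who reported a location
with_location = [u for u in sample_users
                 if not is_missing(u["user_reported_location"])]
print(len(with_location), "of", len(sample_users), "users reported a location")
```

With pandas already imported, `pd.isna(value)` performs the same check on a single value.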

Counting

Count by for-each

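from collections import Counter
lang_dict = Counter()  # maps account_language -> number of users
for user in users:
    lang_dict[user['account_language']] += 1

# Four ways to print the count in a 3-character column
for k, v in lang_dict.most_common(10):
    print("{}\t{:<3} {}".format(k, v, k))  # str.format, left-aligned
    print("%s\t%-3s %s"%(k, v, k))         # %-formatting, left-aligned
    print("{}\t{:>3} {}".format(k, v, k))  # str.format, right-aligned
    print("%s\t%3s %s"%(k, v, k))          # %-formatting, right-aligned
    # print("{:3}\t{:-3d}\t{:3}".format(k, v, k))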
zh-cn	569 zh-cn
zh-cn	569 zh-cn
zh-cn	569 zh-cn
zh-cn	569 zh-cn
en	104 en
en	104 en
en	104 en
en	104 en
ru	36  ru
ru	36  ru
ru	 36 ru
ru	 36 ru
zh-CN	13  zh-CN
zh-CN	13  zh-CN
zh-CN	 13 zh-CN
zh-CN	 13 zh-CN
zh-tw	10  zh-tw
zh-tw	10  zh-tw
zh-tw	 10 zh-tw
zh-tw	 10 zh-tw
es	8   es
es	8   es
es	  8 es
es	  8 es
en-gb	3   en-gb
en-gb	3   en-gb
en-gb	  3 en-gb
en-gb	  3 en-gb
ja	1   ja
ja	1   ja
ja	  1 ja
ja	  1 ja
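The output above lists `zh-cn` and `zh-CN` as separate languages because `Counter` keys are case-sensitive. The sketch below shows two things: `Counter` can be built directly from a generator expression instead of an explicit loop, and the keys can be lowercased first so that the two spellings are merged. Whether merging is appropriate is an assumption about the analysis, not something the data confirms. `sample_users` is a hypothetical stand-in for `users`, and the code assumes every `account_language` value is a string (a missing value would be a float `nan` and would need handling first).

```python
from collections import Counter

# Hypothetical stand-in for the `users` list
sample_users = [
    {"account_language": "zh-cn"},
    {"account_language": "zh-CN"},
    {"account_language": "en"},
    {"account_language": "zh-cn"},
]

# Counter accepts any iterable, so the loop collapses to one expression
raw_counts = Counter(user["account_language"] for user in sample_users)

# Assumption: case differences are not meaningful, so lowercase before counting
merged_counts = Counter(user["account_language"].lower() for user in sample_users)

for lang, n in merged_counts.most_common():
    print(f"{lang}\t{n:>3}")  # f-string with the same right-alignment spec
```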

Sampling

Traverse list by index and range()

Suppose you want to sample the data and analyze only an evenly spaced half of it. How would you do that?

lang_dict = Counter()
for i in range(len(users)):
    if i%2==0:
        lang_dict[users[i]['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v)) 
zh-cn	275
en	63
ru	18
zh-CN	6
es	5
zh-tw	4
en-gb	1
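The loop above visits every index and keeps the even ones with `i%2==0`. `range()` also takes a step argument, so the modulo test can be dropped by generating only the wanted indices. This is a minimal sketch on hypothetical data (`sample_users` and `step` are illustrative names). Changing `step` keeps every n-th record instead of every second one.

```python
from collections import Counter

# Hypothetical stand-in for the `users` list
sample_users = [{"account_language": lang}
                for lang in ["zh-cn", "en", "ru", "zh-cn", "en", "zh-cn", "es"]]
step = 2  # keep every 2nd record; 3 would keep every 3rd, and so on

# Version from the text: visit every index, filter with modulo
by_modulo = Counter()
for i in range(len(sample_users)):
    if i % step == 0:
        by_modulo[sample_users[i]["account_language"]] += 1

# Alternative: let range() produce only indices 0, step, 2*step, ...
by_step = Counter()
for i in range(0, len(sample_users), step):
    by_step[sample_users[i]["account_language"]] += 1

print(by_modulo == by_step)
```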

Traverse list by enumerate()

Suppose you want to sample the data and analyze only an evenly spaced half of it. How would you do that?

lang_dict = Counter()
for i, user in enumerate(users):
    if i%2==0:
        lang_dict[user['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
zh-cn	275
en	63
ru	18
zh-CN	6
es	5
zh-tw	4
en-gb	1

Traverse list by list slicing

Suppose you want to sample the data and analyze only an evenly spaced half of it. How would you do that?

lang_dict = Counter()
for user in users[::2]:
    lang_dict[user['account_language']] += 1
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
zh-cn	275
en	63
ru	18
zh-CN	6
es	5
zh-tw	4
en-gb	1
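All three approaches take every other record, which is a systematic sample: the result depends on the order of the file. If the rows might be ordered in a way that correlates with language, a random half may be preferable. The following is a sketch using the standard-library `random` module. The seed value `42` and the `sample_users` data are arbitrary, hypothetical choices made only so the example is reproducible.

```python
import random
from collections import Counter

# Hypothetical stand-in for the `users` list
sample_users = [{"account_language": lang}
                for lang in ["zh-cn", "en", "ru", "zh-cn",
                             "en", "zh-cn", "es", "zh-cn"]]

# Systematic sample, as in the text: every other record
every_other = sample_users[::2]

# Random sample of the same size, drawn without replacement
rng = random.Random(42)  # fixed seed so the draw can be repeated
random_half = rng.sample(sample_users, k=len(sample_users) // 2)

lang_dict = Counter(user["account_language"] for user in random_half)
for k, v in lang_dict.most_common(10):
    print("{}\t{}".format(k, v))
```

The counts from a random half vary with the seed, so they will not match the systematic sample exactly.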

Sorting a list of dicts by dict value

Trick 1: a lambda as the sort key

sorted_users = sorted(users, key=lambda d: d['follower_count'], reverse=True) 
for user in sorted_users[:20]:
    print("{:>20}\t{}".format(user["user_screen_name"], user["follower_count"]))
     benjaminkudla39	170155
         nurlanyc5sr	130077
           motor0529	105754
       zhangbide9600	105215
           nessniven	100847
        Trina31Owens	93490
          _srk_Ciell	84772
     Rodrigu14132402	76750
       homerbros7780	76718
      Rodrigu_beauty	69841
     mH6OwMaYxK33tpT	69659
         tattazueva2	67522
        wangduoyu121	66675
     oDrjiwIOmq09XaZ	65063
       CarriexWalker	64519
     ISfVQ1b1vwAK9IP	64398
     belousovasofiy2	63602
       bobylevamaina	63428
     jiajiashijiajia	61815
     3xrVKytdmflyeox	58880

Trick 2: operator.itemgetter as the sort key

from operator import itemgetter
sorted_users2 = sorted(users, key=itemgetter('follower_count'), reverse=True) 
for user in sorted_users2[:20]:
    print("{:>20}\t{}".format(user["user_screen_name"], user["follower_count"]))
     benjaminkudla39	170155
         nurlanyc5sr	130077
           motor0529	105754
       zhangbide9600	105215
           nessniven	100847
        Trina31Owens	93490
          _srk_Ciell	84772
     Rodrigu14132402	76750
       homerbros7780	76718
      Rodrigu_beauty	69841
     mH6OwMaYxK33tpT	69659
         tattazueva2	67522
        wangduoyu121	66675
     oDrjiwIOmq09XaZ	65063
       CarriexWalker	64519
     ISfVQ1b1vwAK9IP	64398
     belousovasofiy2	63602
       bobylevamaina	63428
     jiajiashijiajia	61815
     3xrVKytdmflyeox	58880
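Both tricks sort the whole list and then keep only the first 20 entries. When only the top few records are needed, `heapq.nlargest` from the standard library returns them without sorting everything. This is a sketch of that alternative, not the notebook's own method. `sample_users` is hypothetical data with made-up names and counts.

```python
import heapq
from operator import itemgetter

# Hypothetical stand-in for the `users` list
sample_users = [
    {"user_screen_name": "alpha", "follower_count": 120},
    {"user_screen_name": "bravo", "follower_count": 9500},
    {"user_screen_name": "charlie", "follower_count": 42},
    {"user_screen_name": "delta", "follower_count": 3100},
]

# The 2 users with the most followers, largest first
top2 = heapq.nlargest(2, sample_users, key=itemgetter("follower_count"))
for user in top2:
    print("{:>20}\t{}".format(user["user_screen_name"], user["follower_count"]))
```

For a list of 744 users the difference is negligible. The full `sorted()` call is the simpler choice when the complete ranking is also needed.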