P01-Counting

P01-Counting#

如何用程式邏輯來計算數量？
計算數量完後，要如何呈現計數後的「資訊」？
1. 如何把計算的結果輸出來觀察？
2. 最多／最少的是哪一/幾個類別？
3. 如何將計數的結果視覺化？
程式語言有沒有更好的方法來計算數量？
其他類似計數的問題有哪些？例如，計算詞頻、計算分數區間等等。

計數#

程式設計是對解決問題邏輯的模仿，這種模仿使得電腦能夠協助您進行大規模的重複性工作。因此，解決問題的邏輯成為至關重要之處。例如，「計數」就是一種簡單而實用的演算法邏輯。計數的程式邏輯的目的在於對於某一集合中的每個元素進行數量計算。以下列舉了一些相關的範例：

假使我們身處於一個班級環境，其中學生眾多。您有意願計算每個不同科系的學生人數，以使得未來的專題分配能夠考慮到各科系的比例。
您被指派負責購買飲料，然而現有的飲料種類眾多，且每款飲料皆具有不同的糖分和冰塊選項。一般情況下，您需要逐一詢問每位人員的選擇，進而將這些選項逐一紀錄，同時也需對於重複選擇進行次數累計。
在統計學領域中，常見的情境為基於不同的區間（例如 [0, 10)、[10, 20) 等），進行學生成績分佈的計算。
在社會科學的範疇中，一個典型的任務涉及計算一篇文章中每個詞彙的出現次數（詞頻）。

以下範例將引導您從基本語法開始，使用Python中最基本的資料型態來執行「計數」操作。

計數任務的定義#

「想像老師遇到如下情境：有一個班級的人，約略六七十人，我不知道這些人來自哪些系所，我想對每個科系進行統計，我該如何運用程式來計數？」

運算思維的思考過程#

首先，我們擁有一系列不同科系的學生：jour(journalism), jour, econ(economic), jour, econ, poli(politics), econ, jour，等等。
想像一下，如果你要實際數這些學生，你會如何進行？你會從第一個開始，一個接著一個地進行每個科系人數的累計，一直到結束為止（有多少位學生，就需要數多少次）。
那麼該如何列計每一種科系學生的數量呢？我們可以使用一張紙，當看到一種之前未出現過的科系學生時，就新增一個「對應」。將該科系的名稱的首個字母與數量對應，一開始設置為0，然後對應的數量欄位遞增一。如果已經見過該科系學生，則直接找到對應的欄位，將數量遞增一即可。

iteration= 1, jour, jour: 1
iteration= 2, jour, jour: 2
iteration= 3, econ, jour: 2, econ: 1
iteration= 4, jour, jour: 3, econ: 1
iteration= 5, econ, jour: 3, econ: 2
iteration= 6, poli, jour: 3, econ: 2, poli: 1
iteration= 7, econ, jour: 3, econ: 3, poli: 1
iteration= 8, jour, jour: 4, econ: 3, poli: 1

以更接近程式邏輯的方式思考#

取一張紙用以建立對應表。在左邊一欄中寫下科系名稱，在右邊一欄中則記錄相對應的科系數量。
排列每個學生的科系，將它們依序排成一列，為逐一數量作好準備。
開始對排好的的每個科系進行數量統計：
- 若該種科系尚未出現在對應表中：
  - 新增一個對應表項目，將該科系名稱記錄下來，並將其數量設為1個。
- 若該種科系已經存在於對應表中：
  - 將對應表中該科系所對應的數字加1，表示出現的次數增加。

用英文來撰寫上述的邏輯思考#

程式語言的邏輯事實上和要解決問題的邏輯表述很像，連文字和關鍵字都很像。包含if-else（如果滿足條件就怎樣，不然就怎樣）等條件判斷結構，和for a in alist對於alist中的每個元素。

Build a look-up table to record each dep and number of the dep(calls it dictionary), naming as dep_count
Keep all deps in a list named dep_list
For each dep in dep_list:
- If the dep does not appear in dep_count
  - Create a mapping in dep_count to map the dep name to 1
- else
  - increase the mapped value of the dep name in dep_count

Thinking: 別的程式語言會怎麼做？

如果是Google Sheet或Excel要做到這件事的話，要先用unique()找出相異的data value，然後用countif()函式來計算，每個相異值個別出現過幾次。作法如Excel skills: How do I get the distinct/uniq。
如果你是用R來處理的話，那就會是count(vec)。

Convert to Python#

將上述用英文撰寫的程式邏輯改用Python語法來撰寫程式碼如下。其中包含了Dict(ionary)、List的資料結構和變數、以及for-loop反覆迴圈邏輯和if-else的條件式判斷。

# Create an empty Dict(ionry) to store the key to value pairs
# Assign an empty Dict(ionry) to a variable 'dep_count'
dep_count = {}

# Create a list to store all things to count
# Assign the List with value to a new variable 'dep_list'
dep_list = ['jour', 'jour', 'econ', 'jour', 'econ', 'poli', 'econ', 'jour'] # A list of dep

# For each dep in the list...
for dep in dep_list:
    if dep not in dep_count:
        dep_count[dep] = 1
    else:
        dep_count[dep] = dep_count[dep] + 1

print(dep_count)

{'jour': 4, 'econ': 3, 'poli': 1}

在上述程式碼中，有幾個重要的概念

Assignment#

Sure, here’s the modified content:

The symbol = is referred to as “assignment,” which assigns the data on the right side to the variable on the left side.
- dep_count = {} means assigning an empty dictionary to the variable on the left side, which is dep_count.
- dep_list = ['jour', 'jour', 'econ', 'jour', 'econ', 'poli', 'econ', 'jour'] is the process of assigning a list of department names (strings) to the variable on the left side, which is dep_list.

List資料型態（資料結構）#

List是一種有序（Ordered）結構，可以用整數作為索引（index）來存取，從0開始編號，第0個、第1個、第2個。
- Access(Read): 可以用小於List長度的整數作為索引來取出List中的某個值，例如一個比較簡短的例子alist = ['a', 'b', 'c']中，alist[0]的值為'a'，alist[2]的值為'c'。
- Access(Write): 也可以用覆寫List中某個元素值，例如alist[2]=3，那麼alist[2]的值就不再是'c'，而是3。

Dict資料型態#

Dict(tionary)是一種無序（Unordered）的鍵值（key-to-value）對應結構結構。如adict = {'a': 1, 'b': 2}。
Access(Read): 要知道Dict的某個key是mapping到哪一個value的話，亦是用[]來讀取，如上例adict['a']的值為1。
Access(Write)dep_count[dep] = 1是用來將值Assign給Dict某個key，承上例，如果adict['a'] = 3後再讀取一次adict['a']的值，便會是3。

Iteration: for#

這裡的for比較像是for-each的概念。for dep in dep_list:的意思是，依序將dep_list中的每一個值，assign給dep；簡而言之，dep這個變項會走訪過每個dep_list中的值。
for-loop中所要反覆執行的程式碼，必須向右多一個tab。

Condition: if, else#

if...else表示式的意義如英文字，如果滿足前述條件的話，就做什麼事，不然的話（else），就做什麼事。
if判斷式中，符合條件或不符合條件（else）所要執行的程式碼，亦必須向右多一個tab。
因此，綜合本程式範例的Iteration和Condition，程式碼會縮排成這個樣子

for dep in dep_list:
    if dep not in dep_count:
        dep_count[dep] = 1
    else:
        dep_count[dep] = dep_count[dep] + 1

Incremental#

a = a + 1是一種遞增運算，將原本的a加上1後Assign給原本的a，除了會覆蓋掉原本a的數值外，相當於將a遞增一。
dep_count[dep] = dep_count[dep] + 1即是將某種dep作為key所對應到的Dict值遞增一。

Printing#

在前面我們介紹到兩種Python中非常重要的資料型態，寫程式過程經常會需要把這些變數或資料給列印出來看看，看是否資料處理是否照自己的邏輯走。

印出整個變數內容#

如果是印出整個變數內容的話，那幾乎是差不多的，就直接用print()函式列印出該變數即可。如果沒有加上print()函式也會印出該變數的內容，但同一個cell，如果有多個變數在列，那只會印出最後一個，如下面的例子中，我希望印出dep_list和dep_count，在沒有用print()的狀況下，只會印出最後一個。

{'jour': 4, 'econ': 3, 'poli': 1}這種從key-to-value對應形態的資料稱之為Dictionary
['jour', 'jour', 'econ', 'jour']這種從0開始編號起，第0個、第1個、第2個的資料型態稱之為List。

# print out dep_list for observation
print(dep_list)

# print out dep_count
print(dep_count)

# print out dep_list
dep_list

# print out dep_count
dep_count

['jour', 'jour', 'econ', 'jour', 'econ', 'poli', 'econ', 'jour']
{'jour': 4, 'econ': 3, 'poli': 1}

{'jour': 4, 'econ': 3, 'poli': 1}

印出Dict中的所有Key或Value#

Dictionary既然稱之為一種key-to-value鍵值對應形態的資料，那這種資料型態就應該會有一群keys，然後每個keys對應至唯一的values，不會有一個keys對應到多個values的情形。

keys()可以傳回一個dictionary的所有keys
values()可以傳回一個dictionary的所有values
items()可傳回一個dictionary的所有key-to-value pairs，以(key1, value1), (key2, value2)的型態。

# Print the dictionary
print(dep_count)

# Print all keys
print(dep_count.keys())

# Print all values mapped by keys
print(dep_count.values())

# Print all key-to-value pairs
print(dep_count.items())

{'jour': 4, 'econ': 3, 'poli': 1}
dict_keys(['jour', 'econ', 'poli'])
dict_values([4, 3, 1])
dict_items([('jour', 4), ('econ', 3), ('poli', 1)])

印出List中的每個項目#

將List中每個項目依序印出非常容易，就只要把它當for-each指令，然後用print()函式印出即可。

for dep in dep_list:
    print(dep)

jour
jour
econ
jour
econ
poli
econ
jour

印出Dict中的每個鍵值對#

因為Dict是一個鍵值對應（key-to-value mapping）結構，所以要先想好，要印的是key還是value。但通常是

for k in dep_count:
    print(k, "\t", dep_count[k])
#     print(key, dep_count[key])

jour 	 4
econ 	 3
poli 	 1

for k, v in dep_count.items():
    print(k, "\t", v)
    

jour 	 4
econ 	 3
poli 	 1

for k, v in dep_count.items():
    print(f"{k} has {v}")

jour has 4
econ has 3
poli has 1

計數的其他邏輯與語法#

V1. Using list.count(key) to count the frequency of something#

用list.count(x)函式可以計算出一個List中為x的有多少個。所以，只要知道有多少種物件，就可以用list.count(x)來計算每種物件的數量。（註：list.count(x)的寫法，表示list類別中有一個成員函式（Member function）為count(x)，可以計算List類別中某個物件的數量。）

dep_list.count("jour")

因此，也可以走訪過所有物件，當遇到沒看過的物件時，便計算一下在該list中有幾個該物件，如果有計算過該物件了，便略過之。用Python來表示這樣的程式邏輯結果如下。

dep_count = {} # dictionary, key to value pairs

dep_list = ['jour', 'jour', 'econ', 'jour', 'econ', 'poli', 'econ', 'jour'] # A list of dep

for dep in dep_list:
    if dep not in dep_count:
        dep_count[dep] = dep_list.count(dep)
print(dep_count)

{'jour': 4, 'econ': 3, 'poli': 1}

V2. Using set(dep_list) to gaurantee unique value in the list¶#

前一個範例中引入了list類別的成員函式count()來計算List中某個物件的數量。但仍須要用if-not-in這樣的判斷式來判斷每個物件是否已經計算過數量。那有沒有辦法取出某個List中有哪些種類的物件？有的，可以用set(alist)這樣的集合函式來取出某個List中所有的物件種類。亦即某個List alist其實是由set(alist)所組成的。set(alist)中不會有重複的元素。這個網站PythoTutor可以視覺化程式邏輯的運作流程，可以幫助了解計數的意涵Visualized by PythonTutor。

如下例。雖然dep_list中的重複物件可能很多，但如果取set(dep_list)的話，僅會列出dep_list中的不重複物件。此時，只需要用for-loop把set(dep_list)中的每種物件都屬過一次各有幾個，便可以知道每種物件出現幾次。

至此，我們一共講述了三種不一樣的結構，List、Dict(ionary)、Set。

set(dep_list)

{'econ', 'jour', 'poli'}

dep_count = {}
for dep in set(dep_list):
    dep_count[k] = dep_list.count(k)
print(dep_count)

{'poli': 1}

V3. by Comprehensive dictionary#

Python有一套很特別的、稱為Comprehension的寫法，可以將上述的迴圈式子寫成單一指令的模樣如下。可以查找網路上「Python Comprehension」獲得範例與說明。常用Python的工程師也常會用Comprehension的方式來寫作，尤其是做資料分析的工程師。

dep_count = {k:dep_list.count(k) for k in set(dep_list)}
print(dep_count)

{'econ': 3, 'jour': 4, 'poli': 1}

V4. by `Counter()`#

最後一個方法是我最愛用的方法，利用了Collections這個容器資料型態。Collections底下包含了Counter()這個Dict的子類別，可用來計算可雜湊 (hashable) 物件的數量。只要直接用Counter(dep_list)便可以計算該List中每種物件的數量。

不過該函式不會在Python一啟動的時候就自動載入。所以需要先載入from collections import Counter。[Visualized by PythonTutor]

from collections import Counter

dep_list = ['jour', 'jour', 'econ', 'jour', 'econ', 'poli', 'econ', 'jour']
dep_count = Counter(dep_list)
for k, v, in dep_count.items():
    print(k, "\t", v)

jour 	 4
econ 	 3
poli 	 1

(Option) Visualize Counter results#

import matplotlib.pyplot as plt

# 從Counter()中提取鍵和相應的值
labels = list(dep_count.keys())
values = list(dep_count.values())

# 創建一個水平條形圖
plt.barh(labels, values)

# 添加標題和標籤
plt.xlabel('number of items')
plt.ylabel('items')

# 顯示圖形
plt.show()

../_images/6868ca47de6d1edb66b787cb2ff5306671f5c4eb64a7443666ea7831cb3017f3.png

Counting的應用#

案例：計算Wikipedia文章詞頻#

「計數」其實是很多程式會應用到的演算邏輯。其他的演算邏輯單元還包含搜尋、排序等等，這些都是運算思維的一部分。在這個例子中，我打算要用計數邏輯來計算一篇Wikipedia文章的詞頻（Term frequency），亦即該篇文章中每個字出現幾遍，相當於把字詞當成上述範例中的科系。因此，一開始我需要把一篇文章斷開成單詞並存成如前面例子中的List。我會用Python的Wikipedia套件來取得某個Wikipedia頁面的摘要，這個套件為第三方套件，通常你的電腦都還沒安裝過這個套件，所以你需要在Terminal.app或cmd.exe中用pip install wikipedia安裝這個套件。用wikipedia的函式把某頁wikipedia的內容抓少許回來後，要先把該篇文章拆成字（用str.split()）。然後再用前面教的方法來計算每個字的出現次數。

!pip install wikipedia  # install a new third-party library

Requirement already satisfied: wikipedia in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (1.4.0)
Requirement already satisfied: requests<3.0.0,>=2.0.0 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from wikipedia) (2.31.0)
Requirement already satisfied: beautifulsoup4 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from wikipedia) (4.11.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (3.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (2022.5.18.1)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.0.0->wikipedia) (1.26.14)
Requirement already satisfied: soupsieve>1.2 in /Users/jirlong/opt/anaconda3/lib/python3.9/site-packages (from beautifulsoup4->wikipedia) (2.3.1)

Obtaining some qurery result from wikipedia#

import wikipedia    # import a 3rd-party library into runtime
string_a  = wikipedia.summary("Big_data", sentences = 10)   # get data by 3rd-party function
print(type(string_a))
print(string_a)
# string_a = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Rodham Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination. He then defeated Republican nominee John McCain in the general election, and was inaugurated as president on January 20, 2009. Nine months after his inauguration, Obama was named the 2009 Nobel Peace Prize laureate"

---------------------------------------------------------------------------
PageError                                 Traceback (most recent call last)
Input In [16], in <cell line: 2>()
      1 import wikipedia    # import a 3rd-party library into runtime
----> 2 string_a  = wikipedia.summary("Big_data", sentences = 10)   # get data by 3rd-party function
      3 print(type(string_a))
      4 print(string_a)

File ~/opt/anaconda3/lib/python3.9/site-packages/wikipedia/util.py:28, in cache.__call__(self, *args, **kwargs)
     26   ret = self._cache[key]
     27 else:
---> 28   ret = self._cache[key] = self.fn(*args, **kwargs)
     30 return ret

File ~/opt/anaconda3/lib/python3.9/site-packages/wikipedia/wikipedia.py:231, in summary(title, sentences, chars, auto_suggest, redirect)
    216 '''
    217 Plain text summary of the page.
    218 
   (...)
    226 * redirect - allow redirection without raising RedirectError
    227 '''
    229 # use auto_suggest and redirect to get the correct article
    230 # also, use page's error checking to raise DisambiguationError if necessary
--> 231 page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
    232 title = page_info.title
    233 pageid = page_info.pageid

File ~/opt/anaconda3/lib/python3.9/site-packages/wikipedia/wikipedia.py:276, in page(title, pageid, auto_suggest, redirect, preload)
    273     except IndexError:
    274       # if there is no suggestion or search results, the page doesn't exist
    275       raise PageError(title)
--> 276   return WikipediaPage(title, redirect=redirect, preload=preload)
    277 elif pageid is not None:
    278   return WikipediaPage(pageid=pageid, preload=preload)

File ~/opt/anaconda3/lib/python3.9/site-packages/wikipedia/wikipedia.py:299, in WikipediaPage.__init__(self, title, pageid, redirect, preload, original_title)
    296 else:
    297   raise ValueError("Either a title or a pageid must be specified")
--> 299 self.__load(redirect=redirect, preload=preload)
    301 if preload:
    302   for prop in ('content', 'summary', 'images', 'references', 'links', 'sections'):

File ~/opt/anaconda3/lib/python3.9/site-packages/wikipedia/wikipedia.py:345, in WikipediaPage.__load(self, redirect, preload)
    343 if 'missing' in page:
    344   if hasattr(self, 'title'):
--> 345     raise PageError(self.title)
    346   else:
    347     raise PageError(pageid=self.pageid)

PageError: Page id "big date" does not match any pages. Try another id!

Text process - replacing punctuation#

import string
print(type(string.punctuation))
print(string.punctuation)

<class 'str'>
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

print(type(string.punctuation))
translator = str.maketrans('','',string.punctuation)
string_a = string_a.translate(translator)
string_a = string_a.lower()

words = string_a.split()
print(type(words))
print(words[:10])

<class 'str'>
<class 'list'>
['big', 'data', 'primarily', 'refers', 'to', 'data', 'sets', 'that', 'are', 'too']

Capitalization: Converting string to lower#

string_a = string_a.lower()
print(string_a)

big data primarily refers to data sets that are too large or complex to be dealt with by traditional dataprocessing application software data with many entries rows offer greater statistical power while data with higher complexity more attributes or columns may lead to a higher false discovery rate though used sometimes loosely partly due to a lack of formal definition the best interpretation is that it is a large body of information that cannot be comprehended when used in small amounts onlybig data analysis challenges include capturing data data storage data analysis search sharing transfer visualization querying updating information privacy and data source big data was originally associated with three key concepts volume variety and velocity the analysis of big data presents challenges in sampling and thus previously allowing for only observations and sampling thus a fourth concept veracity refers to the quality or insightfulness of the data without sufficient investment in expertise for big data veracity then the volume and variety of data can produce costs and risks that exceed an organizations capacity to create and capture value from big datacurrent usage of the term big data tends to refer to the use of predictive analytics user behavior analytics or certain other advanced data analytics methods that extract value from big data and seldom to a particular size of data set there is little doubt that the quantities of data now available are indeed large but thats not the most relevant characteristic of this new data ecosystem
analysis of data sets can find new correlations to spot business trends prevent diseases combat crime and so on scientists business executives medical practitioners advertising and governments alike regularly meet difficulties with large datasets in areas including internet searches fintech healthcare analytics geographic information systems urban informatics and business informatics scientists encounter limitations in escience work including meteorology genomics connectomics complex physics simulations biology and environmental researchthe size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices cheap and numerous informationsensing internet of things devices aerial remote sensing software logs cameras microphones radiofrequency identification rfid readers and wireless sensor networks

Tokenization: splitting text to word#

# astring = astring.translate(None, string.punctuation) for python 2.x
words = string_a.split()
words[:10]

['big',
 'data',
 'primarily',
 'refers',
 'to',
 'data',
 'sets',
 'that',
 'are',
 'too']

Counting word frequency#

word_freq = {}
for w in words:
    if w not in word_freq: # dictionary key-value initilization
        word_freq[w] = 0
    word_freq[w] += 1
#         word_freq[w] = word_freq[w] + 1
len(word_freq)

(Option) Applications: Zip’f law#

圖書資訊學有個Zip’f law在描述每個文字（字詞）的出現機率分佈。在一篇或多篇文章中，會有極少部分的字出現非常多次（例如代名詞、定冠詞、冠詞）等、大部分的字在整篇文章中只會出現過一次，而其他文字則出現2~多次不等。這種字詞出現數量分佈會呈現Power Distribution冪次分佈，大部分的物件出現次數都很少，少部分物件出現次數幾乎佔掉全體數量的甚至80%。

在計算完詞頻後，只要在用Counter(word_freq)計算一次每個詞頻的出現次數（存在word_freq中），意即去計算出現1次的字有幾個、出現2次的字有幾個、…、出現n次的字有幾個，就可以得到此分佈。

from collections import Counter
word_freq = Counter(words)
print(len(word_freq))
print(word_freq.values())

value_freq = Counter(word_freq.values())
print(value_freq)

221
dict_values([7, 23, 1, 2, 10, 3, 6, 2, 1, 4, 4, 2, 2, 1, 5, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 13, 1, 1, 9, 1, 1, 4, 1, 1, 3, 1, 1, 1, 5, 1, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Counter({1: 176, 2: 26, 4: 5, 3: 4, 5: 3, 7: 1, 23: 1, 10: 1, 6: 1, 13: 1, 9: 1, 15: 1})

Plot histogram of word frequency#

%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(word_freq.values(), facecolor='green', alpha=0.5, bins=20)

(array([202.,   4.,   5.,   3.,   1.,   1.,   0.,   1.,   1.,   0.,   1.,
          0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   1.]),
 array([ 1. ,  2.1,  3.2,  4.3,  5.4,  6.5,  7.6,  8.7,  9.8, 10.9, 12. ,
        13.1, 14.2, 15.3, 16.4, 17.5, 18.6, 19.7, 20.8, 21.9, 23. ]),
 <BarContainer object of 20 artists>)

../_images/951be62cd2a4af8a81c659d57ed8487f6b1e3e5d7fcafdb0ffc1477e263570dd.png

More: A shortest version to count word frequency#

word_freq = {k:words.count(k) for k in set(words)}

More:Sorting all pairs of key and value#

import operator
sorted_x = sorted(word_freq.items(), 
                  key=operator.itemgetter(1), 
                  reverse=True)

for k in sorted_x[:6]:
    print(k)

('data', 23)
('and', 15)
('of', 13)
('to', 10)
('the', 9)
('big', 7)

案例：成績百分制轉等第制#

最近台灣也開始用等第制來給分，一方面是為了符合國際的標準，另一原因據說是為了避免學生對成績太錙銖必較，雖然我自己的感覺是，學生對於為何從A+和A的差別更計較了，甚至會來追問，原始成績是多少，為何會變成A。不過，很多老師在給分還是採用百分制，學校在讓老師上傳成績的時候，就有個選項是可自動把百分制轉為等第制，這個範例是這件事的簡化版，我們只轉出A、B、C和F四種等第。

首先，先造一筆測試資料，分別是75個由35~99間的隨機整數，可重複取出。再來，初始化一個Dictionary來存放每個等第的個數（一開始均為0）。最後，使用if-elif-else條件判斷式來判斷每一筆成績落在哪個區間。

Generating grades#

先不要管下面怎麼做的，先知道這樣的方式可以產生隨機75個35~99間的數字並儲存到grades這個list中。

import random
grades  =  [random.randint(35, 100) for i in range(0, 75)]
print("Length of: ", len(grades))
print("Grades: ", " ".join(str(g) for g in grades))

Length of:  75
Grades:  96 36 86 69 85 87 72 56 41 64 97 62 45 94 54 100 43 67 76 96 57 85 86 60 35 94 75 61 40 36 98 91 60 73 67 88 36 100 65 58 75 69 78 67 44 88 62 40 59 59 82 41 87 92 67 49 76 71 99 76 95 97 98 100 88 84 54 80 91 74 60 43 99 76 60

Initializing a dict to store the frequency#

# Initializing level frequency
grade_dict = {'F':0, "C":0, "B":0, "A":0}

# Counting level frequency
for g in grades:
    if 100 >= g >= 80:
        grade_dict["A"] += 1
    elif 79 >= g >= 72:
        grade_dict["B"] += 1
    elif 71 >= g >= 60:
        grade_dict["C"] += 1
    else:
        grade_dict["F"] += 1
print(grade_dict)

{'F': 20, 'C': 16, 'B': 10, 'A': 29}

Sort the data by value#

前面的教學涵蓋了各種技術方法和Python程式語言的基本邏輯，但回到這個問題本身，尤其是「詞頻」這個議題，通常我們會想問，哪些詞出現的頻率最高、或者哪些種類的物件出現頻率最高。要用數據的方式來解答這個問題最好的方式就是將key-to-value的對應按照value由大到小來排列，這樣就可以知道，是哪些key的出現次數最高，這樣的動作稱為排序（Sorting）。

如果是List的排序很簡單，只需要用sorted()這個函式就可以將一個List從小到大或由大而小排序。但Dict是一種Mapping型態，需要照value將key-value對一起排序。注意，此時我們只是希望將排序的結果列印出來而已，不見得要保留原本Dict的資料型態。這樣一共有三種作法如下：

sorted([3, 1, 2, 4, 5, 5, 5, 3, 1], reverse = True)

[5, 5, 5, 4, 3, 3, 2, 1, 1]

Method 1#

我們將dep_count字典使用.items()方法轉換成一個包含有數個tuple的列表。如[('jour', 4), ('econ', 3), ('poli', 1)]。在每個tuple中，第一個元素是鍵（key），第二個元素是值（value）。
接著，我們使用sorted()函式來對這個列表進行排序，排序的依據是tuple中的第二個元素，也就是值。使用operator.itemgetter(1)可以確保我們選取tuple的第二個位置作為排序的依據。而參數reverse=True則表示我們希望按照由大到小的順序進行排序。
最後，我們透過迴圈將排序後的tuple列表中的每個項目進行列印。這樣，我們就能夠清楚地對程式碼進行理解與解釋。

dep_count.items()

dict_items([('jour', 4), ('econ', 3), ('poli', 1)])

import operator
sorted_x = sorted(dep_count.items(), 
                  key=operator.itemgetter(1), 
                  reverse=True)

for k in sorted_x:
    print(k)

('jour', 4)
('econ', 3)
('poli', 1)

也有人會這麼寫

sorted_x = sorted(dep_count.items(), 
                  key=lambda item: item[1], 
                  reverse=True)
sorted_x

[('jour', 4), ('econ', 3), ('poli', 1)]

Method 2#

第二種寫法比較簡單，不用先用items()轉為tuple的型態，而直接用.get()這個Dict的成員函式指定用value來排序（dict.get()指的便是key-to-value的value）。

sorted_keys = sorted(dep_count, key=dep_count.get, reverse=True)
print(sorted_keys)

['jour', 'econ', 'poli']

for key in sorted_keys[:2]:
    print(key, dep_count[key])

jour 4
econ 3

Method 3#

第三種方式是我最愛用的方式，用Counter()直接計算個數後，用成員函式most_common()直接依照出現個數由大而小輸出。most_commnon(n)中可填入一數字，如果是10的話即為前10大。

for f, c in Counter(dep_list).most_common():
    print(f, c)

jour 4
econ 3
poli 1