TM01 Tokenization#
Loading Data#
Load the corpus from the file “hindu.txt”, whose text is taken from https://en.wikipedia.org/wiki/Hindu
Another public resource: http://www.gutenberg.org/
Never “push” protected text to GitHub or other publicly available platforms.
An Example: Wikipedia#
# !pip install wikipedia
import wikipedia
import string
# cv = wikipedia.page("Taipei")
# text = cv.content
# print(cv.url)
# print("The length of Taipei page is ", len(text))
# print(text[:100])
# text = wikipedia.page("Rembrandt").content
# print(len(text))
text = wikipedia.summary("Rembrandt", sentences = 10)
print(type(text))
print("The length of Rembrandt summary is ", len(text))
<class 'str'>
The length of Rembrandt summary is 1487
One more example#
text = wikipedia.summary("Hindus", sentences = 10)
print(type(text))
print("The length of Hindus summary is ", len(text))
<class 'str'>
The length of Hindus summary is 1547
!wget -N https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
!ls
zsh:1: command not found: wget
TM00.ipynb TM08_doc_classification_pipeline.ipynb
TM01_tokenization(chi).ipynb TM09_BERTopic.ipynb
TM01_tokenization.ipynb TM09_re.ipynb
TM02_collocation.ipynb TM0X Using CKIPtagger.ipynb
TM03_2_pos_sentiment_chi.ipynb TM10_Using_Open_AI.ipynb
TM03_POS Tagging.ipynb blessing.txt
TM04_IR_search.ipynb data
TM05_sentiment.ipynb hindu.txt
TM06_Clustering.ipynb img
TM07_1_embeddings.ipynb lib
TM07_2_doc2vec.ipynb stopwords_zh-tw.txt
TM07_3_get_wordEmbeddings.ipynb test.html
TM07_embedding_clustering.ipynb userdict.txt
TM08_doc_classification.ipynb
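The `wget` call above failed because that command is not installed in the shell. A minimal sketch of a pure-Python fallback follows, assuming the same raw GitHub URL is reachable; `download_if_missing` is a hypothetical helper name, not part of the course material.

```python
import os
import urllib.request

# URL copied from the wget command above.
HINDU_URL = "https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt"

def download_if_missing(url, path):
    """Download url to path unless something already exists at path.

    Returns True if a download happened, False otherwise.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True

# Usage (needs network access the first time it runs):
# download_if_missing(HINDU_URL, "hindu.txt")
```

Skipping the download when the file exists roughly mirrors the intent of `wget -N`, although it does not compare timestamps.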
with open("hindu.txt") as fin:
    text = fin.read()
Length of the corpus (in characters)#
print("The length of the corpus: %d" % len(text))
The length of the corpus: 2747
Content#
print(text)
Hindu refers to any person who regards themselves as culturally, ethnically, or religiously adhering to aspects of Hinduism.[1][2] It has historically been used as a geographical, cultural, and later religious identifier for people indigenous to the Indian subcontinent.[3][4]
The historical meaning of the term Hindu has evolved with time. Starting with the Persian and Greek references to the land of the Indus in the 1st millennium BCE through the texts of the medieval era,[5] the term Hindu implied a geographic, ethnic or cultural identifier for people living in the Indian subcontinent around or beyond the Sindhu (Indus) river.[6] By the 16th century, the term began to refer to residents of the subcontinent who were not Turkic or Muslims.[6][a][b]
The historical development of Hindu self-identity within the local South Asian population, in a religious or cultural sense, is unclear.[3][7] Competing theories state that Hindu identity developed in the British colonial era, or that it developed post-8th century CE after the Islamic invasion and medieval Hindu-Muslim wars.[7][8][9] A sense of Hindu identity and the term Hindu appears in some texts dated between the 13th and 18th century in Sanskrit and regional languages.[8][10] The 14th- and 18th-century Indian poets such as Vidyapati, Kabir and Eknath used the phrase Hindu dharma (Hinduism) and contrasted it with Turaka dharma (Islam).[11] The Christian friar Sebastiao Manrique used the term 'Hindu' in religious context in 1649.[12] In the 18th century, the European merchants and colonists began to refer to the followers of Indian religions collectively as Hindus, in contrast to Mohamedans for Mughals and Arabs following Islam.[3][6] By the mid-19th century, colonial orientalist texts further distinguished Hindus from Buddhists, Sikhs and Jains,[3] but the colonial laws continued to consider all of them to be within the scope of the term Hindu until about mid-20th century.[13] Scholars state that the custom of distinguishing between Hindus, Buddhists, Jains and Sikhs is a modern phenomenon.[14][15] Hindoo is an archaic spelling variant, whose use today may be considered derogatory.[16][17]
At more than 1.03 billion,[18] Hindus are the world's third largest group after Christians and Muslims. The vast majority of Hindus, approximately 966 million, live in India, according to India's 2011 census.[19] After India, the next 9 countries with the largest Hindu populations are, in decreasing order: Nepal, Bangladesh, Indonesia, Pakistan, Sri Lanka, United States, Malaysia, United Kingdom and Myanmar.[20] These together accounted for 99% of the world's Hindu population, and the remaining nations of the world together had about 6 million Hindus in 2010.[20]
Tokenization#
Method 1. by built-in .split()#
sentence_a = "What’s in a name? That which we call a rose by any other name would smell as sweet."
print(sentence_a.split(" "))
sentence_b = "2020/04/07 00:08:00"
print(sentence_b.split("/"))
['What’s', 'in', 'a', 'name?', 'That', 'which', 'we', 'call', 'a', 'rose', 'by', 'any', 'other', 'name', 'would', 'smell', 'as', 'sweet.']
['2020', '04', '07 00:08:00']
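`.split()` takes a single separator, which is why the date and time above stay fused in `'07 00:08:00'`. As a sketch, `re.split` from the standard library can split on several delimiters at once; the character class used here is an assumption chosen for this one timestamp format.

```python
import re

sentence_b = "2020/04/07 00:08:00"

# Split on any of "/", space, or ":" in a single pass.
parts = re.split(r"[/ :]", sentence_b)
print(parts)
```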
print("123".isalpha())
print("abc".isalpha())
False
True
print(len(text.split(" ")))
print(text.split(" "))
410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims.[6][a][b]\n\nThe', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population,', 'in', 'a', 'religious', 'or', 'cultural', 'sense,', 'is', 'unclear.[3][7]', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era,', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars.[7][8][9]', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages.[8][10]', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati,', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(Hinduism)', 'and', 
'contrasted', 'it', 'with', 'Turaka', 'dharma', '(Islam).[11]', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu'", 'in', 'religious', 'context', 'in', '1649.[12]', 'In', 'the', '18th', 'century,', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus,', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam.[3][6]', 'By', 'the', 'mid-19th', 'century,', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists,', 'Sikhs', 'and', 'Jains,[3]', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century.[13]', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus,', 'Buddhists,', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon.[14][15]', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant,', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory.[16][17]\n\nAt', 'more', 'than', '1.03', 'billion,[18]', 'Hindus', 'are', 'the', "world's", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims.', 'The', 'vast', 'majority', 'of', 'Hindus,', 'approximately', '966', 'million,', 'live', 'in', 'India,', 'according', 'to', "India's", '2011', 'census.[19]', 'After', 'India,', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are,', 'in', 'decreasing', 'order:', 'Nepal,', 'Bangladesh,', 'Indonesia,', 'Pakistan,', 'Sri', 'Lanka,', 'United', 'States,', 'Malaysia,', 'United', 'Kingdom', 'and', 'Myanmar.[20]', 'These', 'together', 'accounted', 'for', '99%', 'of', 'the', "world's", 'Hindu', 'population,', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010.[20]']
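Note tokens such as `'subcontinent.[3][4]\n\nThe'` in the output above: splitting on a single space leaves paragraph breaks inside tokens. A small sketch, using a fragment of the corpus, shows that calling `.split()` with no argument splits on any run of whitespace instead.

```python
fragment = "Indian subcontinent.[3][4]\n\nThe historical meaning"

by_space = fragment.split(" ")    # separator is exactly one space
by_whitespace = fragment.split()  # separator is any run of whitespace

print(by_space)
print(by_whitespace)
```

With the one-space separator, the newline pair stays glued between two words; with the default separator, the fragment yields five clean tokens.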
Method 2. by nltk’s function#
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
print(len(word_tokenize(text)))
print(word_tokenize(text))
566
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', ',', 'ethnically', ',', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '.', '[', '1', ']', '[', '2', ']', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', ',', 'cultural', ',', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '.', '[', '3', ']', '[', '4', ']', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', '.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', ',', '[', '5', ']', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', ',', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(', 'Indus', ')', 'river', '.', '[', '6', ']', 'By', 'the', '16th', 'century', ',', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', '.', '[', '6', ']', '[', 'a', ']', '[', 'b', ']', 'The', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population', ',', 'in', 'a', 'religious', 'or', 'cultural', 'sense', ',', 'is', 'unclear', '.', '[', '3', ']', '[', '7', ']', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', ',', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars', '.', '[', '7', ']', '[', '8', ']', '[', '9', ']', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 
'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', '.', '[', '8', ']', '[', '10', ']', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati', ',', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(', 'Hinduism', ')', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', '(', 'Islam', ')', '.', '[', '11', ']', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu", "'", 'in', 'religious', 'context', 'in', '1649', '.', '[', '12', ']', 'In', 'the', '18th', 'century', ',', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', ',', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', '.', '[', '3', ']', '[', '6', ']', 'By', 'the', 'mid-19th', 'century', ',', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', ',', 'Sikhs', 'and', 'Jains', ',', '[', '3', ']', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century', '.', '[', '13', ']', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', ',', 'Buddhists', ',', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', '.', '[', '14', ']', '[', '15', ']', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', ',', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', '.', '[', '16', ']', '[', '17', ']', 'At', 'more', 'than', '1.03', 'billion', ',', '[', '18', ']', 'Hindus', 'are', 'the', 'world', "'s", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', '.', 'The', 'vast', 'majority', 'of', 'Hindus', ',', 'approximately', '966', 'million', ',', 'live', 'in', 'India', ',', 'according', 'to', 'India', "'s", '2011', 'census', '.', '[', 
'19', ']', 'After', 'India', ',', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', ',', 'in', 'decreasing', 'order', ':', 'Nepal', ',', 'Bangladesh', ',', 'Indonesia', ',', 'Pakistan', ',', 'Sri', 'Lanka', ',', 'United', 'States', ',', 'Malaysia', ',', 'United', 'Kingdom', 'and', 'Myanmar', '.', '[', '20', ']', 'These', 'together', 'accounted', 'for', '99', '%', 'of', 'the', 'world', "'s", 'Hindu', 'population', ',', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010', '.', '[', '20', ']']
[nltk_data] Downloading package punkt to /Users/jirlong/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Method 3. Design manually as Python function#
import math
def myfun(x, y):
    return math.sqrt(x**2 + y**2), x, y
print(myfun(3, 4))
(5.0, 3, 4)
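`myfun` returns three values packed into one tuple, as the printed `(5.0, 3, 4)` shows. A short sketch of unpacking that tuple into separate names (the variable names are illustrative):

```python
import math

def myfun(x, y):
    # Same function as above: the hypotenuse plus the two inputs.
    return math.sqrt(x**2 + y**2), x, y

hypotenuse, a, b = myfun(3, 4)  # tuple unpacking
print(hypotenuse)
print(a, b)
```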
def my_tokenizer(txt):
    tok = ""
    word_list = []
    for ch in txt:
        if ch == " ":
            word_list.append(tok)
            tok = ""
            # print("Word_list: ", word_list)
        else:
            tok += ch
            # print(tok)
    # Note: the last token is never appended, because the text does not end with a space.
    return word_list
word_list = my_tokenizer(text)
print(len(word_list))
409
tok = " "
if tok:
    print("Yes")
else:
    print("No")
Yes
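The check above prints `Yes` because a string holding one space is non-empty, and any non-empty string is truthy in Python. A brief sketch of the cases that matter for a tokenizer's `if tok:` test:

```python
candidates = ["", " ", " ".strip(), "Hindu"]

# bool() shows how each string behaves in an `if` condition.
truthiness = [bool(s) for s in candidates]
print(truthiness)
```

Only the empty string (including a space that has been stripped away) counts as false.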
# A problematic implementation for word tokenization.
def tokenize(text):
    tokens = []
    tok = ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
            tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens
print(len(tokenize(text)))
print(tokenize(text))
410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims.[6][a][b]\n\nThe', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population,', 'in', 'a', 'religious', 'or', 'cultural', 'sense,', 'is', 'unclear.[3][7]', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era,', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars.[7][8][9]', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages.[8][10]', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati,', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(Hinduism)', 'and', 
'contrasted', 'it', 'with', 'Turaka', 'dharma', '(Islam).[11]', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu'", 'in', 'religious', 'context', 'in', '1649.[12]', 'In', 'the', '18th', 'century,', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus,', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam.[3][6]', 'By', 'the', 'mid-19th', 'century,', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists,', 'Sikhs', 'and', 'Jains,[3]', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century.[13]', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus,', 'Buddhists,', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon.[14][15]', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant,', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory.[16][17]\n\nAt', 'more', 'than', '1.03', 'billion,[18]', 'Hindus', 'are', 'the', "world's", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims.', 'The', 'vast', 'majority', 'of', 'Hindus,', 'approximately', '966', 'million,', 'live', 'in', 'India,', 'according', 'to', "India's", '2011', 'census.[19]', 'After', 'India,', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are,', 'in', 'decreasing', 'order:', 'Nepal,', 'Bangladesh,', 'Indonesia,', 'Pakistan,', 'Sri', 'Lanka,', 'United', 'States,', 'Malaysia,', 'United', 'Kingdom', 'and', 'Myanmar.[20]', 'These', 'together', 'accounted', 'for', '99%', 'of', 'the', "world's", 'Hindu', 'population,', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010.[20]']
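One likely reason the implementation above is labelled problematic is visible in its output: tokens such as `'subcontinent.[3][4]\n\nThe'` still contain newlines, since only the space character is treated as a boundary. A minimal sketch of one possible fix treats every whitespace character as a boundary via `str.isspace()`; `tokenize_ws` is a hypothetical name.

```python
def tokenize_ws(text):
    """Split text into tokens at any whitespace character."""
    tokens = []
    tok = ""
    for ch in text:
        if ch.isspace():
            if tok:          # skip empty tokens caused by repeated whitespace
                tokens.append(tok)
            tok = ""
        else:
            tok += ch
    if tok:                  # flush the final token
        tokens.append(tok)
    return tokens

sample = "subcontinent.[3][4]\n\nThe historical  meaning"
print(tokenize_ws(sample))
```

On this sample the result matches `sample.split()`, which is a convenient way to sanity-check a hand-written tokenizer.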
How to compare if two lists are identical?#
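Python lists compare element by element with `==`, so two token lists are identical only if they have the same length and the same items in the same order. A sketch follows; `first_difference` is a hypothetical helper for locating where two lists diverge.

```python
def first_difference(a, b):
    """Return the first index where a and b differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one list is a prefix of the other
    return None

sample = "a rose by any other name"
tokens_1 = sample.split(" ")
tokens_2 = sample.split()

print(tokens_1 == tokens_2)
print(first_difference(tokens_1, tokens_2))
print(first_difference(["a", "rose"], ["a", "rose", "by"]))
```

The same comparison can be applied to the outputs of the tokenizers defined earlier to see whether, and where, they disagree.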
Counting#
from collections import Counter
tokens = word_tokenize(text)
word_count = Counter(tokens)
for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))
the 35
, 33
[ 30
] 30
. 18
and 16
of 14
to 12
in 12
Hindu 11
or 6
term 6
century 6
Hindus 6
a 5
The 5
as 4
for 4
Indian 4
3 4
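The brackets and the stray `3` near the top of the counts above come from Wikipedia citation markers such as `[1]` and `[a]`. A sketch of stripping them with a regular expression before tokenizing follows; the pattern is an assumption that fits the markers seen in this corpus, not a general Wikipedia cleaner.

```python
import re

# Matches bracketed markers like [1], [20], [a], [b].
CITATION = re.compile(r"\[[0-9a-z]+\]")

sample = "aspects of Hinduism.[1][2] It has been used by Muslims.[6][a][b]"
cleaned = CITATION.sub("", sample)
print(cleaned)
```

Cleaning the raw string first keeps the citation numbers from inflating the token counts later on.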
Stopword and sign removal#
Method 1. Removal of Punctuation Marks#
import string
print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok not in string.punctuation:
            clean_tokens.append(tok)
    return clean_tokens
print(remove_punctuation_marks(tokens))
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '1', '2', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '3', '4', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', '5', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', '6', 'By', 'the', '16th', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', '6', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', '3', '7', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars', '7', '8', '9', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', '8', '10', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 
'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', '11', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu", 'in', 'religious', 'context', 'in', '1649', '12', 'In', 'the', '18th', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', '3', '6', 'By', 'the', 'mid-19th', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', '3', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century', '13', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', '14', '15', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', '16', '17', 'At', 'more', 'than', '1.03', 'billion', '18', 'Hindus', 'are', 'the', 'world', "'s", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', '966', 'million', 'live', 'in', 'India', 'according', 'to', 'India', "'s", '2011', 'census', '19', 'After', 'India', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', '20', 'These', 'together', 'accounted', 'for', '99', 'of', 'the', 'world', "'s", 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010', 
'20']
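`tok not in string.punctuation` is a substring test against one long string, so it works for single marks but can surprise on multi-character tokens: `'()'` is dropped because it happens to be a substring, while `'...'` is kept. A sketch of a stricter check that removes a token only when every character is a punctuation mark follows; `is_all_punctuation` is a hypothetical name.

```python
import string

PUNCT = set(string.punctuation)

def is_all_punctuation(tok):
    """True if tok is non-empty and consists only of punctuation characters."""
    return bool(tok) and all(ch in PUNCT for ch in tok)

sample_tokens = ["Hindu", "...", "``", "'s", "self-identity", ","]
kept = [tok for tok in sample_tokens if not is_all_punctuation(tok)]
print(kept)
```

Tokens that mix letters and punctuation, such as `'s` and `self-identity`, survive this filter by design.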
Method 2. Removing all tokens that contain characters other than letters.#
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok.isalpha():
            clean_tokens.append(tok)
    return clean_tokens
print(remove_punctuation_marks(tokens))
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'wars', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', 'and', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', 'The', 'and', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', 'in', 'religious', 'context', 'in', 
'In', 'the', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', 'By', 'the', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'century', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', 'At', 'more', 'than', 'billion', 'Hindus', 'are', 'the', 'world', 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', 'million', 'live', 'in', 'India', 'according', 'to', 'India', 'census', 'After', 'India', 'the', 'next', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', 'These', 'together', 'accounted', 'for', 'of', 'the', 'world', 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', 'million', 'Hindus', 'in']
A shorter implementation with a Python list comprehension.
def remove_punctuation_marks(tokens):
    return [tok for tok in tokens if tok.isalpha()]
print(remove_punctuation_marks(tokens))
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'wars', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', 'and', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', 'The', 'and', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', 'in', 'religious', 'context', 'in', 
'In', 'the', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', 'By', 'the', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'century', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', 'At', 'more', 'than', 'billion', 'Hindus', 'are', 'the', 'world', 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', 'million', 'live', 'in', 'India', 'according', 'to', 'India', 'census', 'After', 'India', 'the', 'next', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', 'These', 'together', 'accounted', 'for', 'of', 'the', 'world', 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', 'million', 'Hindus', 'in']
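Filtering with `.isalpha()` also discards tokens that mix letters with other characters: compared with the Method 1 output, `'self-identity'`, `'Hindu-Muslim'`, and `'1st'` are gone. If such tokens should survive, one alternative is to keep any token containing at least one letter. This is a sketch of that design choice, with `has_letter` as a hypothetical helper.

```python
def has_letter(tok):
    """True if at least one character in tok is alphabetic."""
    return any(ch.isalpha() for ch in tok)

sample_tokens = ["self-identity", "Hindu-Muslim", "1st", "1649", "[", "'s", "99"]

strict = [tok for tok in sample_tokens if tok.isalpha()]
lenient = [tok for tok in sample_tokens if has_letter(tok)]

print(strict)
print(lenient)
```

The lenient filter keeps hyphenated words and ordinals but also fragments like `'s`, so the better choice depends on the downstream analysis.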
New counting results after removing punctuation marks and digits.
tokens = remove_punctuation_marks(tokens)
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
    print("%s\t%d" % (w, c))
the 35
and 16
of 14
to 12
in 12
Hindu 11
or 6
term 6
century 6
Hindus 6
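These counts are case-sensitive: the earlier top-20 list shows `the` (35) and `The` (5) as separate entries. A sketch of folding case before counting, on a small stand-in token list:

```python
from collections import Counter

sample_tokens = ["The", "term", "Hindu", "the", "term", "the", "Hindus"]

case_sensitive = Counter(sample_tokens)
case_folded = Counter(tok.lower() for tok in sample_tokens)

print(case_sensitive["the"], case_sensitive["The"])
print(case_folded["the"])
```

Lowercasing also merges proper nouns with common words, so whether to apply it depends on the question being asked of the corpus.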
Stopword Removal#
Load an English stopword list from NLTK.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = stopwords.words('english')
print(stopword_list)
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/jirlong/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Remove stopwords from the tokens.
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(tokens))
['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'It', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'By', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'The', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval', 'wars', 'A', 'sense', 'Hindu', 'identity', 'term', 'Hindu', 'appears', 'texts', 'dated', 'century', 'Sanskrit', 'regional', 'languages', 'The', 'Indian', 'poets', 'Vidyapati', 'Kabir', 'Eknath', 'used', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'contrasted', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'term', 'religious', 'context', 'In', 'century', 'European', 'merchants', 'colonists', 'began', 'refer', 'followers', 'Indian', 'religions', 'collectively', 'Hindus', 'contrast', 'Mohamedans', 'Mughals', 'Arabs', 'following', 'Islam', 'By', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'Hindus', 'Buddhists', 'Sikhs', 'Jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'Hindu', 'century', 'Scholars', 'state', 'custom', 'distinguishing', 'Hindus', 'Buddhists', 'Jains', 'Sikhs', 'modern', 'phenomenon', 'Hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 
'considered', 'derogatory', 'At', 'billion', 'Hindus', 'world', 'third', 'largest', 'group', 'Christians', 'Muslims', 'The', 'vast', 'majority', 'Hindus', 'approximately', 'million', 'live', 'India', 'according', 'India', 'census', 'After', 'India', 'next', 'countries', 'largest', 'Hindu', 'populations', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'Myanmar', 'These', 'together', 'accounted', 'world', 'Hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'Hindus']
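The filter above checks every token against a Python list, which is a linear scan per token. A minimal sketch of the same idea with a set (average constant-time membership tests) and a list comprehension follows. The names `toy_stopwords` and `remove_stopwords_fast` are hypothetical, and the tiny stopword set is hand-written for illustration rather than taken from NLTK.

```python
# A small hand-written stopword set (an assumption for illustration;
# the notebook itself uses NLTK's English list).
toy_stopwords = {"the", "of", "to", "a", "and", "it"}

def remove_stopwords_fast(tokens, stopwords):
    """Return the tokens that are not in `stopwords` (case-sensitive)."""
    return [tok for tok in tokens if tok not in stopwords]

toy_tokens = ["The", "meaning", "of", "the", "term", "Hindu"]
filtered = remove_stopwords_fast(toy_tokens, toy_stopwords)
print(filtered)
```

As in the output above, the capitalized "The" survives because the comparison is case-sensitive and the stopword list is all lowercase.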
Handle Capitalization in English#
Solution 1: Converting all characters to lowercase.#
def lowercase(tokens):
    tokens_lower = []
    for tok in tokens:
        tokens_lower.append(tok.lower())
    return tokens_lower
print(remove_stopwords(lowercase(tokens)))
['hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'indian', 'subcontinent', 'historical', 'meaning', 'term', 'hindu', 'evolved', 'time', 'starting', 'persian', 'greek', 'references', 'land', 'indus', 'millennium', 'bce', 'texts', 'medieval', 'era', 'term', 'hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'indian', 'subcontinent', 'around', 'beyond', 'sindhu', 'indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'turkic', 'muslims', 'b', 'historical', 'development', 'hindu', 'within', 'local', 'south', 'asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'competing', 'theories', 'state', 'hindu', 'identity', 'developed', 'british', 'colonial', 'era', 'developed', 'century', 'ce', 'islamic', 'invasion', 'medieval', 'wars', 'sense', 'hindu', 'identity', 'term', 'hindu', 'appears', 'texts', 'dated', 'century', 'sanskrit', 'regional', 'languages', 'indian', 'poets', 'vidyapati', 'kabir', 'eknath', 'used', 'phrase', 'hindu', 'dharma', 'hinduism', 'contrasted', 'turaka', 'dharma', 'islam', 'christian', 'friar', 'sebastiao', 'manrique', 'used', 'term', 'religious', 'context', 'century', 'european', 'merchants', 'colonists', 'began', 'refer', 'followers', 'indian', 'religions', 'collectively', 'hindus', 'contrast', 'mohamedans', 'mughals', 'arabs', 'following', 'islam', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'hindus', 'buddhists', 'sikhs', 'jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'hindu', 'century', 'scholars', 'state', 'custom', 'distinguishing', 'hindus', 'buddhists', 'jains', 'sikhs', 'modern', 'phenomenon', 'hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'considered', 'derogatory', 'billion', 'hindus', 'world', 
'third', 'largest', 'group', 'christians', 'muslims', 'vast', 'majority', 'hindus', 'approximately', 'million', 'live', 'india', 'according', 'india', 'census', 'india', 'next', 'countries', 'largest', 'hindu', 'populations', 'decreasing', 'order', 'nepal', 'bangladesh', 'indonesia', 'pakistan', 'sri', 'lanka', 'united', 'states', 'malaysia', 'united', 'kingdom', 'myanmar', 'together', 'accounted', 'world', 'hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'hindus']
Solution 2: Maintain the capitalization.#
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        # Compare in lowercase, but keep the token's original capitalization.
        if tok.lower() not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(tokens))
['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval', 'wars', 'sense', 'Hindu', 'identity', 'term', 'Hindu', 'appears', 'texts', 'dated', 'century', 'Sanskrit', 'regional', 'languages', 'Indian', 'poets', 'Vidyapati', 'Kabir', 'Eknath', 'used', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'contrasted', 'Turaka', 'dharma', 'Islam', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'term', 'religious', 'context', 'century', 'European', 'merchants', 'colonists', 'began', 'refer', 'followers', 'Indian', 'religions', 'collectively', 'Hindus', 'contrast', 'Mohamedans', 'Mughals', 'Arabs', 'following', 'Islam', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'Hindus', 'Buddhists', 'Sikhs', 'Jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'Hindu', 'century', 'Scholars', 'state', 'custom', 'distinguishing', 'Hindus', 'Buddhists', 'Jains', 'Sikhs', 'modern', 'phenomenon', 'Hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'considered', 'derogatory', 'billion', 'Hindus', 'world', 
'third', 'largest', 'group', 'Christians', 'Muslims', 'vast', 'majority', 'Hindus', 'approximately', 'million', 'live', 'India', 'according', 'India', 'census', 'India', 'next', 'countries', 'largest', 'Hindu', 'populations', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'Myanmar', 'together', 'accounted', 'world', 'Hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'Hindus']
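The two solutions can be compared side by side on a toy input. This is only a sketch: the function names `drop_lowercased` and `drop_keep_case` and the four-word stopword set are hypothetical stand-ins for the notebook's `lowercase`, `remove_stopwords`, and NLTK list.

```python
# Hypothetical miniature stopword set and token list for illustration.
toy_stopwords = {"the", "in", "it", "by"}
toy_tokens = ["The", "term", "Hindu", "in", "India"]

def drop_lowercased(tokens, stopwords):
    # Solution 1: lowercase every token, then filter.
    return [t.lower() for t in tokens if t.lower() not in stopwords]

def drop_keep_case(tokens, stopwords):
    # Solution 2: compare in lowercase, but keep the original token.
    return [t for t in tokens if t.lower() not in stopwords]

lowered = drop_lowercased(toy_tokens, toy_stopwords)
kept = drop_keep_case(toy_tokens, toy_stopwords)
print(lowered)
print(kept)
```

Both variants remove "The"; they differ only in whether proper nouns such as "Hindu" and "India" keep their capital letters.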
New counting results with the removal of stopwords.#
word_count = Counter(remove_stopwords(tokens))
for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))
Hindu 11
term 6
century 6
Hindus 6
Indian 4
used 3
cultural 3
religious 3
subcontinent 3
texts 3
colonial 3
world 3
India 3
Hinduism 2
identifier 2
people 2
historical 2
Indus 2
medieval 2
era 2
Unicase results with the removal of stopwords.#
word_count = Counter(remove_stopwords(lowercase(tokens)))
for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))
hindu 11
term 6
century 6
hindus 6
indian 4
used 3
cultural 3
religious 3
subcontinent 3
texts 3
colonial 3
world 3
india 3
hinduism 2
identifier 2
people 2
historical 2
indus 2
medieval 2
era 2
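The two top-20 lists above agree apart from letter case, but the counts can diverge when a word occurs both capitalized and in lowercase. A small sketch with hypothetical tokens shows how lowercasing merges such variants into one entry.

```python
from collections import Counter

# Hypothetical tokens in which the same word appears in two casings.
toy_tokens = ["Scholars", "state", "State", "scholars", "state"]

cased = Counter(toy_tokens)                        # case-preserving counts
unicase = Counter(t.lower() for t in toy_tokens)   # case-folded counts

print(cased.most_common())
print(unicase.most_common())
```

With case preserved, "state" and "State" are counted separately; after lowercasing they collapse into a single entry with the combined count.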
Stemming#
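Stemming with the Snowball algorithm as implemented in NLTK. The English Snowball stemmer also lowercases its input, so the counts below are case-insensitive.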
Reference: http://snowball.tartarus.org/texts/introduction.html
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
stemmed_tokens = []
for tok in remove_stopwords(tokens):
    stemmed_tokens.append(snowball_stemmer.stem(tok))
word_count = Counter(stemmed_tokens)
for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))
hindu 11
term 6
centuri 6
hindus 6
refer 4
cultur 4
religi 4
use 4
indian 4
histor 3
subcontin 3
text 3
develop 3
popul 3
state 3
coloni 3
islam 3
world 3
india 3
ethnic 2
hinduism 2
geograph 2
identifi 2
peopl 2
indus 2
mediev 2
era 2
live 2
began 2
muslim 2
within 2
sens 2
ident 2
dharma 2
contrast 2
christian 2
follow 2
distinguish 2
buddhist 2
sikh 2
jain 2
consid 2
largest 2
million 2
unit 2
togeth 2
person 1
regard 1
adher 1
aspect 1
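Stems such as "cultur" or "religi" are not dictionary words, so it can help to keep track of which surface forms were merged into each stem. The sketch below groups tokens by stem. `toy_stem` is a deliberately crude, hypothetical stand-in (it is not the Snowball algorithm); in the notebook one could pass `snowball_stemmer.stem` as the `stem` argument instead.

```python
from collections import defaultdict

def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one suffix."""
    token = token.lower()
    for suffix in ("ally", "al", "ous", "s"):
        # Only strip if at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def group_by_stem(tokens, stem):
    """Map each stem to the set of surface forms that produced it."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[stem(tok)].add(tok)
    return groups

toy_tokens = ["cultural", "culturally", "texts", "text", "Hindu"]
groups = group_by_stem(toy_tokens, toy_stem)
for stem, forms in sorted(groups.items()):
    print(stem, sorted(forms))
```

Passing the stemmer in as a function keeps the grouping logic independent of which stemming algorithm is used.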
Lemmatization#
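Perform lemmatization with WordNet, a lexical database of English, via NLTK. This is a lazy version that does not require part-of-speech information: it tries each part of speech in turn and returns the first lemma that differs from the input token.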
import nltk
from nltk.stem import WordNetLemmatizer
# The WordNet data must be downloaded once. Without it, lemmatize() raises
# "LookupError: Resource wordnet not found."
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatize(token):
    # Try each WordNet part of speech in turn:
    # VERB (v), NOUN (n), ADJ (a), ADV (r), ADJ_SAT (s)
    for p in ['v', 'n', 'a', 'r', 's']:
        l = wordnet_lemmatizer.lemmatize(token, pos=p)
        if l != token:
            return l
    return token
print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))
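Show the differences between stemming and lemmatization. Each output row lists the word, its Snowball stem, and its lemma; only words whose stem and lemma differ are printed.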
for w in [
        'open', 'opens', 'opened', 'opening', 'unopened',
        'talk', 'talks', 'talked', 'talking',
        'decompose', 'decomposes', 'decomposed', 'decomposing',
        'do', 'does', 'did',
        'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
        'lied', 'lies', 'lay', 'lain', 'lying',
        'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly',
        'install', 'installed', 'uninstall',
        'internalization', 'internationalization',
        'decontextualization', 'decontextualized', 'decentralization', 'decentralized']:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))
unopened unopen unopened
decompose decompos decompose
decomposes decompos decompose
decomposed decompos decompose
decomposing decompos decompose
does doe do
did did do
wrote wrote write
written written write
ran ran run
gave gave give
held held hold
went went go
gone gone go
lain lain lie
people peopl people
feet feet foot
women women woman
smoothly smooth smoothly
firstly first firstly
secondly second secondly
install instal install
uninstall uninstal uninstall
internalization intern internalization
internationalization internation internationalization
decontextualization decontextu decontextualization
decontextualized decontextu decontextualized
decentralization decentr decentralization
decentralized decentr decentralize
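The table suggests why the two approaches behave differently: a stemmer strips suffixes by rule, while a lemmatizer consults a dictionary, including irregular forms. The sketch below illustrates the dictionary side with a toy lookup. `IRREGULAR`, `VOCAB`, and `toy_lemmatize` are hypothetical and only loosely modeled on the idea of exception lists plus checked suffix rules; this is not NLTK's implementation.

```python
# Hypothetical miniature "dictionary" for illustration only.
IRREGULAR = {"went": "go", "gone": "go", "feet": "foot", "women": "woman"}
VOCAB = {"go", "foot", "woman", "talk", "open"}

def toy_lemmatize(token):
    """Look up irregular forms first, then try suffix rules checked against VOCAB."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix):
            candidate = token[: -len(suffix)]
            # Accept a stripped form only if it is a known word.
            if candidate in VOCAB:
                return candidate
    return token

for w in ["went", "feet", "talked", "opening", "unopened"]:
    print(w, toy_lemmatize(w))
```

Because a stripped candidate is accepted only when it is a known word, "unopened" is left unchanged here, which mirrors the "unopened" row in the table above.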
New counting results with lemmatization.
lemmatized_tokens = []
for tok in remove_stopwords(tokens):
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)
for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))
Hindu 11
term 6
century 6
Hindus 6
use 4
Indian 4
refer 3
cultural 3
religious 3
subcontinent 3
text 3
population 3
colonial 3
world 3
India 3
Hinduism 2
identifier 2
people 2
historical 2
Indus 2
medieval 2
era 2
live 2
begin 2
Muslims 2
within 2
sense 2
state 2
identity 2
develop 2
dharma 2
contrast 2
Islam 2
distinguish 2
Buddhists 2
Sikhs 2
Jains 2
consider 2
large 2
million 2
United 2
together 2
person 1
regard 1
culturally 1
ethnically 1
religiously 1
adhere 1
aspect 1
historically 1
Applications: Generate data for WordCloud rendering.#
https://www.jasondavies.com/wordcloud/
repeated_tokens = []
for w, c in word_count.most_common():
    for i in range(c):
        repeated_tokens.append(w)
print(" ".join(repeated_tokens))
Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu term term term term term term century century century century century century Hindus Hindus Hindus Hindus Hindus Hindus use use use use Indian Indian Indian Indian refer refer refer cultural cultural cultural religious religious religious subcontinent subcontinent subcontinent text text text population population population colonial colonial colonial world world world India India India Hinduism Hinduism identifier identifier people people historical historical Indus Indus medieval medieval era era live live begin begin Muslims Muslims within within sense sense state state identity identity develop develop dharma dharma contrast contrast Islam Islam distinguish distinguish Buddhists Buddhists Sikhs Sikhs Jains Jains consider consider large large million million United United together together person regard culturally ethnically religiously adhere aspect historically geographical late indigenous mean evolve time Starting Persian Greek reference land millennium BCE imply geographic ethnic around beyond Sindhu river resident Turkic b development local South Asian unclear Competing theory British CE Islamic invasion war appear date Sanskrit regional language poet Vidyapati Kabir Eknath phrase Turaka Christian friar Sebastiao Manrique context European merchant colonist follower religion collectively Mohamedans Mughals Arabs follow orientalist law continue scope Scholars custom modern phenomenon Hindoo archaic spell variant whose today may derogatory billion third group Christians vast majority approximately accord census next country decrease order Nepal Bangladesh Indonesia Pakistan Sri Lanka States Malaysia Kingdom Myanmar account remain nation
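The output above includes every word that occurs only once, which can clutter a word cloud. A minimal sketch of the same repetition step with an optional frequency cutoff follows. The function name `wordcloud_text` and the `min_count` parameter are hypothetical additions, and the counts are toy values.

```python
from collections import Counter

def wordcloud_text(counts, min_count=1):
    """Repeat each word `count` times, most frequent first.

    Words occurring fewer than `min_count` times are skipped.
    """
    parts = []
    for word, count in counts.most_common():
        if count >= min_count:
            parts.extend([word] * count)
    return " ".join(parts)

# Toy counts (hypothetical, not computed from the corpus).
toy_counts = Counter({"Hindu": 3, "term": 2, "era": 1})
cloud_text = wordcloud_text(toy_counts, min_count=2)
print(cloud_text)
```

With `min_count=1` the function reproduces the behavior of the loop above; higher values drop the long tail of rare words.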