TM01 Tokenization

Loading Data#

The corpus in the file “hindu.txt” is taken from https://en.wikipedia.org/wiki/Hindu
Another public source of texts: http://www.gutenberg.org/
Never “push” protected text to GitHub or other publicly available platforms.

An Example: Wikipedia#

# !pip install wikipedia
import wikipedia 
import string
# cv = wikipedia.page("Taipei")
# text = cv.content
# print(cv.url)
# print("The length of Taipei page is ", len(text))
# print(text[:100])

# text = wikipedia.page("Rembrandt").content
# print(len(text))

text = wikipedia.summary("Rembrandt", sentences=10)
print(type(text))
print("The length of Rembrandt summary is ", len(text))

<class 'str'>
The length of Rembrandt summary is  1796

One more example#

text = wikipedia.summary("Hindus", sentences=10)
print(type(text))
print("The length of Hindus summary is ", len(text))

<class 'str'>
The length of Hindus summary is  1547

!wget -N https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
!ls

--2025-09-21 22:03:18--  https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747 (2.7K) [text/plain]
Saving to: 'hindu.txt'


hindu.txt             0%[                    ]       0  --.-KB/s               

hindu.txt            99%[==================> ]   2.68K  3.79KB/s               
hindu.txt           100%[===================>]   2.68K  3.77KB/s    in 0.7s    

Last-modified header missing -- time-stamps turned off.
2025-09-21 22:03:27 (3.77 KB/s) - 'hindu.txt' saved [2747/2747]

blessing.txt                           TM05_sentiment.ipynb
data                                   TM06_Clustering.ipynb
hindu.txt                              TM07_1_embeddings.ipynb
img                                    TM07_2_doc2vec.ipynb
lib                                    TM07_3_get_wordEmbeddings.ipynb
stopwords_zh-tw.txt                    TM07_embedding_clustering.ipynb
test.html                              TM08_doc_classification_pipeline.ipynb
TM00.ipynb                             TM08_doc_classification.ipynb
TM01_tokenization.ipynb                TM09_BERTopic.ipynb
TM01_tokenization(chi).ipynb           TM09_re.ipynb
TM02_collocation.ipynb                 TM0X Using CKIPtagger.ipynb
TM03_2_pos_sentiment_chi.ipynb         TM10_Using_Open_AI.ipynb
TM03_POS Tagging.ipynb                 userdict.txt
TM04_IR_search.ipynb

with open("hindu.txt", encoding="utf-8") as fin:
    text = fin.read()

Length of the corpus (in characters)#

print("The length of the corpus: %d" % len(text))

The length of the corpus: 2747

Content#

print(text)

Hindu refers to any person who regards themselves as culturally, ethnically, or religiously adhering to aspects of Hinduism.[1][2] It has historically been used as a geographical, cultural, and later religious identifier for people indigenous to the Indian subcontinent.[3][4]

The historical meaning of the term Hindu has evolved with time. Starting with the Persian and Greek references to the land of the Indus in the 1st millennium BCE through the texts of the medieval era,[5] the term Hindu implied a geographic, ethnic or cultural identifier for people living in the Indian subcontinent around or beyond the Sindhu (Indus) river.[6] By the 16th century, the term began to refer to residents of the subcontinent who were not Turkic or Muslims.[6][a][b]

The historical development of Hindu self-identity within the local South Asian population, in a religious or cultural sense, is unclear.[3][7] Competing theories state that Hindu identity developed in the British colonial era, or that it developed post-8th century CE after the Islamic invasion and medieval Hindu-Muslim wars.[7][8][9] A sense of Hindu identity and the term Hindu appears in some texts dated between the 13th and 18th century in Sanskrit and regional languages.[8][10] The 14th- and 18th-century Indian poets such as Vidyapati, Kabir and Eknath used the phrase Hindu dharma (Hinduism) and contrasted it with Turaka dharma (Islam).[11] The Christian friar Sebastiao Manrique used the term 'Hindu' in religious context in 1649.[12] In the 18th century, the European merchants and colonists began to refer to the followers of Indian religions collectively as Hindus, in contrast to Mohamedans for Mughals and Arabs following Islam.[3][6] By the mid-19th century, colonial orientalist texts further distinguished Hindus from Buddhists, Sikhs and Jains,[3] but the colonial laws continued to consider all of them to be within the scope of the term Hindu until about mid-20th century.[13] Scholars state that the custom of distinguishing between Hindus, Buddhists, Jains and Sikhs is a modern phenomenon.[14][15] Hindoo is an archaic spelling variant, whose use today may be considered derogatory.[16][17]

At more than 1.03 billion,[18] Hindus are the world's third largest group after Christians and Muslims. The vast majority of Hindus, approximately 966 million, live in India, according to India's 2011 census.[19] After India, the next 9 countries with the largest Hindu populations are, in decreasing order: Nepal, Bangladesh, Indonesia, Pakistan, Sri Lanka, United States, Malaysia, United Kingdom and Myanmar.[20] These together accounted for 99% of the world's Hindu population, and the remaining nations of the world together had about 6 million Hindus in 2010.[20]

Tokenization#

Method 1. by built-in `.split()`#

sentence_a = "What’s in a name? That which we call a rose by any other name would smell as sweet."
print(sentence_a.split(" "))

sentence_b = "2020/04/07 00:08:00"
print(sentence_b.split("/"))

['What’s', 'in', 'a', 'name?', 'That', 'which', 'we', 'call', 'a', 'rose', 'by', 'any', 'other', 'name', 'would', 'smell', 'as', 'sweet.']
['2020', '04', '07 00:08:00']

print("123".isalpha())
print("abc".isalpha())

False
True

print(len(text.split(" ")))
print(text.split(" "))

410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims.[6][a][b]\n\nThe', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population,', 'in', 'a', 'religious', 'or', 'cultural', 'sense,', 'is', 'unclear.[3][7]', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era,', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars.[7][8][9]', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages.[8][10]', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati,', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(Hinduism)', 'and', 
'contrasted', 'it', 'with', 'Turaka', 'dharma', '(Islam).[11]', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu'", 'in', 'religious', 'context', 'in', '1649.[12]', 'In', 'the', '18th', 'century,', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus,', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam.[3][6]', 'By', 'the', 'mid-19th', 'century,', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists,', 'Sikhs', 'and', 'Jains,[3]', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century.[13]', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus,', 'Buddhists,', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon.[14][15]', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant,', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory.[16][17]\n\nAt', 'more', 'than', '1.03', 'billion,[18]', 'Hindus', 'are', 'the', "world's", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims.', 'The', 'vast', 'majority', 'of', 'Hindus,', 'approximately', '966', 'million,', 'live', 'in', 'India,', 'according', 'to', "India's", '2011', 'census.[19]', 'After', 'India,', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are,', 'in', 'decreasing', 'order:', 'Nepal,', 'Bangladesh,', 'Indonesia,', 'Pakistan,', 'Sri', 'Lanka,', 'United', 'States,', 'Malaysia,', 'United', 'Kingdom', 'and', 'Myanmar.[20]', 'These', 'together', 'accounted', 'for', '99%', 'of', 'the', "world's", 'Hindu', 'population,', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010.[20]']
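
The list above still contains tokens such as 'subcontinent.[3][4]\n\nThe', because `.split(" ")` cuts only at single space characters. A minimal sketch of the difference, using a made-up sample string rather than the corpus: calling `.split()` with no argument splits on any run of whitespace, including newlines.

```python
# Hypothetical sample string (not from the corpus) with a blank line and a double space.
sample = "first paragraph.\n\nSecond  paragraph here."

by_space = sample.split(" ")    # cuts at every single space character
by_whitespace = sample.split()  # cuts at any run of whitespace

print(by_space)
print(by_whitespace)
```

With `.split(" ")` the newline pair stays inside a token and the double space yields an empty string; with `.split()` neither happens.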

Method 2. by nltk’s function#

import nltk
nltk.download('punkt')  # recent NLTK releases may require nltk.download('punkt_tab') instead
from nltk.tokenize import word_tokenize
print(len(word_tokenize(text)))
print(word_tokenize(text))

566
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', ',', 'ethnically', ',', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '.', '[', '1', ']', '[', '2', ']', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', ',', 'cultural', ',', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '.', '[', '3', ']', '[', '4', ']', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', '.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', ',', '[', '5', ']', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', ',', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(', 'Indus', ')', 'river', '.', '[', '6', ']', 'By', 'the', '16th', 'century', ',', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', '.', '[', '6', ']', '[', 'a', ']', '[', 'b', ']', 'The', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population', ',', 'in', 'a', 'religious', 'or', 'cultural', 'sense', ',', 'is', 'unclear', '.', '[', '3', ']', '[', '7', ']', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', ',', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars', '.', '[', '7', ']', '[', '8', ']', '[', '9', ']', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 
'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', '.', '[', '8', ']', '[', '10', ']', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati', ',', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(', 'Hinduism', ')', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', '(', 'Islam', ')', '.', '[', '11', ']', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu", "'", 'in', 'religious', 'context', 'in', '1649', '.', '[', '12', ']', 'In', 'the', '18th', 'century', ',', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', ',', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', '.', '[', '3', ']', '[', '6', ']', 'By', 'the', 'mid-19th', 'century', ',', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', ',', 'Sikhs', 'and', 'Jains', ',', '[', '3', ']', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century', '.', '[', '13', ']', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', ',', 'Buddhists', ',', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', '.', '[', '14', ']', '[', '15', ']', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', ',', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', '.', '[', '16', ']', '[', '17', ']', 'At', 'more', 'than', '1.03', 'billion', ',', '[', '18', ']', 'Hindus', 'are', 'the', 'world', "'s", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', '.', 'The', 'vast', 'majority', 'of', 'Hindus', ',', 'approximately', '966', 'million', ',', 'live', 'in', 'India', ',', 'according', 'to', 'India', "'s", '2011', 'census', '.', '[', 
'19', ']', 'After', 'India', ',', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', ',', 'in', 'decreasing', 'order', ':', 'Nepal', ',', 'Bangladesh', ',', 'Indonesia', ',', 'Pakistan', ',', 'Sri', 'Lanka', ',', 'United', 'States', ',', 'Malaysia', ',', 'United', 'Kingdom', 'and', 'Myanmar', '.', '[', '20', ']', 'These', 'together', 'accounted', 'for', '99', '%', 'of', 'the', 'world', "'s", 'Hindu', 'population', ',', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010', '.', '[', '20', ']']

[nltk_data] Downloading package punkt to /Users/jirlong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Method 3. Writing a tokenizer manually as a Python function#

import math
def myfun(x, y):
    return math.sqrt(x**2 + y**2), x, y
print(myfun(3, 4))

(5.0, 3, 4)

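def my_tokenizer(txt):
    """Naive first attempt: cut a token whenever a space is seen."""
    tok = ""
    word_list = []

    for ch in txt:
        if ch == " ":
            # A space ends the current token (even if the token is empty).
            word_list.append(tok)
            tok = ""
        else:
            tok += ch
    # Note: the last token is never appended, because no space follows it.
    return word_list

word_list = my_tokenizer(text)
print(len(word_list))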

tok = " "
if tok:
    print("Yes")
else:
    print("No")

Yes
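
The check above prints "Yes" because a string holding a single space is non-empty and therefore truthy; only the empty string is falsy. A small sketch of the distinction (the helper name `is_truthy` is made up for illustration):

```python
def is_truthy(value):
    # bool() applies the same truth test that `if value:` uses.
    return bool(value)

print(is_truthy(" "))   # a one-space string is non-empty
print(is_truthy(""))    # the empty string is the only falsy string
print(is_truthy("0"))   # any non-empty string counts as true
```

This is why a test such as `if tok:` can be used to skip empty tokens.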

# A problematic implementation for word tokenization: it splits only on the
# space character, so newlines and punctuation stay attached to the tokens.

def tokenize(text):
    tokens = []
    tok = ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens

print(len(tokenize(text)))
print(tokenize(text))

410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims.[6][a][b]\n\nThe', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population,', 'in', 'a', 'religious', 'or', 'cultural', 'sense,', 'is', 'unclear.[3][7]', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era,', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars.[7][8][9]', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages.[8][10]', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati,', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', '(Hinduism)', 'and', 
'contrasted', 'it', 'with', 'Turaka', 'dharma', '(Islam).[11]', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu'", 'in', 'religious', 'context', 'in', '1649.[12]', 'In', 'the', '18th', 'century,', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus,', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam.[3][6]', 'By', 'the', 'mid-19th', 'century,', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists,', 'Sikhs', 'and', 'Jains,[3]', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century.[13]', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus,', 'Buddhists,', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon.[14][15]', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant,', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory.[16][17]\n\nAt', 'more', 'than', '1.03', 'billion,[18]', 'Hindus', 'are', 'the', "world's", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims.', 'The', 'vast', 'majority', 'of', 'Hindus,', 'approximately', '966', 'million,', 'live', 'in', 'India,', 'according', 'to', "India's", '2011', 'census.[19]', 'After', 'India,', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are,', 'in', 'decreasing', 'order:', 'Nepal,', 'Bangladesh,', 'Indonesia,', 'Pakistan,', 'Sri', 'Lanka,', 'United', 'States,', 'Malaysia,', 'United', 'Kingdom', 'and', 'Myanmar.[20]', 'These', 'together', 'accounted', 'for', '99%', 'of', 'the', "world's", 'Hindu', 'population,', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010.[20]']
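
Because the function above tests only for the space character, tokens such as 'subcontinent.[3][4]\n\nThe' keep their newlines. One possible variant, sketched under the assumption that any whitespace should end a token, uses `str.isspace()`; the name `tokenize_whitespace` is hypothetical.

```python
def tokenize_whitespace(text):
    """Split text into tokens at any whitespace character (space, tab, newline)."""
    tokens = []
    tok = ""
    for ch in text:
        if ch.isspace():
            if tok:              # skip empty tokens from repeated whitespace
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:                      # keep the final token
        tokens.append(tok)
    return tokens

sample = "subcontinent.[3][4]\n\nThe historical  meaning"
print(tokenize_whitespace(sample))
```

Punctuation and citation markers still stay attached to the words; this sketch only addresses the whitespace issue.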

How to check whether two lists are identical?#
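
One way to answer this, sketched here on small made-up lists rather than the corpus: in Python, `==` on two lists is true only when they have the same length and equal elements in the same order. The helper `first_difference` is hypothetical and reports where two lists start to differ.

```python
def first_difference(list_a, list_b):
    """Return the index of the first differing position, or None if the lists are equal."""
    for i, (a, b) in enumerate(zip(list_a, list_b)):
        if a != b:
            return i
    if len(list_a) != len(list_b):
        # One list is a prefix of the other; they differ where the shorter one ends.
        return min(len(list_a), len(list_b))
    return None

tokens_a = "a rose by any other name".split(" ")
tokens_b = ["a", "rose", "by", "any", "other", "name"]
tokens_c = ["a", "rose", "by", "any", "other"]

print(tokens_a == tokens_b)
print(tokens_a == tokens_c)
print(first_difference(tokens_a, tokens_c))
```

With the corpus, the same comparison could be applied to `text.split(" ")` and `tokenize(text)`, both of which report 410 tokens above.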

Counting#

from collections import Counter

tokens = word_tokenize(text)
word_count = Counter(tokens)

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

the	35
,	33
[	30
]	30
.	18
and	16
of	14
to	12
in	12
Hindu	11
or	6
term	6
century	6
Hindus	6
a	5
The	5
as	4
for	4
Indian	4
3	4
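
In the counts above, "the" and "The" are tallied separately because `Counter` treats strings case-sensitively. A minimal sketch of case-insensitive counting, using a short made-up token list in place of the corpus:

```python
from collections import Counter

# Hypothetical sample tokens; in the notebook these would come from word_tokenize(text).
sample_tokens = ["The", "term", "Hindu", "the", "term", "the", "Hindus"]

case_sensitive = Counter(sample_tokens)
case_folded = Counter(tok.lower() for tok in sample_tokens)

print(case_sensitive["the"], case_sensitive["The"])
print(case_folded["the"])
```

Lowercasing also merges proper nouns with common words, so whether to apply it depends on the analysis.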

Stopword and sign removal#

Method 1. Removal of Punctuation Marks#

import string
print(string.punctuation)       

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

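def remove_punctuation_marks(tokens):
    # Compare against a set of single characters; `tok not in string.punctuation`
    # would be a substring test against the whole punctuation string.
    punctuation = set(string.punctuation)
    clean_tokens = []
    for tok in tokens:
        if tok not in punctuation:
            clean_tokens.append(tok)
    return clean_tokens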

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '1', '2', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '3', '4', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', '5', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', '6', 'By', 'the', '16th', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', '6', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'self-identity', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', '3', '7', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'post-8th', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'Hindu-Muslim', 'wars', '7', '8', '9', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', '13th', 'and', '18th', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', '8', '10', 'The', '14th-', 'and', '18th-century', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 
'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', '11', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', "'Hindu", 'in', 'religious', 'context', 'in', '1649', '12', 'In', 'the', '18th', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', '3', '6', 'By', 'the', 'mid-19th', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', '3', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'mid-20th', 'century', '13', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', '14', '15', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', '16', '17', 'At', 'more', 'than', '1.03', 'billion', '18', 'Hindus', 'are', 'the', 'world', "'s", 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', '966', 'million', 'live', 'in', 'India', 'according', 'to', 'India', "'s", '2011', 'census', '19', 'After', 'India', 'the', 'next', '9', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', '20', 'These', 'together', 'accounted', 'for', '99', 'of', 'the', 'world', "'s", 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', '6', 'million', 'Hindus', 'in', '2010', 
'20']

Method 2. Removing all tokens that contain characters other than letters.#

def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok.isalpha():
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'wars', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', 'and', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', 'The', 'and', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', 'in', 'religious', 'context', 'in', 
'In', 'the', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', 'By', 'the', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'century', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', 'At', 'more', 'than', 'billion', 'Hindus', 'are', 'the', 'world', 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', 'million', 'live', 'in', 'India', 'according', 'to', 'India', 'census', 'After', 'India', 'the', 'next', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', 'These', 'together', 'accounted', 'for', 'of', 'the', 'world', 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', 'million', 'Hindus', 'in']

A shorter implementation with a Python list comprehension.

def remove_punctuation_marks(tokens):
    return [tok for tok in tokens if tok.isalpha()]

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcontinent', 'who', 'were', 'not', 'Turkic', 'or', 'Muslims', 'a', 'b', 'The', 'historical', 'development', 'of', 'Hindu', 'within', 'the', 'local', 'South', 'Asian', 'population', 'in', 'a', 'religious', 'or', 'cultural', 'sense', 'is', 'unclear', 'Competing', 'theories', 'state', 'that', 'Hindu', 'identity', 'developed', 'in', 'the', 'British', 'colonial', 'era', 'or', 'that', 'it', 'developed', 'century', 'CE', 'after', 'the', 'Islamic', 'invasion', 'and', 'medieval', 'wars', 'A', 'sense', 'of', 'Hindu', 'identity', 'and', 'the', 'term', 'Hindu', 'appears', 'in', 'some', 'texts', 'dated', 'between', 'the', 'and', 'century', 'in', 'Sanskrit', 'and', 'regional', 'languages', 'The', 'and', 'Indian', 'poets', 'such', 'as', 'Vidyapati', 'Kabir', 'and', 'Eknath', 'used', 'the', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'and', 'contrasted', 'it', 'with', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'the', 'term', 'in', 'religious', 'context', 'in', 
'In', 'the', 'century', 'the', 'European', 'merchants', 'and', 'colonists', 'began', 'to', 'refer', 'to', 'the', 'followers', 'of', 'Indian', 'religions', 'collectively', 'as', 'Hindus', 'in', 'contrast', 'to', 'Mohamedans', 'for', 'Mughals', 'and', 'Arabs', 'following', 'Islam', 'By', 'the', 'century', 'colonial', 'orientalist', 'texts', 'further', 'distinguished', 'Hindus', 'from', 'Buddhists', 'Sikhs', 'and', 'Jains', 'but', 'the', 'colonial', 'laws', 'continued', 'to', 'consider', 'all', 'of', 'them', 'to', 'be', 'within', 'the', 'scope', 'of', 'the', 'term', 'Hindu', 'until', 'about', 'century', 'Scholars', 'state', 'that', 'the', 'custom', 'of', 'distinguishing', 'between', 'Hindus', 'Buddhists', 'Jains', 'and', 'Sikhs', 'is', 'a', 'modern', 'phenomenon', 'Hindoo', 'is', 'an', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'be', 'considered', 'derogatory', 'At', 'more', 'than', 'billion', 'Hindus', 'are', 'the', 'world', 'third', 'largest', 'group', 'after', 'Christians', 'and', 'Muslims', 'The', 'vast', 'majority', 'of', 'Hindus', 'approximately', 'million', 'live', 'in', 'India', 'according', 'to', 'India', 'census', 'After', 'India', 'the', 'next', 'countries', 'with', 'the', 'largest', 'Hindu', 'populations', 'are', 'in', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'and', 'Myanmar', 'These', 'together', 'accounted', 'for', 'of', 'the', 'world', 'Hindu', 'population', 'and', 'the', 'remaining', 'nations', 'of', 'the', 'world', 'together', 'had', 'about', 'million', 'Hindus', 'in']
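
The `isalpha()` filter also discards tokens such as 'self-identity' and 'Hindu-Muslim', since a hyphen is not a letter. If such words should be kept, one option (an assumption about the desired behaviour, not the notebook's method) is to ignore hyphens before applying the letter test; `keep_wordlike` is a hypothetical name.

```python
def keep_wordlike(tokens):
    """Keep tokens made only of letters and hyphens, with at least one letter."""
    return [tok for tok in tokens if tok.replace("-", "").isalpha()]

sample_tokens = ["Hindu", "self-identity", "18th-century", "[", "1649", "-", "'s"]
print(keep_wordlike(sample_tokens))
```

Tokens that mix digits and letters, such as '18th-century', are still removed by this variant.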

New counting results after removing punctuation marks and digits.

tokens = remove_punctuation_marks(tokens)
word_count = Counter(tokens)

for w, c in word_count.most_common(10):
    print("%s\t%d" % (w, c))
    

the	35
and	16
of	14
to	12
in	12
Hindu	11
or	6
term	6
century	6
Hindus	6

Stopword Removal#

Load an English stopword list from NLTK.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = stopwords.words('english')
print(stopword_list)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jirlong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Remove stopwords from the tokens.

def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean

print(remove_stopwords(tokens))

['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'It', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'By', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'The', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval', 'wars', 'A', 'sense', 'Hindu', 'identity', 'term', 'Hindu', 'appears', 'texts', 'dated', 'century', 'Sanskrit', 'regional', 'languages', 'The', 'Indian', 'poets', 'Vidyapati', 'Kabir', 'Eknath', 'used', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'contrasted', 'Turaka', 'dharma', 'Islam', 'The', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'term', 'religious', 'context', 'In', 'century', 'European', 'merchants', 'colonists', 'began', 'refer', 'followers', 'Indian', 'religions', 'collectively', 'Hindus', 'contrast', 'Mohamedans', 'Mughals', 'Arabs', 'following', 'Islam', 'By', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'Hindus', 'Buddhists', 'Sikhs', 'Jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'Hindu', 'century', 'Scholars', 'state', 'custom', 'distinguishing', 'Hindus', 'Buddhists', 'Jains', 'Sikhs', 'modern', 'phenomenon', 'Hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 
'considered', 'derogatory', 'At', 'billion', 'Hindus', 'world', 'third', 'largest', 'group', 'Christians', 'Muslims', 'The', 'vast', 'majority', 'Hindus', 'approximately', 'million', 'live', 'India', 'according', 'India', 'census', 'After', 'India', 'next', 'countries', 'largest', 'Hindu', 'populations', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'Myanmar', 'These', 'together', 'accounted', 'world', 'Hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'Hindus']
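Testing membership against a Python list scans the list item by item, so for larger corpora it may help to convert the stopword list to a set first. The sketch below assumes a small hand-picked subset of the stopwords printed above; `SAMPLE_STOPWORDS` and `remove_stopwords_fast` are illustrative names that do not appear in the original notebook.

```python
# Assumption: a tiny illustrative subset of the NLTK stopword list shown above.
SAMPLE_STOPWORDS = {"the", "of", "to", "in", "and", "or", "as", "a", "has"}

def remove_stopwords_fast(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens that are not stopwords (case-sensitive comparison)."""
    stopword_set = set(stopwords)  # set lookup takes constant time on average
    return [tok for tok in tokens if tok not in stopword_set]

sample = ["The", "historical", "meaning", "of", "the", "term", "Hindu", "has", "evolved"]
print(remove_stopwords_fast(sample))
```

In the notebook itself, `stopword_list` could be passed as the second argument. As with the function above, the comparison is case-sensitive, so a capitalized 'The' is kept.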

Handle Capitalization in English#

Solution 1: Converting all characters to lowercase.#

def lowercase(tokens):
    tokens_lower = []
    for tok in tokens:
        tokens_lower.append(tok.lower())
    return tokens_lower

print(remove_stopwords(lowercase(tokens)))

['hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'indian', 'subcontinent', 'historical', 'meaning', 'term', 'hindu', 'evolved', 'time', 'starting', 'persian', 'greek', 'references', 'land', 'indus', 'millennium', 'bce', 'texts', 'medieval', 'era', 'term', 'hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'indian', 'subcontinent', 'around', 'beyond', 'sindhu', 'indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'turkic', 'muslims', 'b', 'historical', 'development', 'hindu', 'within', 'local', 'south', 'asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'competing', 'theories', 'state', 'hindu', 'identity', 'developed', 'british', 'colonial', 'era', 'developed', 'century', 'ce', 'islamic', 'invasion', 'medieval', 'wars', 'sense', 'hindu', 'identity', 'term', 'hindu', 'appears', 'texts', 'dated', 'century', 'sanskrit', 'regional', 'languages', 'indian', 'poets', 'vidyapati', 'kabir', 'eknath', 'used', 'phrase', 'hindu', 'dharma', 'hinduism', 'contrasted', 'turaka', 'dharma', 'islam', 'christian', 'friar', 'sebastiao', 'manrique', 'used', 'term', 'religious', 'context', 'century', 'european', 'merchants', 'colonists', 'began', 'refer', 'followers', 'indian', 'religions', 'collectively', 'hindus', 'contrast', 'mohamedans', 'mughals', 'arabs', 'following', 'islam', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'hindus', 'buddhists', 'sikhs', 'jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'hindu', 'century', 'scholars', 'state', 'custom', 'distinguishing', 'hindus', 'buddhists', 'jains', 'sikhs', 'modern', 'phenomenon', 'hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'considered', 'derogatory', 'billion', 'hindus', 'world', 
'third', 'largest', 'group', 'christians', 'muslims', 'vast', 'majority', 'hindus', 'approximately', 'million', 'live', 'india', 'according', 'india', 'census', 'india', 'next', 'countries', 'largest', 'hindu', 'populations', 'decreasing', 'order', 'nepal', 'bangladesh', 'indonesia', 'pakistan', 'sri', 'lanka', 'united', 'states', 'malaysia', 'united', 'kingdom', 'myanmar', 'together', 'accounted', 'world', 'hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'hindus']

Solution 2: Maintain the capitalization.#

def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok.lower() not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(tokens))

['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval', 'wars', 'sense', 'Hindu', 'identity', 'term', 'Hindu', 'appears', 'texts', 'dated', 'century', 'Sanskrit', 'regional', 'languages', 'Indian', 'poets', 'Vidyapati', 'Kabir', 'Eknath', 'used', 'phrase', 'Hindu', 'dharma', 'Hinduism', 'contrasted', 'Turaka', 'dharma', 'Islam', 'Christian', 'friar', 'Sebastiao', 'Manrique', 'used', 'term', 'religious', 'context', 'century', 'European', 'merchants', 'colonists', 'began', 'refer', 'followers', 'Indian', 'religions', 'collectively', 'Hindus', 'contrast', 'Mohamedans', 'Mughals', 'Arabs', 'following', 'Islam', 'century', 'colonial', 'orientalist', 'texts', 'distinguished', 'Hindus', 'Buddhists', 'Sikhs', 'Jains', 'colonial', 'laws', 'continued', 'consider', 'within', 'scope', 'term', 'Hindu', 'century', 'Scholars', 'state', 'custom', 'distinguishing', 'Hindus', 'Buddhists', 'Jains', 'Sikhs', 'modern', 'phenomenon', 'Hindoo', 'archaic', 'spelling', 'variant', 'whose', 'use', 'today', 'may', 'considered', 'derogatory', 'billion', 'Hindus', 'world', 
'third', 'largest', 'group', 'Christians', 'Muslims', 'vast', 'majority', 'Hindus', 'approximately', 'million', 'live', 'India', 'according', 'India', 'census', 'India', 'next', 'countries', 'largest', 'Hindu', 'populations', 'decreasing', 'order', 'Nepal', 'Bangladesh', 'Indonesia', 'Pakistan', 'Sri', 'Lanka', 'United', 'States', 'Malaysia', 'United', 'Kingdom', 'Myanmar', 'together', 'accounted', 'world', 'Hindu', 'population', 'remaining', 'nations', 'world', 'together', 'million', 'Hindus']
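To make the difference between the approaches easier to see, here is a minimal side-by-side sketch on a handful of tokens. The stopword set is a small assumed subset of the NLTK list, and the three function names are hypothetical helpers introduced only for this comparison.

```python
# Assumption: a tiny illustrative subset of the NLTK stopword list.
SAMPLE_STOPWORDS = {"the", "it", "by", "in", "a"}

def filter_case_sensitive(tokens):
    # Original behavior: 'The' and 'In' do not match the lowercase stopwords.
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

def filter_lowercased(tokens):
    # Solution 1: lowercase everything, then filter.
    return [t.lower() for t in tokens if t.lower() not in SAMPLE_STOPWORDS]

def filter_keep_case(tokens):
    # Solution 2: compare in lowercase, but keep the original spelling.
    return [t for t in tokens if t.lower() not in SAMPLE_STOPWORDS]

sample = ["The", "Christian", "friar", "In", "the", "Indian", "subcontinent"]
print(filter_case_sensitive(sample))
print(filter_lowercased(sample))
print(filter_keep_case(sample))
```

Solution 2 removes the capitalized stopwords while leaving proper nouns such as 'Christian' and 'Indian' recognizable, which can matter if later steps rely on capitalization.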

New counting results with the removal of stopwords.#

word_count = Counter(remove_stopwords(tokens))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

Hindu	11
term	6
century	6
Hindus	6
Indian	4
used	3
cultural	3
religious	3
subcontinent	3
texts	3
colonial	3
world	3
India	3
Hinduism	2
identifier	2
people	2
historical	2
Indus	2
medieval	2
era	2

Lowercased results with the removal of stopwords.#

word_count = Counter(remove_stopwords(lowercase(tokens)))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

hindu	11
term	6
century	6
hindus	6
indian	4
used	3
cultural	3
religious	3
subcontinent	3
texts	3
colonial	3
world	3
india	3
hinduism	2
identifier	2
people	2
historical	2
indus	2
medieval	2
era	2

Stemming#

Stemming with the Snowball algorithm, as implemented in NLTK.

Reference: http://snowball.tartarus.org/texts/introduction.html

from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in remove_stopwords(tokens):
    stemmed_tokens.append(snowball_stemmer.stem(tok))
word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))

hindu	11
term	6
centuri	6
hindus	6
refer	4
cultur	4
religi	4
use	4
indian	4
histor	3
subcontin	3
text	3
develop	3
popul	3
state	3
coloni	3
islam	3
world	3
india	3
ethnic	2
hinduism	2
geograph	2
identifi	2
peopl	2
indus	2
mediev	2
era	2
live	2
began	2
muslim	2
within	2
sens	2
ident	2
dharma	2
contrast	2
christian	2
follow	2
distinguish	2
buddhist	2
sikh	2
jain	2
consid	2
largest	2
million	2
unit	2
togeth	2
person	1
regard	1
adher	1
aspect	1
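The counts above do not show which surface forms were merged into each stem. A small sketch that groups tokens by their Snowball stem might make this visible; `group_by_stem` is a hypothetical helper, and the sample words are taken from the token list above.

```python
from collections import defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def group_by_stem(tokens):
    """Map each stem to the set of distinct tokens that reduce to it."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[stemmer.stem(tok)].add(tok)
    return groups

sample = ["refers", "references", "refer", "cultural", "culturally", "century", "people"]
for stem, forms in group_by_stem(sample).items():
    print("%s\t%s" % (stem, ", ".join(sorted(forms))))
```

This explains, for example, why 'refer' and 'cultur' have higher counts after stemming than any single original word: several related forms collapse into one stem, which is not always a dictionary word.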

Lemmatization#

Perform lemmatization with WordNet, a lexical ontology, via NLTK. This is a lazy version that does not require part-of-speech information: it tries each part of speech in turn and returns the first lemma that differs from the input token.

# import nltk
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
    # WordNet parts of speech: ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
    for p in ['v', 'n', 'a', 'r', 's']:
        lemma = wordnet_lemmatizer.lemmatize(token, pos=p)
        # Return the first lemma that differs from the input token.
        if lemma != token:
            return lemma
    return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))

Dogs
dog
hit
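In the output above, 'Dogs' comes back unchanged while 'dogs' becomes 'dog', which suggests that the lookup is case-sensitive. One possible workaround is to retry with the lowercased token when the original form is not recognized. The sketch below uses a toy lookup table as a stand-in for WordNet so that it runs without any corpus download; `TOY_LEMMAS`, `toy_lemmatize`, and `lemmatize_ignore_case` are all hypothetical names.

```python
# Assumption: a toy (word, pos) -> lemma table standing in for WordNet.
TOY_LEMMAS = {
    ("dogs", "n"): "dog",
    ("hits", "v"): "hit",
    ("hits", "n"): "hit",
    ("largest", "a"): "large",
}

def toy_lemmatize(token, pos):
    """Case-sensitive lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get((token, pos), token)

def lemmatize_any_pos(token, lemmatize_fn=toy_lemmatize):
    """Try each part of speech and return the first lemma that differs."""
    for pos in ["v", "n", "a", "r", "s"]:
        lemma = lemmatize_fn(token, pos)
        if lemma != token:
            return lemma
    return token

def lemmatize_ignore_case(token, lemmatize_fn=toy_lemmatize):
    """Retry with the lowercased token if the original form is not recognized."""
    lemma = lemmatize_any_pos(token, lemmatize_fn)
    lowered = token.lower()
    if lemma == token and lowered != token:
        lowered_lemma = lemmatize_any_pos(lowered, lemmatize_fn)
        if lowered_lemma != lowered:
            return lowered_lemma
    return lemma

print(lemmatize_any_pos("Dogs"))      # unchanged: no entry for the capitalized form
print(lemmatize_ignore_case("Dogs"))  # falls back to the lowercase entry
```

With the real lemmatizer, a wrapper such as `lambda tok, pos: wordnet_lemmatizer.lemmatize(tok, pos=pos)` could be passed as `lemmatize_fn`. Note that the fallback returns a lowercase lemma, so it discards the original capitalization of any word it manages to lemmatize.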

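Show the differences between stemming and lemmatization. Only words whose stem and lemma differ are printed; the columns are the word, its stem, and its lemma.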

for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    'install', 'installed', 'uninstall',
    'internalization', 'internationalization',
    'decontextualization', 'decontextualized', 'decentralization', 'decentralized']:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))
    

unopened	unopen	unopened
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lain	lain	lie
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly
install	instal	install
uninstall	uninstal	uninstall
internalization	intern	internalization
internationalization	internation	internationalization
decontextualization	decontextu	decontextualization
decontextualized	decontextu	decontextualized
decentralization	decentr	decentralization
decentralized	decentr	decentralize

New counting results with lemmatization.

lemmatized_tokens = []
for tok in remove_stopwords(tokens):
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))

Hindu	11
term	6
century	6
Hindus	6
use	4
Indian	4
refer	3
cultural	3
religious	3
subcontinent	3
text	3
population	3
colonial	3
world	3
India	3
Hinduism	2
identifier	2
people	2
historical	2
Indus	2
medieval	2
era	2
live	2
begin	2
Muslims	2
within	2
sense	2
state	2
identity	2
develop	2
dharma	2
contrast	2
Islam	2
distinguish	2
Buddhists	2
Sikhs	2
Jains	2
consider	2
large	2
million	2
United	2
together	2
person	1
regard	1
culturally	1
ethnically	1
religiously	1
adhere	1
aspect	1
historically	1

Applications: Generate data for WordCloud rendering.#

https://www.jasondavies.com/wordcloud/

repeated_tokens = []
for w, c in word_count.most_common():
    for i in range(c):
        repeated_tokens.append(w)
print(" ".join(repeated_tokens))

Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu term term term term term term century century century century century century Hindus Hindus Hindus Hindus Hindus Hindus use use use use Indian Indian Indian Indian refer refer refer cultural cultural cultural religious religious religious subcontinent subcontinent subcontinent text text text population population population colonial colonial colonial world world world India India India Hinduism Hinduism identifier identifier people people historical historical Indus Indus medieval medieval era era live live begin begin Muslims Muslims within within sense sense state state identity identity develop develop dharma dharma contrast contrast Islam Islam distinguish distinguish Buddhists Buddhists Sikhs Sikhs Jains Jains consider consider large large million million United United together together person regard culturally ethnically religiously adhere aspect historically geographical late indigenous mean evolve time Starting Persian Greek reference land millennium BCE imply geographic ethnic around beyond Sindhu river resident Turkic b development local South Asian unclear Competing theory British CE Islamic invasion war appear date Sanskrit regional language poet Vidyapati Kabir Eknath phrase Turaka Christian friar Sebastiao Manrique context European merchant colonist follower religion collectively Mohamedans Mughals Arabs follow orientalist law continue scope Scholars custom modern phenomenon Hindoo archaic spell variant whose today may derogatory billion third group Christians vast majority approximately accord census next country decrease order Nepal Bangladesh Indonesia Pakistan Sri Lanka States Malaysia Kingdom Myanmar account remain nation
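The nested loop above can also be written as a single generator expression. The following is a minimal sketch with a hand-made `Counter`; `expand_counts` and `sample_count` are illustrative names, not part of the original notebook.

```python
from collections import Counter

def expand_counts(word_count):
    """Repeat each word by its count, most frequent first, joined by spaces."""
    return " ".join(w for w, c in word_count.most_common() for _ in range(c))

# Assumption: a small hand-made Counter instead of the notebook's word_count.
sample_count = Counter({"Hindu": 3, "term": 2, "century": 1})
print(expand_counts(sample_count))
```

`Counter.elements()` produces a similar repetition, but it yields words in the order they were first counted rather than by frequency, so `most_common()` is used here to keep the same ordering as the loop above.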