# TM01 Tokenization

## Loading Data

* Open the corpus in the file "hindu.txt" from https://en.wikipedia.org/wiki/Hindu
* Other public resource: http://www.gutenberg.org/
* Never "push" protected text to GitHub or other publicly available platforms.

### An Example: Wikipedia

In [1]:
# !pip install wikipedia
import wikipedia 
import string
# cv = wikipedia.page("Taipei")
# text = cv.content
# print(cv.url)
# print("The length of Taipei page is ", len(text))
# print(text[:100])

# text = wikipedia.page("Rembrandt").content
# print(len(text))

text  = wikipedia.summary("Rembrandt", sentences = 10)
print(type(text))
print("The length of Rembrandt summary is ", len(text))

<class 'str'>
The length of Rembrandt summary is  1796


### One more example

In [2]:
text  = wikipedia.summary("Hindus", sentences = 10)
print(type(text))
print("The length of Hindus summary is ", len(text))

<class 'str'>
The length of Hindus summary is  1549


In [11]:
!wget -N https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
!ls

--2021-05-01 14:05:12--  https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747 (2.7K) [text/plain]
Saving to: ‘hindu.txt’


Last-modified header missing -- time-stamps turned off.
2021-05-01 14:05:12 (67.5 MB/s) - ‘hindu.txt’ saved [2747/2747]

hindu.txt  hindu.txt.1	hindu.txt.2  sample_data


In [4]:
with open("hindu.txt", encoding="utf-8") as fin:
    text = fin.read()

### Length of the corpus (in characters) 

In [5]:
print("The length of the corpus: %d" % len(text))

The length of the corpus: 2747


### Content

In [6]:
print(text)

Hindu refers to any person who regards themselves as culturally, ethnically, or religiously adhering to aspects of Hinduism.[1][2] It has historically been used as a geographical, cultural, and later religious identifier for people indigenous to the Indian subcontinent.[3][4]

The historical meaning of the term Hindu has evolved with time. Starting with the Persian and Greek references to the land of the Indus in the 1st millennium BCE through the texts of the medieval era,[5] the term Hindu implied a geographic, ethnic or cultural identifier for people living in the Indian subcontinent around or beyond the Sindhu (Indus) river.[6] By the 16th century, the term began to refer to residents of the subcontinent who were not Turkic or Muslims.[6][a][b]

The historical development of Hindu self-identity within the local South Asian population, in a religious or cultural sense, is unclear.[3][7] Competing theories state that Hindu identity developed in the British colonial era, or that it de

## Tokenization

### Method 1. Using the built-in `.split()`

In [7]:
sentence_a = "What’s in a name? That which we call a rose by any other name would smell as sweet."
print(sentence_a.split(" "))

sentence_b = "2020/04/07 00:08:00"
print(sentence_b.split("/"))

['What’s', 'in', 'a', 'name?', 'That', 'which', 'we', 'call', 'a', 'rose', 'by', 'any', 'other', 'name', 'would', 'smell', 'as', 'sweet.']
['2020', '04', '07 00:08:00']


In [17]:
print("123".isalpha())
print("abc".isalpha())

False
True


In [18]:
print(len(text.split(" ")))
print(text.split(" "))


410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to',
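
Note that `split(" ")` splits only on the space character, which is why the list above contains the token `'subcontinent.[3][4]\n\nThe'`. As a small sketch (the sample string is made up for illustration), calling `split()` with no argument splits on any run of whitespace instead:

```python
sample = "first paragraph.\n\nSecond  paragraph"

# Splitting on a single space keeps newlines inside tokens and
# yields an empty string for every extra consecutive space.
by_space = sample.split(" ")

# Splitting with no argument treats any run of whitespace
# (spaces, tabs, newlines) as one separator.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

For multi-paragraph text such as this corpus, `split()` is usually the safer default of the two.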

### Method 2. Using NLTK's `word_tokenize`

In [20]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
print(len(word_tokenize(text)))
print(word_tokenize(text))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
566
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', ',', 'ethnically', ',', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '.', '[', '1', ']', '[', '2', ']', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', ',', 'cultural', ',', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '.', '[', '3', ']', '[', '4', ']', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', '.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', ',', '[', '5', ']', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', ',', 'ethnic', 'or', 'cultural', 'identi
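
The token count rises from 410 to 566 because `word_tokenize` separates punctuation marks from words. To get a feel for that idea without NLTK, here is a rough regular-expression approximation. It is only a sketch, not what NLTK does internally, and the name `regex_tokenize` is made up for this example; it will differ from `word_tokenize` on contractions, among other cases.

```python
import re

def regex_tokenize(txt):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", txt)

sample_tokens = regex_tokenize("aspects of Hinduism.[1][2] It has")
print(sample_tokens)
```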

### Method 3. Writing a tokenizer manually as a Python function

In [21]:
import math
def myfun(x, y):
    return math.sqrt(x**2 + y**2), x, y
print(myfun(3, 4))

(5.0, 3, 4)


In [22]:
def my_tokenizer(txt):
    tok = ""
    word_list = []

    for ch in txt:
        if ch == " ":
            word_list.append(tok)
            tok = ""
#             print("Word_list: ", word_list)
        else:
            tok += ch
#             print(tok)
    return word_list
            
word_list = my_tokenizer(text)
print(len(word_list))
        

409
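
The result is 409 rather than the 410 tokens returned by `text.split(" ")`. A token is appended only when a space is seen, so the characters collected after the last space are never added. A minimal reproduction on a made-up sentence (the function name is chosen for this example):

```python
def tokenizer_without_flush(txt):
    tok = ""
    word_list = []
    for ch in txt:
        if ch == " ":
            word_list.append(tok)
            tok = ""
        else:
            tok += ch
    # Nothing is appended here, so whatever is left in `tok` is lost.
    return word_list

sample = "a rose by any other name"
without_flush = tokenizer_without_flush(sample)
with_split = sample.split(" ")

print(without_flush)
print(with_split)
```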


In [23]:
tok = " "
if tok:
    print("Yes")
else:
    print("No")

Yes
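
The check `if tok:` relies on Python's truthiness rules for strings: only the empty string is false, while a string containing a single space is still true. A quick sketch:

```python
truthiness = {}
for candidate in ["", " ", "rose"]:
    # bool() applies the same test that `if candidate:` would.
    truthiness[candidate] = bool(candidate)
    print(repr(candidate), bool(candidate))
```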


In [24]:
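# A revised tokenizer: it flushes the last token and skips empty tokens.
# It is still problematic for real text because it splits only on spaces,
# so punctuation and newlines stay attached to the tokens.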

def tokenize(text):
    tokens = []
    tok = ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens

print(len(tokenize(text)))
print(tokenize(text))


410
['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.[1][2]', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent.[3][4]\n\nThe', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,[5]', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.[6]', 'By', 'the', '16th', 'century,', 'the', 'term', 'began', 'to',

### How to compare if two lists are identical?
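
Two Python lists are equal under `==` when they have the same length and the same elements in the same order. The sketch below repeats the logic of the `tokenize` function from the previous cell and compares it with `split(" ")` on two made-up sentences; the two methods agree only when there are no consecutive spaces.

```python
def tokenize(text):
    # Same logic as the tokenizer in the previous cell.
    tokens, tok = [], ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens

single_spaced = "a rose by any other name"
double_spaced = "a rose  by any other name"  # two spaces after "rose"

same_single = tokenize(single_spaced) == single_spaced.split(" ")
same_double = tokenize(double_spaced) == double_spaced.split(" ")
print(same_single, same_double)

# Positions where the two results disagree (zip stops at the shorter list).
diffs = [(i, a, b)
         for i, (a, b) in enumerate(zip(tokenize(double_spaced),
                                        double_spaced.split(" ")))
         if a != b]
print(diffs)
```

Comparing lengths first is a cheap sanity check, but equal lengths alone do not imply equal lists.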


### Counting

In [25]:
from collections import Counter

tokens = word_tokenize(text)
word_count = Counter(tokens)

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

the	35
,	33
[	30
]	30
.	18
and	16
of	14
to	12
in	12
Hindu	11
or	6
term	6
century	6
Hindus	6
a	5
The	5
as	4
for	4
Indian	4
3	4
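
`Counter` is a dictionary subclass that maps each item to its count. A small sketch on a made-up token list shows the operations used in this section:

```python
from collections import Counter

sample_tokens = ["the", "term", "Hindu", "the", "term", "the"]
sample_count = Counter(sample_tokens)

print(sample_count["the"])          # count of one token
print(sample_count["missing"])      # absent tokens count as 0, no KeyError
print(sample_count.most_common(2))  # (token, count) pairs, highest count first
```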


## Stopword and sign removal

### Method 1. Removal of Punctuation Marks

In [26]:
import string
print(string.punctuation)       

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [27]:
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok not in string.punctuation:
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '1', '2', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '3', '4', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', '5', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', '6', 'By', 'the', '16th', 'century', 'the', 'term', 'began', 'to', 'ref
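
One caveat: `string.punctuation` is a string, so `tok not in string.punctuation` is a substring test. It behaves as intended for single-character tokens, but for longer tokens the result depends on whether the characters happen to be adjacent in `string.punctuation`. A stricter variant, sketched here with the made-up helper name `is_punctuation`, checks every character against a set:

```python
import string

punct_chars = set(string.punctuation)

def is_punctuation(tok):
    # True only if the token is non-empty and every character is punctuation.
    return bool(tok) and all(ch in punct_chars for ch in tok)

results = {}
for tok in ["", ",", "()", "''", "era"]:
    results[tok] = (tok in string.punctuation, is_punctuation(tok))
    print(repr(tok), results[tok])
```

The first value in each pair is the substring test, the second is the per-character test.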

### Method 2. Removing all tokens that contain characters other than letters. 

In [28]:
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok.isalpha():
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcont

A shorter implementation with a Python list comprehension.

In [29]:
def remove_punctuation_marks(tokens):
    return [tok for tok in tokens if tok.isalpha()]

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the', 'subcont
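
Square brackets build a list comprehension, which creates the whole list immediately. Round brackets with the same syntax build a generator expression, which produces items lazily and can be consumed only once. A short sketch on made-up tokens:

```python
sample_tokens = ["Hinduism", ".", "[", "1", "]", "It", "1st"]

as_list = [tok for tok in sample_tokens if tok.isalpha()]       # built immediately
as_generator = (tok for tok in sample_tokens if tok.isalpha())  # lazy, single use

print(as_list)
first_pass = list(as_generator)
second_pass = list(as_generator)  # the generator is already exhausted
print(first_pass, second_pass)
```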

New counting results after removing punctuation marks and digits.

In [30]:
tokens = remove_punctuation_marks(tokens)
word_count = Counter(tokens)

for w, c in word_count.most_common(10):
    print("%s\t%d" % (w, c))
    

the	35
and	16
of	14
to	12
in	12
Hindu	11
or	6
term	6
century	6
Hindus	6


### Stopword Removal

Load an English stopword list from NLTK.

In [35]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = stopwords.words('english')
print(stopword_list)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'ea

Remove stopwords from the tokens.

In [36]:
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean

print(remove_stopwords(tokens))

['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'It', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'The', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'By', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'The', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasi
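
Tokens such as `'It'`, `'The'`, and `'By'` survive in the output above because the stopword list is all lowercase and the membership test is case-sensitive. The sketch below reproduces this with a hypothetical mini stopword list (not the NLTK one). It stores the stopwords in a `set`, which also makes each membership test faster than scanning a list.

```python
# Hypothetical mini stopword list, only for illustration.
mini_stopwords = {"the", "it", "by", "of"}

sample_tokens = ["The", "term", "of", "the", "era", "It", "By"]

# Case-sensitive filtering: capitalized stopwords are not recognized.
kept = [tok for tok in sample_tokens if tok not in mini_stopwords]
print(kept)
```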

## Handle Capitalization in English

### Solution 1: Converting all characters to lowercase. 

In [37]:
def lowercase(tokens):
    tokens_lower = []
    for tok in tokens:
        tokens_lower.append(tok.lower())
    return tokens_lower

print(remove_stopwords(lowercase(tokens)))


['hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'indian', 'subcontinent', 'historical', 'meaning', 'term', 'hindu', 'evolved', 'time', 'starting', 'persian', 'greek', 'references', 'land', 'indus', 'millennium', 'bce', 'texts', 'medieval', 'era', 'term', 'hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'indian', 'subcontinent', 'around', 'beyond', 'sindhu', 'indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'turkic', 'muslims', 'b', 'historical', 'development', 'hindu', 'within', 'local', 'south', 'asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'competing', 'theories', 'state', 'hindu', 'identity', 'developed', 'british', 'colonial', 'era', 'developed', 'century', 'ce', 'islamic', 'invasion', 'medieval', 'wars', '

### Solution 2: Maintain the capitalization.

In [38]:
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok.lower() not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(tokens))


['Hindu', 'refers', 'person', 'regards', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'historically', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'indigenous', 'Indian', 'subcontinent', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'b', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval', 'wars', '

### New counting results with the removal of stopwords.

In [39]:
word_count = Counter(remove_stopwords(tokens))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

Hindu	11
term	6
century	6
Hindus	6
Indian	4
used	3
cultural	3
religious	3
subcontinent	3
texts	3
colonial	3
world	3
India	3
Hinduism	2
identifier	2
people	2
historical	2
Indus	2
medieval	2
era	2


### Lowercased results with the removal of stopwords.

In [40]:
word_count = Counter(remove_stopwords(lowercase(tokens)))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

hindu	11
term	6
century	6
hindus	6
indian	4
used	3
cultural	3
religious	3
subcontinent	3
texts	3
colonial	3
world	3
india	3
hinduism	2
identifier	2
people	2
historical	2
indus	2
medieval	2
era	2
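
Lowercasing matters for counting whenever the same word occurs in more than one capitalization, for example at the start of a sentence. A sketch with made-up tokens:

```python
from collections import Counter

sample_tokens = ["Competing", "theories", "competing", "Theories", "state"]

cased = Counter(sample_tokens)
lowered = Counter(tok.lower() for tok in sample_tokens)

print(cased.most_common())
print(lowered.most_common())
```

With the original capitalization every token is counted once; after lowercasing, the two spellings of each word are merged into a single entry.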


## Stemming

Stemming with the Snowball algorithm, as implemented in NLTK.

Reference: http://snowball.tartarus.org/texts/introduction.html

In [41]:
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in remove_stopwords(tokens):
    stemmed_tokens.append(snowball_stemmer.stem(tok))
word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))


hindu	11
term	6
centuri	6
hindus	6
refer	4
cultur	4
religi	4
use	4
indian	4
histor	3
subcontin	3
text	3
develop	3
popul	3
state	3
coloni	3
islam	3
world	3
india	3
ethnic	2
hinduism	2
geograph	2
identifi	2
peopl	2
indus	2
mediev	2
era	2
live	2
began	2
muslim	2
within	2
sens	2
ident	2
dharma	2
contrast	2
christian	2
follow	2
distinguish	2
buddhist	2
sikh	2
jain	2
consid	2
largest	2
million	2
unit	2
togeth	2
person	1
regard	1
adher	1
aspect	1
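
To build some intuition for why stems such as `centuri` or `peopl` are not dictionary words, it helps to think of a stemmer as a set of suffix-rewriting rules. The function below is a deliberately naive suffix stripper written for this illustration; it is not the Snowball algorithm, and its rules and the name `naive_stem` are assumptions.

```python
def naive_stem(word):
    # Toy suffix stripping, NOT the Snowball algorithm.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["talks", "talked", "talking", "texts", "went"]}
print(stems)
```

Rule-based stripping maps regular inflections to a common stem but leaves irregular forms such as `went` untouched.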


## Lemmatization

Perform lemmatization with WordNet, a lexical database, via NLTK. This is a lazy version that does not require part-of-speech information: it tries each part-of-speech tag in turn and returns the first lemma that differs from the input token.

In [44]:
# Uncomment the next two lines on the first run to download the WordNet data.
# import nltk
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
    # ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
    for p in ['v', 'n', 'a', 'r', 's']:
        l = wordnet_lemmatizer.lemmatize(token, pos=p)
        if l != token:
            return l
    return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))


Dogs
dog
hit
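
The output shows that `'Dogs'` is returned unchanged while `'dogs'` becomes `'dog'`, so the lookup is case-sensitive. One possible workaround is to lowercase the token before the lookup. The sketch below uses a hypothetical toy lemma table in place of WordNet so that it runs without NLTK data; the table and both function names are made up for this example.

```python
# Hypothetical toy lemma table standing in for WordNet.
TOY_LEMMAS = {"dogs": "dog", "hits": "hit", "feet": "foot"}

def toy_lemmatize(token):
    # Case-sensitive lookup, like the behaviour observed above.
    return TOY_LEMMAS.get(token, token)

def toy_lemmatize_ignoring_case(token):
    # Look up the lowercased form; fall back to the original token if unknown.
    return TOY_LEMMAS.get(token.lower(), token)

print(toy_lemmatize("Dogs"), toy_lemmatize_ignoring_case("Dogs"))
```

Note that this variant returns the lemma in lowercase, which may or may not be desirable for proper nouns.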


Show the differences between stemming and lemmatization.

In [45]:
for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    'install', 'installed', 'uninstall',
    'internalization', 'internationalization',
    'decontextualization', 'decontextualized', 'decentralization', 'decentralized']:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))
    


unopened	unopen	unopened
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lain	lain	lie
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly
install	instal	install
uninstall	uninstal	uninstall
internalization	intern	internalization
internationalization	internation	internationalization
decontextualization	decontextu	decontextualization
decontextualized	decontextu	decontextualized
decentralization	decentr	decentralization
decentralized	decentr	decentralize


New counting results with lemmatization. 

In [46]:
lemmatized_tokens = []
for tok in remove_stopwords(tokens):
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))


Hindu	11
term	6
century	6
Hindus	6
use	4
Indian	4
refer	3
cultural	3
religious	3
subcontinent	3
text	3
population	3
colonial	3
world	3
India	3
Hinduism	2
identifier	2
people	2
historical	2
Indus	2
medieval	2
era	2
live	2
begin	2
Muslims	2
within	2
sense	2
state	2
identity	2
develop	2
dharma	2
contrast	2
Islam	2
distinguish	2
Buddhists	2
Sikhs	2
Jains	2
consider	2
large	2
million	2
United	2
together	2
person	1
regard	1
culturally	1
ethnically	1
religiously	1
adhere	1
aspect	1
historically	1


## Applications: Generate data for WordCloud rendering.

https://www.jasondavies.com/wordcloud/

In [47]:
repeated_tokens = []
for w, c in word_count.most_common():
    for i in range(c):
        repeated_tokens.append(w)
print(" ".join(repeated_tokens))


Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu term term term term term term century century century century century century Hindus Hindus Hindus Hindus Hindus Hindus use use use use Indian Indian Indian Indian refer refer refer cultural cultural cultural religious religious religious subcontinent subcontinent subcontinent text text text population population population colonial colonial colonial world world world India India India Hinduism Hinduism identifier identifier people people historical historical Indus Indus medieval medieval era era live live begin begin Muslims Muslims within within sense sense state state identity identity develop develop dharma dharma contrast contrast Islam Islam distinguish distinguish Buddhists Buddhists Sikhs Sikhs Jains Jains consider consider large large million million United United together together person regard culturally ethnically religiously adhere aspect historically geographical late indigenous mean evolve time Starting P
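
The nested loop above can also be written as a single generator expression passed to `join`. A sketch on a small made-up `Counter`:

```python
from collections import Counter

sample_count = Counter({"Hindu": 3, "term": 2, "era": 1})

# Repeat each word `count` times, most frequent first.
cloud_text = " ".join(
    w for w, c in sample_count.most_common() for _ in range(c)
)
print(cloud_text)
```

The resulting string can be pasted into a word cloud generator that sizes words by how often they occur in the input.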