TM04 Feature Selection & IR & Search#

  • Given a large collection of documents, we want to extract the features that distinguish each document from the others.

Search#

Build Data#

Each doc has already been tokenized into a list of words.

import json
import pandas as pd

with open("data/reuters.json") as fin:
    documents = json.load(fin)

print("Number of documents: %d" % len(documents))
print(documents[0].keys()) # field: title, content
print(documents[0]['content'][:100])
doc_df = pd.DataFrame(documents)
doc_df
      title                                               content
0     ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...   [Mounting, trade, friction, between, the, U, ....
1     CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN S...   [survey, of, 19, provinces, and, seven, cities...
2     JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWN...   [The, Ministry, of, International, Trade, and,...
3     THAI TRADE DEFICIT WIDENS IN FIRST QUARTER          [Thailand, ', s, trade, deficit, widened, to, ...
4     INDONESIA SEES CPO PRICE RISING SHARPLY             [Indonesia, expects, crude, palm, oil, (, CPO,...
...   ...                                                 ...
5664  FED SETS TWO BILLION DLR CUSTOMER REPURCHASE        [The, Federal, Reserve, entered, the, U, ., S,...
5665  SENATE SEEKS U . S . PROBE OF CANADIAN CORN LEVY    [The, Senate, voted, unanimously, to, seek, an...
5666  U . K . MONEY MARKET SHORTAGE FORECAST REVISED...   [The, Bank, of, England, said, it, had, revise...
5667  KNIGHT - RIDDER INC &                               [lt, ;, KRN, >, SETS, QUARTERLY, Qtly, div, 25...
5668  NATIONWIDE CELLULAR SERVICE INC &                   [lt, ;, NCEL, >, 4TH, QTR, Shr, loss, six, cts...

5669 rows × 2 columns

Search for a single query term#

search(): returns the titles of all docs whose token list contains the query term. The comparison is an exact, case-sensitive token match.

def search(docs, query):
    results = []
    for doc in docs:
        if query in doc['content']:
            results.append(doc['title'])
    return results


query = "computer"

print(len(search(documents, query)))
print(search(documents, query))
54
['VW SAYS 480 MLN MARKS MAXIMUM FOR CURRENCY LOSSES', 'JAPAN WARNS U . S . IT MAY RETALIATE IN TRADE DISPUTE', 'SINGAPORE EXTERNAL TRADE GAINS 8 . 8 PCT IN QUARTER', 'TALKING POINT / IBM &', 'YEUTTER SAYS JAPANESE CURB ALL BUT CERTAIN U . S .', 'COMPUTERLAND TO BE ACQUIRED BY INVESTOR GROUP', 'ENTERTAINMENT MARKETING TOPS CRAZY EDDIE OFFER A', 'INVESTORS MAY TAKE COMPUTERLAND PUBLIC', 'SHULTZ WELCOMES TOKYO ECONOMIC PACKAGE U . S .', 'TOSHIBA , SHARP RESTRAIN LAP - TOP PC EXPORTS TO EC', 'MITSUI BUYS FIVE PCT STAKE IN U . S . CHIP MAKER', 'NEW LEADER COMING TO U . S . SEC IN CHALLENGING ERA', 'NAKASONE HARD - PRESSED TO SOOTHE U . S ANGER ON TRADE', 'HONEYWELL BULL SEES REVENUE GROWTH', "JAPAN ' S CHIP MAKERS ANGERED BY U . S . SANCTION PLANS", 'NAKASONE SOUNDS CONCILIATORY NOTE IN CHIP DISPUTE', 'TOKYO BIDS TO STOP CHIP ROW TURNING INTO TRADE WAR', 'JAPAN HAS LITTLE NEW TO OFFER IN MICROCHIP DISPUTE', 'NAKASONE SOUNDS CONCILIATORY NOTE IN CHIP DISPUTE', 'TOKYO BIDS TO STOP CHIP ROW BECOMING TRADE WAR', 'ECONOMIC SPOTLIGHT - U . S . CONGRESS RAPS JAPAN', 'EUROPE ON SIDELINES IN U . S - JAPAN MICROCHIP ROW', 'U . S . SETS CORN DEFICIENCY PAYMENT HALF PIK CERTS', 'U . S . STOCK MARKET OVERREACTS TO TARIFFS - YEUTTER U . S .', 'ENVOY ADVISES NAKASONE TO PREPARE FOR U . S . VISIT', 'U . S .- JAPAN NOT IN TRADE WAR , YEUTTER SAYS', 'JAPAN / U . S . WILL BE AT ODDS WHILE TRADE LOPSIDED', 'JAPAN WARNS OF ANTI - U . S . SENTIMENT IN TRADE ROW', 'JAPAN / U . S . WILL BE AT ODDS WHILE TRADE LOPSIDED', 'EC MINISTERS WILL DISCUSS STRENGTHENING EMS FLOAT', 'IMF , WORLD BANK TO MEET AMID NEW INFLATION FEARS', 'JAPAN CUTS CHIP SUPPLY , MAY PRODUCE SHORTAGE', 'BALDRIGE CONCERNED ABOUT KOREAN / TAIWAN DEFICITS', 'GATT COUNCIL DEFERS DECISION ON SEMICONDUCTORS', 'U . S . TELLS JAPAN TO DO MORE TO CUT TRADE SURPLUS U . S .', 'MATHEMATICAL APPLICATIONS SETS OPERATIONS SALE', 'EXCO BUYS U . S . GOVERNMENT SECURITIES BROKER &', 'JAPAN , U . S . 
SET TO BEGIN HIGH - LEVEL TRADE TALKS', "SEARCH FOR BRITISH FERRY ' S TOXIC CARGO CONTINUES", 'WEINBERGER OPPOSES FUJITSU BUYING U . S . FIRM', 'FAIRCHILD DEAL FAILURE SEEN MAKING JAPANESE WARY', 'SURALCO ALUMINA EXPORTS DROPPED 75 PCT IN FEB', 'VENTRA BUYS JOINT VENTURE LEASING', 'PAPER SAYS U . S . MAY SEEK TO CURB FOREIGN TAKEOVERS', 'ECONOMIC SPOTLIGHT - JAPAN BUYING OVERSEAS FIRMS', 'PILGRIM VENTURE IN MERGER AGREEMENT &', 'GROUP RECONSIDERS COMPUTER MEMORIES &', 'JAPAN TO ASK CHIP MAKERS TO SLASH OUTPUT FURTHER', 'SCAN - GRAPHICS TO MERGE WITH PUBLIC COMPANY', 'ECONOMIC SPOTLIGHT - U . S . DEFICIT WITH', 'WAVEHILL INTERNATIONAL TO MAKE ACQUISITION &', 'JAPAN IN LAST DITCH EFFORT TO SAVE CHIP PACT', 'BRAZIL COMPUTER MARKET TO REMAIN CLOSED - MINISTER', 'CAROLIAN SYSTEMS SEES LOWER FISCAL 1987 PROFIT &']

Search for multiple terms (AND)#

# AND matching: a doc matches only if every query term appears in its token list.
# match() checks term by term; match1() does the same test with set inclusion.
def match(doc_terms, query_terms):
    for term in query_terms:
        if term not in doc_terms:
            return False
    return True


def match1(doc_terms, query_terms):
    return set(query_terms) <= set(doc_terms)


def search(docs, query):
    query_terms = query.split(" ")
    results = []
    for doc in docs:
        if match(doc['content'], query_terms):
            results.append(doc['title'])
    return results

query = "Japan beef"
results = search(documents, query)
print("Number of matched results: %d" % len(results))
print("\n".join(results[0:10]))
Number of matched results: 21
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT
JAPAN MINISTRY SAYS OPEN FARM TRADE WOULD HIT U . S .
JAPAN ' S LDP URGES MORE IMPORTS OF 12 FARM ITEMS
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S .
LYNG OPENS JAPAN TALKS ON FARM TRADE BARRIERS U . S .
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S .
U . S . MEAT INDUSTRY LAUNCHES CAMPAIGN IN JAPAN
GATT CASE AGAINST JAPAN A MODEL FOR U . S . - LYNG
U . S . TAKES TOUGH STAND ON GATT FARM ISSUES
EC AGREES TRADE DEAL WITH ARGENTINA
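
The loop-based match() and the set-based match1() above should accept exactly the same documents. A minimal check on a made-up token list (toy_doc is hypothetical, not taken from the Reuters data), which also shows that matching is case-sensitive:

```python
def match(doc_terms, query_terms):
    # Loop version: fail as soon as one query term is missing.
    for term in query_terms:
        if term not in doc_terms:
            return False
    return True


def match1(doc_terms, query_terms):
    # Set version: the query terms must be a subset of the doc terms.
    return set(query_terms) <= set(doc_terms)


# Hypothetical tokenized document, for illustration only.
toy_doc = ["Japan", "raises", "beef", "import", "quota"]

for q in ["Japan beef", "Japan pork", "japan beef"]:
    terms = q.split(" ")
    print(q, match(toy_doc, terms), match1(toy_doc, terms))
```

Only the first query matches: "pork" is missing from the document, and lowercase "japan" is a different token from "Japan".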

TF-IDF#

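  • tf: term frequency, the number of times term t occurs in document d, written tf(t, d).

  • df: document frequency, the number of docs in which a term appears, written df(t).

  • idf: inverse document frequency, log(N / df(t)), where N is the total number of docs. Rare terms get a high idf; a term that appears in every doc gets 0.

  • tf-idf: term frequency times inverse document frequency, tf(t, d) × idf(t).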
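
These definitions can be tried on a tiny corpus before moving to the real data. The sketch below assumes three made-up tokenized documents (toy_docs and toy_df are illustrative names) and a base-2 logarithm; the base only rescales the scores.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized documents.
toy_docs = [
    ["beef", "trade", "japan", "beef"],
    ["trade", "oil", "trade"],
    ["japan", "trade", "rice"],
]

# df: number of documents in which each term occurs at least once.
toy_df = Counter()
for doc in toy_docs:
    toy_df.update(set(doc))

n_docs = len(toy_docs)
for i, doc in enumerate(toy_docs):
    tf = Counter(doc)  # tf(t, d): raw count of t in d
    scores = {t: f * math.log2(n_docs / toy_df[t]) for t, f in tf.items()}
    print(i, scores)
```

"trade" occurs in all three documents, so its idf is log2(3/3) = 0 and it scores 0 everywhere, however often it is repeated.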

Testing tf-idf#

def print_tfidf(td, ndoc, ntd):
    import math
    idf = math.log(ndoc/ntd)
    tfidf = td*idf
    print("n(t, d)=%d\t ndoc=%d\t ndoc_term=%d\t idf=%f\t tfidf=%f"%(td, ndoc, ntd, idf, tfidf))

print_tfidf(5, 100, 10)
print_tfidf(10, 100, 20)
print_tfidf(20, 100, 40)
print_tfidf(40, 100, 80)
print_tfidf(50, 100, 100)

print("-"*80)

print_tfidf(5, 100, 5)
print_tfidf(5, 100, 10)
print_tfidf(5, 100, 20)
print_tfidf(5, 100, 40)

print("-"*80)

print_tfidf(5, 100, 5)
print_tfidf(10, 100, 10)
print_tfidf(20, 100, 20)
print_tfidf(40, 100, 40)
n(t, d)=5	 ndoc=100	 ndoc_term=10	 idf=2.302585	 tfidf=11.512925
n(t, d)=10	 ndoc=100	 ndoc_term=20	 idf=1.609438	 tfidf=16.094379
n(t, d)=20	 ndoc=100	 ndoc_term=40	 idf=0.916291	 tfidf=18.325815
n(t, d)=40	 ndoc=100	 ndoc_term=80	 idf=0.223144	 tfidf=8.925742
n(t, d)=50	 ndoc=100	 ndoc_term=100	 idf=0.000000	 tfidf=0.000000
--------------------------------------------------------------------------------
n(t, d)=5	 ndoc=100	 ndoc_term=5	 idf=2.995732	 tfidf=14.978661
n(t, d)=5	 ndoc=100	 ndoc_term=10	 idf=2.302585	 tfidf=11.512925
n(t, d)=5	 ndoc=100	 ndoc_term=20	 idf=1.609438	 tfidf=8.047190
n(t, d)=5	 ndoc=100	 ndoc_term=40	 idf=0.916291	 tfidf=4.581454
--------------------------------------------------------------------------------
n(t, d)=5	 ndoc=100	 ndoc_term=5	 idf=2.995732	 tfidf=14.978661
n(t, d)=10	 ndoc=100	 ndoc_term=10	 idf=2.302585	 tfidf=23.025851
n(t, d)=20	 ndoc=100	 ndoc_term=20	 idf=1.609438	 tfidf=32.188758
n(t, d)=40	 ndoc=100	 ndoc_term=40	 idf=0.916291	 tfidf=36.651629

Compute tf and df#

from collections import Counter

term_freq = Counter()
doc_freq = Counter() # Number of docs where a word appeared

tdf = list()
for doc in documents:
    term_doc_freq = Counter()
    doc_terms = set() # to store all terms appearing in the doc
    for w in doc['content']:
        if len(w) > 1:
            term_freq[w] += 1
            term_doc_freq[w] += 1
            doc_terms.add(w)
        
    for w in doc_terms:
        doc_freq[w] += 1
    tdf.append(term_doc_freq)
doc_df['tdf'] = tdf
doc_df
      title                                               content                                             tdf
0     ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...   [Mounting, trade, friction, between, the, U, ....   {'Mounting': 1, 'trade': 13, 'friction': 1, 'b...
1     CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN S...   [survey, of, 19, provinces, and, seven, cities...   {'survey': 1, 'of': 5, '19': 1, 'provinces': 1...
2     JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWN...   [The, Ministry, of, International, Trade, and,...   {'The': 2, 'Ministry': 1, 'of': 8, 'Internatio...
3     THAI TRADE DEFICIT WIDENS IN FIRST QUARTER          [Thailand, ', s, trade, deficit, widened, to, ...   {'Thailand': 2, 'trade': 1, 'deficit': 1, 'wid...
4     INDONESIA SEES CPO PRICE RISING SHARPLY             [Indonesia, expects, crude, palm, oil, (, CPO,...   {'Indonesia': 3, 'expects': 1, 'crude': 1, 'pa...
...   ...                                                 ...                                                 ...
5664  FED SETS TWO BILLION DLR CUSTOMER REPURCHASE        [The, Federal, Reserve, entered, the, U, ., S,...   {'The': 1, 'Federal': 2, 'Reserve': 1, 'entere...
5665  SENATE SEEKS U . S . PROBE OF CANADIAN CORN LEVY    [The, Senate, voted, unanimously, to, seek, an...   {'The': 2, 'Senate': 2, 'voted': 1, 'unanimous...
5666  U . K . MONEY MARKET SHORTAGE FORECAST REVISED...   [The, Bank, of, England, said, it, had, revise...   {'The': 1, 'Bank': 1, 'of': 3, 'England': 1, '...
5667  KNIGHT - RIDDER INC &                               [lt, ;, KRN, >, SETS, QUARTERLY, Qtly, div, 25...   {'lt': 1, 'KRN': 1, 'SETS': 1, 'QUARTERLY': 1,...
5668  NATIONWIDE CELLULAR SERVICE INC &                   [lt, ;, NCEL, >, 4TH, QTR, Shr, loss, six, cts...   {'lt': 1, 'NCEL': 1, '4TH': 1, 'QTR': 1, 'Shr'...

5669 rows × 3 columns

Compute idf#

import math
num_docs = len(documents)
tfidfs = list()
for tdf in doc_df['tdf']:
    tfidf = Counter()
    for w, f in tdf.items():
        tfidf[w] = f * math.log2(num_docs/doc_freq[w])
    tfidfs.append(tfidf)
doc_df['tfidf'] = tfidfs
for title, tfidf in zip(doc_df['title'][:20], doc_df['tfidf'][:20]):
    print(title.lower())
    for k, v in tfidf.most_common(5):
        print("\t%s(%.4f)"%(k, v))
asian exporters fear damage from u . s .- japan rift
	Japan(40.0751)
	trade(35.1953)
	electronics(31.3001)
	tariffs(29.1976)
	businessmen(28.4453)
china daily says vermin eat 7 - 12 pct grain stocks a
	preservation(21.7678)
	waste(18.0189)
	China(15.8372)
	storage(13.5929)
	vermin(12.4689)
japan to revise long - term energy demand downwards
	energy(33.8159)
	MITI(31.0738)
	electric(16.7628)
	demand(15.6265)
	kilolitres(12.4689)
thai trade deficit widens in first quarter
	baht(30.4409)
	Thailand(13.5929)
	billion(13.4837)
	pct(13.3317)
	Janunary(12.4689)
indonesia sees cpo price rising sharply
	Harahap(37.4066)
	CPO(31.4066)
	palm(27.1858)
	Indonesia(18.7783)
	Indonesian(14.1531)
australian foreign ship ban ends but nsw ports hit
	NSW(41.8755)
	ports(28.1705)
	disrupted(17.3230)
	disruption(17.3230)
	shipping(16.8702)
indonesian commodity exchange may expand
	rubber(48.1874)
	exchange(38.7578)
	Nainggolan(37.4066)
	coffee(26.9603)
	trading(22.9718)
sri lanka gets usda approval for wheat price
	Colombo(9.4689)
	Northwest(9.0094)
	Continental(7.3396)
	Department(7.3166)
	Food(6.9770)
western mining to open new gold mine in australia
	WMC(34.4066)
	mine(19.2084)
	Goodall(12.4689)
	Bundey(12.4689)
	WMNG(11.4689)
sumitomo bank aims at quick recovery from merger
	Sumitomo(121.2613)
	Komatsu(112.2199)
	Heiwa(49.8755)
	Sogo(49.8755)
	business(32.1744)
subroto says indonesia supports tin pact extension
	extension(29.8755)
	ITA(28.9846)
	sixth(24.0283)
	Subroto(17.5369)
	pact(10.1877)
bundesbank allocates 6 . 1 billion marks in tender
	Bundesbank(41.8392)
	marks(37.0228)
	liquidity(22.0587)
	steering(19.3230)
	billion(17.9783)
bond corp still considering atlas mining bail - out
	Atlas(97.9552)
	Bond(40.3828)
	Masbate(24.9378)
	copper(19.3395)
	gold(16.2408)
china industrial output rises in first quarter
	quarter(11.7429)
	China(10.5581)
	edition(9.6615)
	first(9.2750)
	industrial(8.7948)
japan ministry says open farm trade would hit u . s .
	Japan(46.7543)
	beef(33.7048)
	products(24.2055)
	Farm(22.2735)
	largest(22.2049)
amatil proposes two - for - five bonus share issue
	Amatil(24.9378)
	AMAA(12.4689)
	BTI(10.8839)
	Shareholders(10.1470)
	rank(9.0094)
bowater 1986 pretax profits rise 15 . 6 mln stg
	vs(35.6104)
	debit(34.6461)
	mln(22.8789)
	0p(18.9378)
	7p(18.0189)
u . k . money market deficit forecast at 250 mln stg
	stg(16.9602)
	bills(16.4066)
	drain(14.1531)
	around(10.0343)
	690(9.2990)
south korea moves to slow growth of trade surplus
	Kim(73.2821)
	surplus(51.4676)
	Korea(35.8622)
	South(32.4077)
	debt(26.3204)
finns and canadians to study mtbe production plant
	MTBE(68.8133)
	Neste(34.4066)
	Oy(32.6517)
	Edmonton(32.6517)
	Celanese(30.4409)

(Testing) tfidf for whole doc#

import math

tfidf = Counter() # tfidf of each word
num_docs = len(documents)
for w in term_freq:
    tfidf[w] = term_freq[w] * math.log2(num_docs/doc_freq[w])

print("Most frequent terms:")
print(term_freq.most_common(50))

print("Most frequent document terms:")
print(doc_freq.most_common(50))

print("Terms with the highest TFIDF:")
print(tfidf.most_common(50))
Most frequent terms:
[('.', 65574), ('the', 47222), (',', 46891), ('to', 26853), ('of', 25821), ('in', 21209), ('and', 18750), ('said', 18485), ('a', 17638), ('for', 9145), ('mln', 9128), ('-', 9003), ('The', 8825), ("'", 8457), ('pct', 7529), ('s', 7246), ('on', 6853), ('from', 6145), ('that', 6118), ('is', 5835), ('"', 5775), ('dlrs', 5654), ('by', 5533), ('1', 5510), ('at', 5409), ('it', 5379), ('was', 4969), ('be', 4906), ('year', 4748), ('with', 4704), ('U', 4604), ('000', 4450), ('S', 4418), ('its', 4337), ('billion', 4320), ('vs', 4315), ('will', 4115), ('would', 3906), ('2', 3738), ('as', 3433), (';', 3400), ('lt', 3349), ('/', 3327), ('has', 3324), ('not', 3285), ('an', 3273), ('he', 3159), (',"', 2990), ('which', 2873), ('3', 2861)]
Most frequent document terms:
[('.', 5518), (',', 5260), ('of', 4774), ('the', 4675), ('said', 4608), ('to', 4505), ('and', 4409), ('in', 4328), ('a', 4186), ('The', 3645), ('for', 3481), ('-', 3132), ("'", 3066), ('s', 2929), ('on', 2869), ('from', 2800), ('it', 2629), ('by', 2603), ('at', 2591), ('is', 2481), ('with', 2407), ('was', 2358), ('mln', 2349), ('its', 2256), ('pct', 2250), ('1', 2237), ('that', 2213), ('be', 2133), ('year', 2019), (';', 2004), ('lt', 1985), ('an', 1953), ('dlrs', 1920), ('will', 1886), ('>', 1865), ('has', 1848), ('which', 1806), ('not', 1787), ('2', 1751), ('U', 1743), ('as', 1737), ('would', 1733), ('"', 1704), ('S', 1680), ('were', 1571), ('5', 1560), ('last', 1549), ('this', 1540), ('3', 1499), ('are', 1466)]
Terms with the highest TFIDF:
[('the', 13133.755890104927), ('vs', 11819.897451299465), ('mln', 11602.122089408393), ('pct', 10037.43141662644), ('"', 10014.825591502971), ('billion', 9708.298988635965), ('to', 8903.573031221764), ('dlrs', 8831.479915724474), ('000', 8735.648149554252), ('/', 8538.756942662225), ('that', 8302.681624636662), ('in', 8258.652467773287), ('U', 7833.805905389986), ('S', 7751.968753725295), ('a', 7717.012287508844), ('-', 7706.658538925057), ("'", 7499.131190762607), ('1', 7391.824860309295), ('tonnes', 7101.385223833243), ('year', 7071.924546100801), ('is', 6956.326735464018), ('be', 6918.491748867262), ('s', 6903.163306261671), ('and', 6799.552353818978), ('on', 6733.389801100871), ('would', 6678.567132859057), ('he', 6604.153901689901), ('will', 6533.651300788917), ('cts', 6434.3737582039785), ('for', 6434.353001850576), ('of', 6400.914940388855), ('oil', 6358.193774547681), ('2', 6335.592962972377), ('was', 6288.42130712229), (',"', 6260.578976192779), ('from', 6253.566431010715), ('by', 6213.110591378933), ('at', 6109.926636935147), ('trade', 5996.730016845508), ('it', 5963.052504216226), ('as', 5858.402536369312), ('with', 5813.476631030548), ('its', 5765.292072561284), ('0', 5695.314602771666), ('The', 5623.073386890246), ('said', 5526.1564156909535), ('3', 5490.527607478552), ('not', 5471.346975245519), ('1986', 5451.2519110030335), ('has', 5375.338499228798)]
from nltk.corpus import stopwords
stopword_list = stopwords.words('english')

term_freq = Counter()
doc_freq = Counter()

for doc in documents:
    doc_terms = set()
    for w in doc['content']:
        if not w.isalpha():
            continue
        w = w.lower()
        if w not in stopword_list:
            term_freq[w] += 1
            doc_terms.add(w)
    for w in doc_terms:
        doc_freq[w] += 1


tfidf = Counter()
num_docs = len(documents)
for w in term_freq:
    if w not in stopword_list and len(w) > 2:
        tfidf[w] = term_freq[w] * math.log2(num_docs/doc_freq[w])

print("Most frequent terms:")
print(term_freq.most_common(50))

print("Most frequent document terms:")
print(doc_freq.most_common(50))

print("Terms with the highest TFIDF:")
print(tfidf.most_common(50))
Most frequent terms:
[('said', 18494), ('mln', 9160), ('pct', 7551), ('dlrs', 5815), ('year', 5071), ('u', 4607), ('billion', 4321), ('vs', 4317), ('would', 3930), ('lt', 3351), ('last', 2770), ('bank', 2731), ('trade', 2699), ('oil', 2382), ('cts', 2322), ('market', 2320), ('tonnes', 2307), ('net', 2190), ('one', 1998), ('company', 1983), ('new', 1974), ('also', 1826), ('two', 1813), ('prices', 1813), ('government', 1644), ('japan', 1503), ('rate', 1492), ('february', 1488), ('dollar', 1484), ('january', 1455), ('week', 1431), ('may', 1403), ('shares', 1399), ('price', 1328), ('foreign', 1318), ('loss', 1304), ('march', 1278), ('told', 1276), ('exchange', 1272), ('could', 1250), ('production', 1241), ('share', 1228), ('stock', 1226), ('group', 1205), ('april', 1169), ('rates', 1168), ('today', 1163), ('month', 1159), ('shr', 1150), ('rose', 1148)]
Most frequent document terms:
[('said', 4609), ('mln', 2352), ('pct', 2253), ('year', 2200), ('lt', 1987), ('dlrs', 1966), ('u', 1744), ('would', 1740), ('last', 1618), ('one', 1306), ('two', 1295), ('also', 1274), ('market', 1219), ('billion', 1194), ('new', 1173), ('company', 1099), ('told', 1053), ('trade', 1018), ('net', 967), ('bank', 954), ('may', 914), ('government', 879), ('today', 877), ('vs', 850), ('prices', 836), ('april', 833), ('cts', 827), ('march', 826), ('three', 809), ('week', 806), ('could', 806), ('added', 781), ('oil', 767), ('month', 764), ('exchange', 757), ('international', 752), ('price', 741), ('total', 737), ('first', 733), ('expected', 733), ('foreign', 714), ('per', 704), ('rate', 701), ('share', 699), ('corp', 696), ('current', 693), ('group', 686), ('inc', 685), ('tonnes', 671), ('five', 671)]
Terms with the highest TFIDF:
[('mln', 11625.928874284848), ('pct', 10052.245761614773), ('billion', 9710.54628006852), ('dlrs', 8884.33695958949), ('tonnes', 7102.583041707593), ('bank', 7021.491390129802), ('year', 6924.910669960947), ('oil', 6873.965563005887), ('would', 6696.74736491751), ('trade', 6686.38575420033), ('cts', 6448.507550453009), ('net', 5587.799178810262), ('said', 5523.057434422488), ('market', 5144.359052222177), ('dollar', 5049.739415228694), ('last', 5010.604972048774), ('prices', 5006.63471161198), ('japan', 4996.303127414279), ('loss', 4847.404181128349), ('company', 4693.568416731597), ('january', 4679.590091592154), ('february', 4669.286973448756), ('shares', 4522.777242325159), ('rate', 4499.287016354734), ('new', 4486.687338931271), ('government', 4420.977722623381), ('stg', 4273.980342230134), ('one', 4231.642865801446), ('production', 4094.001358499999), ('rates', 4074.8471241785546), ('week', 4027.1810536920775), ('foreign', 3939.6315459436505), ('also', 3932.709138594268), ('stock', 3925.7648080906674), ('price', 3898.408834970051), ('two', 3861.9477564681106), ('profit', 3803.0457369254964), ('share', 3708.22832756922), ('exchange', 3694.81536919646), ('may', 3693.8579628621833), ('exports', 3684.1487886651003), ('group', 3671.4106141141697), ('rose', 3633.532280877387), ('shr', 3626.788340297595), ('japanese', 3605.1178054429806), ('sales', 3599.8345492389662), ('wheat', 3568.2748360658793), ('march', 3551.409384307176), ('could', 3517.8031566143236), ('offer', 3493.1652955028644)]

Basic IR Model#

https://en.wikipedia.org/wiki/Tf–idf

Representation#

Each doc is represented as a bag of words: a Counter that maps every lowercased, alphabetic, non-stopword term to its count in the doc. For example:

doc -> `[('u', 18), ('said', 16), ('trade', 15), ('japan', 12), ('dlrs', 6), ('exports', 6), ('imports', 5), ('tariffs', 5), ('japanese', 5), ('billion', 5), ('businessmen', 4), ('might', 4), ('electronics', 4), ('taiwan', 4), ('also', 4), ('last', 4), ('year', 4), ('hong', 4), ('kong', 4), ('products', 3)]`
def represent(terms):
    counter = Counter()
    for term in terms:
        term = term.lower()
        if term in stopword_list:
            continue
        if not term.isalpha():
            continue
        counter[term] += 1
    return counter
print(represent(documents[0]['content']).most_common(20))
[('u', 18), ('said', 16), ('trade', 15), ('japan', 12), ('dlrs', 6), ('exports', 6), ('imports', 5), ('tariffs', 5), ('japanese', 5), ('billion', 5), ('businessmen', 4), ('might', 4), ('electronics', 4), ('taiwan', 4), ('also', 4), ('last', 4), ('year', 4), ('hong', 4), ('kong', 4), ('products', 3)]
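def tfidf(doc_dict, query_dict):
    # Score a doc as the sum of tf * idf over the query terms it contains.
    # num_docs and doc_freq are the module-level values computed earlier;
    # they are only read here, so no `global` statement is needed.
    score = 0
    for term in query_dict:
        if term in doc_dict:
            score += doc_dict[term] * math.log2(num_docs / doc_freq[term])
    return score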


def match_docs(docs, query_dict):
    results = []
    for doc in docs:
        doc_dict = represent(doc['content'])
        matched = True
        for term in query_dict:
            if term not in doc_dict:
                matched = False
                break
        if matched:
            results.append(doc)
    return results


def ranked_search(docs, query, score_func):
    query_terms = query.split(" ")
    query_dict = represent(query_terms)
    results = []
    matched_docs = match_docs(docs, query_dict)
    for doc in matched_docs:
        doc_dict = represent(doc['content'])
        score = score_func(doc_dict, query_dict)
        results.append((doc['title'], score))
    
    ranked = sorted(results, key=lambda x: x[1], reverse=True)
    return ranked

    
query = "Japan beef"
results = search(documents, query)
print("Number of matched results: %d" % len(results))
print("\n".join(results[0:10]))

print(
'''
---------------------------------------------------------------
ranked_search
---------------------------------------------------------------
''')
results = ranked_search(documents, query, tfidf)
print("\nNumber of matched results: %d" % len(results))
for title, score in results:
    print("%s \t %f" % (title.lower(), score))
                           
Number of matched results: 21
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT
JAPAN MINISTRY SAYS OPEN FARM TRADE WOULD HIT U . S .
JAPAN ' S LDP URGES MORE IMPORTS OF 12 FARM ITEMS
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S .
LYNG OPENS JAPAN TALKS ON FARM TRADE BARRIERS U . S .
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S .
U . S . MEAT INDUSTRY LAUNCHES CAMPAIGN IN JAPAN
GATT CASE AGAINST JAPAN A MODEL FOR U . S . - LYNG
U . S . TAKES TOUGH STAND ON GATT FARM ISSUES
EC AGREES TRADE DEAL WITH ARGENTINA

---------------------------------------------------------------
ranked_search
---------------------------------------------------------------


Number of matched results: 21
japan beef price support cut will not raise demand 	 168.859461
u . s . meat industry launches campaign in japan 	 145.589919
lyng opens japan talks on farm trade barriers u . s . 	 96.102045
japan ministry says open farm trade would hit u . s . 	 86.204470
japan seen reducing beef , pork intervention prices 	 79.368314
more pressure urged for asia to take u . s . beef 	 59.573164
asian exporters fear damage from u . s .- japan rift 	 46.501541
economic spotlight - u . s . congress raps japan 	 46.501541
gatt case against japan a model for u . s . - lyng 	 46.351369
australian beef output seen declining in 1987 	 42.989606
ec agrees trade deal with argentina 	 39.740471
u . s . urges japan to open farm market further u . s . 	 33.204660
u . s . urges japan to open farm market further u . s . 	 33.204660
oecd farm subsidies study results detailed 	 33.204660
u . s . to ask for share of japan ' s rice market u . s . 	 33.129574
japan ' s lipc to buy beef on april 23 	 33.092031
japan ' s lipc to buy beef on april 23 	 33.092031
u . s . house panel extends eep , urges ussr offer 	 16.546015
japan sets 1987 / 88 first half beef import quota 	 13.259338
japan ' s ldp urges more imports of 12 farm items 	 9.935118
u . s . takes tough stand on gatt farm issues 	 9.935118

Sorting#

small_list = [5, 6, 3, 4, 6, 7, 9]
print(sorted(small_list))
print(sorted(small_list, reverse=True))
print(sorted(small_list, key=lambda x: -x))

small_strings = ['beef', 'chiken', 'pork', 'mutton', 'hot']
print(sorted(small_strings))
print(sorted(small_strings, key=lambda x: len(x)))


small_tuples = [('beef', 4), ('chiken', 2), ('pork', 4), ('mutton', 3), ('hot', 7)]
print(sorted(small_tuples))
print(sorted(small_tuples, key=lambda x: x[0]))
print(sorted(small_tuples, key=lambda x: x[1]))
print(sorted(small_tuples, key=lambda x: len(x[0])))


small_strings = ['社會系', '社會工作系', '經濟系', '政治系', '國家發展研究所']
print(sorted(small_strings))
print(sorted(small_strings, key=lambda x: len(x)))
[3, 4, 5, 6, 6, 7, 9]
[9, 7, 6, 6, 5, 4, 3]
[9, 7, 6, 6, 5, 4, 3]
['beef', 'chiken', 'hot', 'mutton', 'pork']
['hot', 'beef', 'pork', 'chiken', 'mutton']
[('beef', 4), ('chiken', 2), ('hot', 7), ('mutton', 3), ('pork', 4)]
[('beef', 4), ('chiken', 2), ('hot', 7), ('mutton', 3), ('pork', 4)]
[('chiken', 2), ('mutton', 3), ('beef', 4), ('pork', 4), ('hot', 7)]
[('hot', 7), ('beef', 4), ('pork', 4), ('chiken', 2), ('mutton', 3)]
['國家發展研究所', '政治系', '社會工作系', '社會系', '經濟系']
['社會系', '經濟系', '政治系', '社會工作系', '國家發展研究所']
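
Several documents in the ranked results above tie on score. A tuple key is one way to make the order of ties explicit: sort by score in descending order, then by title in ascending order. The scored list below is made-up data for illustration.

```python
# Hypothetical (title, score) pairs with a tie between "pork" and "beef".
scored = [("pork", 4.0), ("beef", 4.0), ("hot", 7.0), ("mutton", 3.0)]

# Negate the score for descending order; the title breaks ties ascending.
ranked = sorted(scored, key=lambda x: (-x[1], x[0]))
print(ranked)
```

Without the second key element, sorted() is stable and would keep tied items in their input order ("pork" before "beef" here).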

BM25#

https://en.wikipedia.org/wiki/Okapi_BM25

avgdl = 0
for doc in documents:
    doc_dict = represent(doc['content'])
    avgdl += sum(doc_dict.values())
avgdl /= num_docs
    

def idf(term):
    global num_docs
    global doc_freq
    return math.log2((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
                     
                     
def bm25(doc_dict, query_dict):
    # Reads the module-level avgdl (and num_docs, doc_freq through idf()).
    k1 = 1.2   # term-frequency saturation
    b = 0.75   # strength of document-length normalization
    doc_len = sum(doc_dict.values())
    score = 0
    for term in query_dict:
        if term in doc_dict:
            # This variant plugs in the relative frequency (count / doc_len);
            # the textbook BM25 formula uses the raw count here.
            rel_freq = doc_dict[term] / doc_len
            tf = (rel_freq * (k1 + 1)) / (
                rel_freq + k1 * (1 - b + b * (doc_len / avgdl)))
            score += idf(term) * tf
    return score
                                                            
                
results = ranked_search(documents, query, bm25)
print("\nNumber of matched results: %d" % len(results))
for title, score in results:
    print("%s \t %f" % (title.lower(), score))
Number of matched results: 21
japan ' s lipc to buy beef on april 23 	 1.676211
japan ' s lipc to buy beef on april 23 	 1.676211
japan sets 1987 / 88 first half beef import quota 	 1.548682
more pressure urged for asia to take u . s . beef 	 0.618041
japan seen reducing beef , pork intervention prices 	 0.523364
australian beef output seen declining in 1987 	 0.444577
u . s . meat industry launches campaign in japan 	 0.430094
gatt case against japan a model for u . s . - lyng 	 0.426406
japan ' s ldp urges more imports of 12 farm items 	 0.424614
u . s . to ask for share of japan ' s rice market u . s . 	 0.337159
japan beef price support cut will not raise demand 	 0.315739
lyng opens japan talks on farm trade barriers u . s . 	 0.241400
u . s . urges japan to open farm market further u . s . 	 0.233559
u . s . urges japan to open farm market further u . s . 	 0.233559
japan ministry says open farm trade would hit u . s . 	 0.215694
ec agrees trade deal with argentina 	 0.160007
economic spotlight - u . s . congress raps japan 	 0.091814
oecd farm subsidies study results detailed 	 0.052830
asian exporters fear damage from u . s .- japan rift 	 0.050941
u . s . house panel extends eep , urges ussr offer 	 0.030293
u . s . takes tough stand on gatt farm issues 	 0.027802
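
The bm25() above divides each term count by the document length before applying the saturation formula, whereas the usual textbook formulation uses the raw count. A minimal sketch of that textbook form on a made-up corpus follows. The names toy_docs, toy_df, avg_len and bm25_standard are illustrative, and the +1 inside the logarithm is a common variant that keeps idf non-negative; none of this is the notebook's own method.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus, for illustration only.
toy_docs = [
    ["japan", "beef", "quota", "beef"],
    ["japan", "trade", "surplus", "trade", "talks", "tariffs"],
    ["beef", "output", "australia"],
]

n_docs = len(toy_docs)
avg_len = sum(len(d) for d in toy_docs) / n_docs
toy_df = Counter()
for d in toy_docs:
    toy_df.update(set(d))


def bm25_standard(doc_terms, query_terms, k1=1.2, b=0.75):
    counts = Counter(doc_terms)
    # Length normalization: longer-than-average docs are penalized.
    norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
    score = 0.0
    for term in query_terms:
        f = counts[term]  # raw count, not divided by document length
        if f == 0:
            continue
        idf = math.log((n_docs - toy_df[term] + 0.5) / (toy_df[term] + 0.5) + 1)
        score += idf * f * (k1 + 1) / (f + norm)
    return score


scores = [bm25_standard(d, ["japan", "beef"]) for d in toy_docs]
print(scores)
```

The first document contains both query terms and scores highest; of the two single-term matches, the shorter document ranks above the longer one.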

Inverted Indexing#

Build inverted index#

inverted_index = {}

for i in range(len(documents)):
    doc_dict = represent(documents[i]['content'])
    for term in doc_dict:
        if term not in inverted_index:
            inverted_index[term] = []
        inverted_index[term].append(i)
        
    
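def match_indexed_docs(docs, inverted_index, query_dict):
    # Count, for each doc id, how many distinct query terms it contains.
    count = Counter()
    for term in query_dict:
        # .get() avoids a KeyError when a query term is not in the index.
        for doc_id in inverted_index.get(term, []):
            count[doc_id] += 1
    results = []
    for doc_id in count:
        # Keep only docs that contain every query term (AND semantics).
        if count[doc_id] == len(query_dict):
            results.append(docs[doc_id])
    return results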
            
    
def ranked_indexed_search(docs, inverted_index, query, score_func):
    """Score the documents matching every query term and rank them by score."""
    query_terms = query.split(" ")
    query_dict = represent(query_terms)
    results = []
    # Use the index to narrow the candidates before scoring.
    matched = match_indexed_docs(docs, inverted_index, query_dict)
    for doc in matched:
        doc_dict = represent(doc['content'])
        score = score_func(doc_dict, query_dict)
        if score > 0:
            results.append((doc['title'], score))

    # Highest score first.
    return sorted(results, key=lambda x: x[1], reverse=True)


results = ranked_indexed_search(documents, inverted_index, query, bm25)
print("\nNumber of matched results: %d" % len(results))
for title, score in results:
    print("%s %f" % (title, score))
Number of matched results: 21
JAPAN ' S LIPC TO BUY BEEF ON APRIL 23 1.676211
JAPAN ' S LIPC TO BUY BEEF ON APRIL 23 1.676211
JAPAN SETS 1987 / 88 FIRST HALF BEEF IMPORT QUOTA 1.548682
MORE PRESSURE URGED FOR ASIA TO TAKE U . S . BEEF 0.618041
JAPAN SEEN REDUCING BEEF , PORK INTERVENTION PRICES 0.523364
AUSTRALIAN BEEF OUTPUT SEEN DECLINING IN 1987 0.444577
U . S . MEAT INDUSTRY LAUNCHES CAMPAIGN IN JAPAN 0.430094
GATT CASE AGAINST JAPAN A MODEL FOR U . S . - LYNG 0.426406
JAPAN ' S LDP URGES MORE IMPORTS OF 12 FARM ITEMS 0.424614
U . S . TO ASK FOR SHARE OF JAPAN ' S RICE MARKET U . S . 0.337159
JAPAN BEEF PRICE SUPPORT CUT WILL NOT RAISE DEMAND 0.315739
LYNG OPENS JAPAN TALKS ON FARM TRADE BARRIERS U . S . 0.241400
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S . 0.233559
U . S . URGES JAPAN TO OPEN FARM MARKET FURTHER U . S . 0.233559
JAPAN MINISTRY SAYS OPEN FARM TRADE WOULD HIT U . S . 0.215694
EC AGREES TRADE DEAL WITH ARGENTINA 0.160007
ECONOMIC SPOTLIGHT - U . S . CONGRESS RAPS JAPAN 0.091814
OECD FARM SUBSIDIES STUDY RESULTS DETAILED 0.052830
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT 0.050941
U . S . HOUSE PANEL EXTENDS EEP , URGES USSR OFFER 0.030293
U . S . TAKES TOUGH STAND ON GATT FARM ISSUES 0.027802
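The index-based search returns the same 21 titles and scores as the linear scan, because the index only changes how candidate documents are found, not how they are scored. To make the lookup step easier to see, here is a minimal sketch on a toy corpus. The data and the names `toy_docs`, `toy_index`, and `match_all` are made up for illustration and are not part of the notebook; the sketch also stores postings as sets and intersects them, which is an alternative to the `Counter`-based matching used above.

```python
from collections import defaultdict

# Toy corpus (hypothetical data, not the documents used above).
toy_docs = [
    ["japan", "beef", "import"],
    ["taiwan", "dollar", "reserves"],
    ["japan", "taiwan", "trade"],
]

# Map each term to the set of document ids that contain it.
toy_index = defaultdict(set)
for doc_id, tokens in enumerate(toy_docs):
    for term in tokens:
        toy_index[term].add(doc_id)


def match_all(index, query_terms):
    """Return the sorted ids of documents containing every query term."""
    # .get returns an empty set for unknown terms without adding them to the index.
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


print(match_all(toy_index, ["japan", "taiwan"]))  # [2]
print(match_all(toy_index, ["japan"]))            # [0, 2]
print(match_all(toy_index, ["korea", "japan"]))   # []
```

A query term that never occurs in the corpus simply yields an empty result, since its posting set is empty.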
query = "Taiwan Japan"
results = ranked_indexed_search(documents, inverted_index, query, bm25)
print("\nNumber of matched results: %d" % len(results))
for title, score in results:
    print("%s %f" % (title, score))
Number of matched results: 41
RISING TAIWAN DOLLAR CAUSES FOREIGN RESERVES LOSS 0.557418
TAIWAN STEEL FIRM SEES LOWER EXPORTS , MORE OUTPUT 0.534024
THAI ZINC EXPORTS FALL IN MARCH 0.447403
LONDON GRAIN FREIGHTS 27 , 000 0.411234
TAIWAN PLANS MISSION TO CLOSE TRADE GAP WITH U . S . 0.361747
TAIWAN FOREIGN EXCHANGE RESERVES HIT NEW HIGH 0.358275
TAIWAN SEES SHARP DECLINE IN SHIPBREAKING 0.304945
YEUTTER PUTS CURRENCY BURDEN ON TAIWAN , KOREA 0.289958
MULFORD SAYS GERMANY , JAPAN SHOULD DO MORE 0.248875
BALDRIGE CONCERNED ABOUT KOREAN / TAIWAN DEFICITS 0.226503
SOUTH AFRICA CORN EXPORTS COULD BE REDUCED - USDA 0.210748
LONDON FREIGHT MARKET FEATURES GRAIN OUT OF U . S . 0.210735
TAIWAN ANNOUNCES NEW ROUND OF IMPORT TARIFF CUTS 0.165164
TAIWAN ANNOUNCES NEW ROUND OF IMPORT TARIFF CUTS 0.165164
BALDRIGE WARNS OF WORLD TRADE WAR DANGER U . S . 0.155060
BALDRIGE WARNS OF WORLD TRADE WAR DANGER U . S . 0.152501
TAIWAN CURBS INFLOWS OF FOREIGN EXCHANGE 0.141761
TAIWAN COMPLAINS ABOUT SIZE OF RESERVES 0.139054
REPORT EXPECTS SHARP DROP IN WORLD IRON IMPORTS 0.138398
USDA COMMENTS ON EXPORT SALES REPORT U . S . 0.130159
U . S . HOUSE PANEL APPROVES TRADE BILL 0.122454
TAIWAN SHIPBUILDER LOOKS FOR JAPANESE VENTURES 0.099780
TAIWAN DOLLAR AND RESERVES SEEN RISING MORE SLOWLY 0.098053
USDA COMMENTS ON EXPORT SALES REPORT 0.095963
AUSTRALIAN BEEF OUTPUT SEEN DECLINING IN 1987 0.095316
ECONOMIC SPOTLIGHT - U . S . DEFICIT WITH 0.093116
ECONOMIC SPOTLIGHT - U . S . CONGRESS RAPS JAPAN 0.090142
USDA COMMENTS ON EXPORT SALES 0.089580
PHILIPPINE PLANNING CHIEF URGES PESO DEVALUATION 0.078124
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT 0.069852
LATIN , CARIBBEAN NATIONS OPPOSE TRADE BILLS A 0.052019
ECONOMIC SPOTLIGHT - JAPAN SHIPBUILDERS RECOVERY 0.047666
U . S . HOUSE PANEL APPROVES TRADE BILL 0.044894
JAPANESE TARIFFS SEEN AS WORLDWIDE WARNING 0.040512
U . S . TRADE DEFICIT 38 . 37 BILLION DLRS IN 4TH QTR 0.039176
EXPORT BUSINESS - GRAINS / OILSEEDS COMPLEX 0.025505
U . S . HOUSE PANEL TAKES FIRST TRADE BILL VOTES 0.019227
DOLLAR VALUE APPROPRIATE , BUNDESBANK OFFICIAL SAYS 0.018462
NET CHANGE IN EXPORT COMMITMENTS -- USDA 0.011454
TRADE INTERESTS READY FOR BATTLE IN U . S . HOUSE U . S . 0.010810
TRADE INTERESTS READY FOR FIGHT IN U . S . CONGRESS U . S . 0.008743
results = ranked_indexed_search(documents, inverted_index, query, tfidf)
print("\nNumber of matched results: %d" % len(results))
for title, score in results:
    print("%s %f" % (title, score))
Number of matched results: 41
ECONOMIC SPOTLIGHT - U . S . DEFICIT WITH 121.213109
TAIWAN SEES SHARP DECLINE IN SHIPBREAKING 64.755111
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT 63.133312
ECONOMIC SPOTLIGHT - U . S . CONGRESS RAPS JAPAN 45.701311
TAIWAN CURBS INFLOWS OF FOREIGN EXCHANGE 43.998890
NET CHANGE IN EXPORT COMMITMENTS -- USDA 39.863770
TAIWAN ANNOUNCES NEW ROUND OF IMPORT TARIFF CUTS 38.188223
TAIWAN ANNOUNCES NEW ROUND OF IMPORT TARIFF CUTS 38.188223
RISING TAIWAN DOLLAR CAUSES FOREIGN RESERVES LOSS 35.701776
TAIWAN COMPLAINS ABOUT SIZE OF RESERVES 32.377556
TAIWAN FOREIGN EXCHANGE RESERVES HIT NEW HIGH 32.377556
TAIWAN DOLLAR AND RESERVES SEEN RISING MORE SLOWLY 32.377556
JAPANESE TARIFFS SEEN AS WORLDWIDE WARNING 31.566656
ECONOMIC SPOTLIGHT - JAPAN SHIPBUILDERS RECOVERY 25.755989
USDA COMMENTS ON EXPORT SALES REPORT U . S . 24.918215
TAIWAN PLANS MISSION TO CLOSE TRADE GAP WITH U . S . 24.080442
YEUTTER PUTS CURRENCY BURDEN ON TAIWAN , KOREA 24.080442
TAIWAN STEEL FIRM SEES LOWER EXPORTS , MORE OUTPUT 20.756222
TAIWAN SHIPBUILDER LOOKS FOR JAPANESE VENTURES 20.756222
USDA COMMENTS ON EXPORT SALES 15.783328
LONDON FREIGHT MARKET FEATURES GRAIN OUT OF U . S . 15.783328
USDA COMMENTS ON EXPORT SALES REPORT 15.783328
U . S . TRADE DEFICIT 38 . 37 BILLION DLRS IN 4TH QTR 15.783328
BALDRIGE CONCERNED ABOUT KOREAN / TAIWAN DEFICITS 14.945554
LATIN , CARIBBEAN NATIONS OPPOSE TRADE BILLS A 12.459108
MULFORD SAYS GERMANY , JAPAN SHOULD DO MORE 12.459108
THAI ZINC EXPORTS FALL IN MARCH 9.134887
TRADE INTERESTS READY FOR FIGHT IN U . S . CONGRESS U . S . 9.134887
TRADE INTERESTS READY FOR BATTLE IN U . S . HOUSE U . S . 9.134887
PHILIPPINE PLANNING CHIEF URGES PESO DEVALUATION 9.134887
LONDON GRAIN FREIGHTS 27 , 000 9.134887
BALDRIGE WARNS OF WORLD TRADE WAR DANGER U . S . 9.134887
REPORT EXPECTS SHARP DROP IN WORLD IRON IMPORTS 9.134887
EXPORT BUSINESS - GRAINS / OILSEEDS COMPLEX 9.134887
U . S . HOUSE PANEL TAKES FIRST TRADE BILL VOTES 9.134887
SOUTH AFRICA CORN EXPORTS COULD BE REDUCED - USDA 9.134887
DOLLAR VALUE APPROPRIATE , BUNDESBANK OFFICIAL SAYS 9.134887
U . S . HOUSE PANEL APPROVES TRADE BILL 9.134887
U . S . HOUSE PANEL APPROVES TRADE BILL 9.134887
AUSTRALIAN BEEF OUTPUT SEEN DECLINING IN 1987 9.134887
BALDRIGE WARNS OF WORLD TRADE WAR DANGER U . S . 9.134887
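Both scoring functions match the same 41 documents, but the score ranges in the two listings differ widely (below 1 for `bm25`, roughly 9 to 121 for `tfidf`), so the raw values are not directly comparable. One simple way to compare the two rankings is to look at which titles both place near the top. The helper below is a sketch only; `top_k_overlap` and the mini result lists are hypothetical and use the same `(title, score)` format as `ranked_indexed_search`.

```python
def top_k_overlap(ranked_a, ranked_b, k=10):
    """Return the titles that appear in the top k of both ranked lists."""
    top_a = {title for title, _ in ranked_a[:k]}
    top_b = {title for title, _ in ranked_b[:k]}
    return top_a & top_b


# Hypothetical mini-results, not taken from the output above.
bm25_like = [("A", 0.9), ("B", 0.5), ("C", 0.1)]
tfidf_like = [("C", 120.0), ("A", 60.0), ("B", 9.0)]

print(sorted(top_k_overlap(bm25_like, tfidf_like, k=2)))  # ['A']
```

With the real results, this could be called on the two lists returned by `ranked_indexed_search` for `bm25` and `tfidf`.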

By JILUNG

© Copyright 2022.