Explaining the application of the TF-IDF algorithm in Shanghai dragon optimization

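IDF: inverse document frequency, an index of how rare a term is across all indexed pages. The three terms appear on different numbers of pages, so their IDF values differ.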

Assume the following retrieval counts: "Shanghai dragon" is found on 20 million pages, "website optimization" on 10 million pages, and "skills" on 500 million pages.

TF(Shanghai dragon) = 8/400 = 0.02

Assume the search engine's index contains 10 billion pages in total.

Getting straight to the point: how is TF-IDF actually calculated?


IDF is the inverse document frequency. If a word appears on N pages and the total number of documents is M, then IDF = lg(M/N). Assume "site optimization" appears in 2,000 documents and the total number of documents is 100 million; then IDF = lg(100000000/2000) = 4.69897, and the final TF-IDF = 0.02 * 4.69897 = 0.0939794.
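The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch of the formulas as stated in this article (TF = occurrences / total words, IDF = lg(M/N) with a base-10 logarithm); the function names are illustrative and not taken from any library.

```python
import math


def term_frequency(occurrences, total_words):
    """TF: how often the term occurs, relative to the page length."""
    return occurrences / total_words


def inverse_document_frequency(total_documents, documents_with_term):
    """IDF = lg(M / N), using a base-10 logarithm as in the text."""
    return math.log10(total_documents / documents_with_term)


def tf_idf(occurrences, total_words, total_documents, documents_with_term):
    """TF-IDF is simply the product of the two values."""
    return (term_frequency(occurrences, total_words)
            * inverse_document_frequency(total_documents, documents_with_term))


# "site optimization": 4 occurrences in a 200-word article,
# found in 2,000 of 100 million documents.
tf = term_frequency(4, 200)
idf = inverse_document_frequency(100_000_000, 2_000)
score = tf_idf(4, 200, 100_000_000, 2_000)
print(round(tf, 2), round(idf, 5), round(score, 7))
```

Real search engines may use other logarithm bases or smoothing terms, so the absolute numbers matter less than the ranking they produce.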

Take a page on the website www.ruihess贵族宝贝 with 400 words in total: "Shanghai dragon" appears 8 times, "website optimization" appears 10 times, and "skills" appears 16 times.


TF(skills) = 16/400 = 0.04

We illustrate this with a worked example.

TF stands for term frequency: the number of times a word appears on a page, relative to the page's total word count. If an article has 200 words in total and the term "site optimization" appears 4 times, then TF = 4/200 = 0.02.
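As a rough illustration of how this count could be taken from real text, the sketch below tokenizes a page by whitespace and counts occurrences of a phrase. Whitespace tokenization is an assumption made for brevity (it does not work for languages written without spaces), and the sample page is made up to match the numbers above.

```python
def count_phrase(words, phrase_words):
    """Count how many times the word sequence `phrase_words` occurs in `words`."""
    n = len(phrase_words)
    return sum(
        1 for i in range(len(words) - n + 1)
        if words[i:i + n] == phrase_words
    )


def term_frequency(text, phrase):
    """TF = occurrences of the phrase / total number of words on the page."""
    words = text.lower().split()
    if not words:
        return 0.0
    return count_phrase(words, phrase.lower().split()) / len(words)


# Hypothetical 200-word page: the phrase occurs 4 times (8 words),
# padded with 192 filler words.
page = "site optimization " * 4 + "filler " * 192
print(term_frequency(page, "site optimization"))
```

Note that the phrase is counted as a single term while the page length is counted in individual words, which matches the way the article computes 4/200.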




When a search engine calculates weight, it first segments the query into words and scores each word separately. For example, the query "Shanghai dragon website optimization skills" is split into "Shanghai dragon", "website optimization", and "skills".
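To make the segmentation step concrete, here is a toy sketch that uses greedy longest-match against a small vocabulary. The vocabulary, the `max_len` parameter, and the matching strategy are all assumptions for illustration; production search engines use far more sophisticated segmenters.

```python
def segment(query, vocabulary, max_len=3):
    """Split a query into terms, preferring the longest known phrase."""
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            # A single token is always accepted as a fallback.
            if n == 1 or candidate in vocabulary:
                terms.append(candidate)
                i += n
                break
    return terms


# Hypothetical vocabulary of known multi-word terms.
VOCABULARY = {"shanghai dragon", "website optimization", "skills"}

terms = segment("Shanghai dragon website optimization skills", VOCABULARY)
print(terms)
```

Each term returned by `segment` would then get its own TF and IDF values.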

This is how the relevance of a page is judged. In Shanghai dragon website optimization, it is not enough to look at the TF-IDF score alone; the page also needs words with high discriminating power. For example, suppose a search engine has indexed one trillion pages. Practically every one of those pages contains the most common function words. These high-frequency words are called noise words or stop words, and the search engine removes them, so the weight they add is 0. Formula: IDF = lg(1 trillion / 1 trillion) = lg 1 = 0, so TF-IDF = TF * 0 = 0.
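The stop-word case can be checked numerically. In the sketch below the index size of one trillion comes from the paragraph above, while the word "the" and the page count for "website optimization" are hypothetical values chosen only to contrast a stop word with an informative term.

```python
import math


def idf(total_pages, pages_with_term):
    """IDF = lg(M / N) with a base-10 logarithm."""
    return math.log10(total_pages / pages_with_term)


TOTAL_PAGES = 10**12  # one trillion indexed pages

# Hypothetical page counts: "the" occurs on every page.
page_counts = {
    "the": 10**12,
    "website optimization": 10**7,
}

weights = {term: idf(TOTAL_PAGES, n) for term, n in page_counts.items()}

# Terms whose IDF is 0 contribute nothing, whatever their TF is.
informative_terms = [term for term, weight in weights.items() if weight > 0]
print(weights)
print(informative_terms)
```

A term that appears on every page gets lg(1) = 0, which is why removing stop words does not change the final score.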


TF-IDF is a weighting technique commonly used in information retrieval and data mining. It is often applied by Shanghai dragon practitioners (Shanghai dragon ER), although many of them may not know it. The most intuitive way to understand it is as "website keyword density".

TF(website optimization) = 10/400 = 0.025
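Putting the example together, the sketch below computes TF, IDF, and TF-IDF for the three terms, using the page counts (400 words; 8, 10, and 16 occurrences), the retrieval counts (20 million, 10 million, 500 million), and the assumed index size of 10 billion pages from this article. Adding the three scores into one page score is an assumption made here for illustration; the article does not say how the per-term weights are combined.

```python
import math

PAGE_WORDS = 400                # total words on the example page
INDEX_SIZE = 10_000_000_000     # assumed total number of indexed pages

# term -> (occurrences on the page, pages in the index containing the term)
TERMS = {
    "Shanghai dragon": (8, 20_000_000),
    "website optimization": (10, 10_000_000),
    "skills": (16, 500_000_000),
}


def tf_idf(occurrences, total_words, total_pages, pages_with_term):
    """Return (TF, IDF, TF-IDF) using IDF = lg(M / N)."""
    tf = occurrences / total_words
    idf = math.log10(total_pages / pages_with_term)
    return tf, idf, tf * idf


scores = {}
for term, (occurrences, pages_with_term) in TERMS.items():
    tf, idf, weight = tf_idf(occurrences, PAGE_WORDS, INDEX_SIZE, pages_with_term)
    scores[term] = weight
    print(f"{term}: TF={tf:.3f} IDF={idf:.5f} TF-IDF={weight:.7f}")

# Assumption: the page score for the whole query is the sum of the term scores.
page_score = sum(scores.values())
print(f"page score: {page_score:.7f}")
```

In this example "skills" occurs most often on the page but is also the most common term in the index, so its low IDF pulls its weight below that of "website optimization".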