Quote from http://forums.searchenginewatch.com/...hp?threadid=48 original post by Orion

Hello. I'll introduce myself as Orion. I'm a formal scientist, with a special interest in AI applied to IR technology. Let's start this thread with a brief description of keyword semantic connectivity and what it can do to improve success across search engines. My goal is that SEO/SEM R&D departments once and for all start using scientific tools rather than mere rumors and second-guessing disguised as "SEO expert tips". Sorry in advance for my 2 cents. I'd like to present the key concepts; then anyone interested can start commenting.

According to Fuzzy Set Theory ("Modern Information Retrieval"; Baeza-Yates; Ribeiro-Neto, Addison-Wesley, 1999), the degree of term co-occurrence in a database is a measure of semantic connectivity (SM) and can be used to build a thesaurus for the database. Some engines use term co-occurrence in their query expansion algorithms. Understanding how to measure term co-occurrence makes it possible to carefully select keywords that are semantically connected in a given search engine database. As an added benefit, SM makes excessive repetition of keywords (keyword spamming) unnecessary.

Let us start with the simple case of two keywords (k1 and k2). Later on we can expand to other cases (more than two keywords, keyword transposition, entropy relevance, etc.).

Let n1 and n2 be the number of search results containing k1 and k2, respectively, and let n12 be the number of search results containing both terms. (One actually does a search for k1, then for k2, and finally a composite query consisting of k1 and k2.) Using geometric arguments and fuzzy sets, it can be demonstrated that there exists an index, termed the correlation index, c, such that

c = n12/(n1 + n2 - n12)
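
A short derivation sketch may help here. Writing D_1 and D_2 (names assumed for this sketch, not used in the original post) for the sets of results containing k1 and k2, the index is the size of the overlap divided by the size of the union, a form also known as the Jaccard coefficient:

```latex
% D_1, D_2 are assumed names for the result sets of k1 and k2
\[
n_1 = |D_1|, \qquad n_2 = |D_2|, \qquad n_{12} = |D_1 \cap D_2|
\]
\[
c = \frac{n_{12}}{n_1 + n_2 - n_{12}} = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}
\]
\[
0 \le n_{12} \le \min(n_1, n_2) \;\Longrightarrow\; 0 \le c \le 1
\]
```

Engines usually report estimated counts, so real n1, n2 and n12 values may not satisfy these set identities exactly.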

Thus c ranges between 0 and 1. Term correlation increases as c approaches 1. For a given search engine or IR database, this allows us to:

a. test which pair of keywords from a pool of keywords has the highest semantic connectivity (for that database).
b. build a thesaurus of synonyms targeting that database.
c. build a query expansion or find-similar library.
d. carefully craft titles and descriptions of web pages.
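
Item (a) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `c_index` and `rank_pairs` are hypothetical, the keywords and counts are made up, and the result counts are assumed to have been collected by hand beforehand.

```python
from itertools import combinations


def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    union = n1 + n2 - n12
    if union <= 0:
        return 0.0  # no results at all; treat the pair as unrelated
    return n12 / union


def rank_pairs(single_counts, pair_counts):
    """Return (c, k1, k2) tuples sorted from most to least connected."""
    ranked = []
    for k1, k2 in combinations(sorted(single_counts), 2):
        # A missing composite count is treated as zero co-occurrence.
        n12 = pair_counts.get(frozenset((k1, k2)), 0)
        c = c_index(single_counts[k1], single_counts[k2], n12)
        ranked.append((c, k1, k2))
    return sorted(ranked, reverse=True)


# Made-up counts for illustration only; not taken from any search engine.
single_counts = {"alpha": 1000, "beta": 400, "gamma": 250}
pair_counts = {
    frozenset(("alpha", "beta")): 200,
    frozenset(("alpha", "gamma")): 50,
    frozenset(("beta", "gamma")): 130,
}

for c, k1, k2 in rank_pairs(single_counts, pair_counts):
    print(f"{k1} + {k2}: c = {c:.4f}")
```

Keying the composite counts by `frozenset` means the order of the two keywords does not matter, which matches the unquoted composite queries used in this post.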


Having said that, let's do some simple calculations.

Case 1: Single terms (synonyms, similar terms)

By querying Google for car, auto and automobile, we obtain

k1=car = 224,000,000
k2=auto = 124,000,000
k12=car auto = 13,000,000
c=0.0388 or 38.8 ppt

k1=car = 224,000,000
k2=automobile = 50,400,000
k12=car automobile = 10,500,000
c=0.0398 or 39.8 ppt

Results: Thus in Google, k1=car and k2=automobile seem to have a greater synonymy association (semantic connectivity) than k1=car and k2=auto. Note: the large number of results for k2=auto is not surprising: (a) auto is a word in other languages (e.g. Spanish); (b) auto is a root of automobile, automatic and derivative terms.
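
The Case 1 arithmetic can be re-checked with a few lines of Python. This is a minimal sketch: the helper name `c_index` is an assumption, the counts are the ones reported above, and "ppt" is read as parts per thousand (c times 1000), which is consistent with the figures given.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


# Google result counts as reported in this post.
car = 224_000_000
auto = 124_000_000
automobile = 50_400_000
car_auto = 13_000_000
car_automobile = 10_500_000

c_car_auto = c_index(car, auto, car_auto)
c_car_automobile = c_index(car, automobile, car_automobile)

for label, c in [("car / auto", c_car_auto),
                 ("car / automobile", c_car_automobile)]:
    # Show the raw index and the same value in parts per thousand.
    print(f"{label}: c = {c:.4f} ({c * 1000:.1f} ppt)")
```

The two values are close, so the ranking of the pairs is sensitive to small changes in the reported counts.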


Case 2: Single terms (query refinement with similar concepts)

k1=car = 224,000,000
k2=insurance = 111,000,000
k12=car insurance = 9,000,000
c = 0.0276 or 27.6 ppt

k1=auto = 124,000,000
k2=insurance = 111,000,000
k12=auto insurance = 8,660,000
c = 0.0383 or 38.3 ppt

Results: In Google, auto insurance has a greater c-index than car insurance, and thus a greater semantic association (semantic connectivity).

A Final Note.

If we double-quote the k12 queries, the c-indices will change, since quoted k12 results are a subset of unquoted k12 results. For example, in the above cases we obtain:

“car insurance” = 4,810,000 with c = 0.0146 or 14.6 ppt
“auto insurance” = 4,460,000 with c = 0.0193 or 19.3 ppt

Yet in Google the results still indicate that "auto insurance" appears to be more connected than "car insurance".
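
As a rough check of Case 2 and of the quoted variant, the same formula can be applied to both the unquoted and the quoted composite counts. The sketch below assumes a hypothetical helper `c_index` and uses only the counts reported in this post.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


# Single-term counts as reported in this post.
singles = {"car": 224_000_000, "auto": 124_000_000, "insurance": 111_000_000}

# (unquoted, quoted) composite counts for "<term> insurance".
composites = {
    "car": (9_000_000, 4_810_000),
    "auto": (8_660_000, 4_460_000),
}

results = {}
for term, (unquoted, quoted) in composites.items():
    # A phrase match is also an all-terms match, so quoted <= unquoted.
    assert quoted <= unquoted
    n1, n2 = singles[term], singles["insurance"]
    results[term] = (c_index(n1, n2, unquoted), c_index(n1, n2, quoted))

for term, (c_unquoted, c_quoted) in results.items():
    print(f"{term} insurance: unquoted c = {c_unquoted:.4f}, "
          f"quoted c = {c_quoted:.4f}")
```

Because n1 and n2 stay the same and only n12 shrinks, quoting the phrase can only lower c for a given pair; whether the ordering between two pairs survives depends on the data.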


About Language, Geolocation and Demographic Characteristics

In Mexico and Puerto Rico, the word for car can be auto, which is also a stem of other terms and derivatives. The popular term for car, however, is not auto but coche in Mexico and carro in Puerto Rico. Thus geolocation and demographic data interpretations are better confirmed with c-indices extracted from regional directories.

For a review of c-indices, read Baeza-Yates and Ribeiro-Neto's "Modern Information Retrieval" (1999, Addison-Wesley, Chapters 2 and 5). c-index analyzers are excellent analytical tools for doing semantic connectivity analysis and for targeting keywords. They are also easy to build. I have written several applications
Have you seen this?