Hi all,
ah...the old LSI issue has resurfaced again I see!!
First off, LSI is a statistical technique used to address the problems of synonymy (different words, same meaning) and polysemy (one word, several meanings). The idea is to retrieve documents based on the concepts in the document rather than on exact keyword matches.
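For anyone who wants to see the concept idea in action, here is a rough sketch in Python with NumPy. The toy documents, the choice of k=2 and the function names (`cosine`, `lsi_scores`) are all made up for illustration; it shows the textbook truncated-SVD recipe, not how any search engine actually does it.

```python
import numpy as np

# Hypothetical toy corpus: two "vehicle" documents and two "flower" documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
docs = ["car engine", "automobile engine", "flower petal", "petal flower"]

# Term-document count matrix: one row per term, one column per document.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lsi_scores(A, query_vec, k):
    """Score every document against the query in a k-dimensional concept space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T   # keep the k largest concepts
    q_hat = (U_k.T @ query_vec) / s_k              # fold the query into concept space
    return [cosine(q_hat, d) for d in V_k]         # each row of V_k is one document

# Query containing only the word "car".
query = np.array([1.0 if t == "car" else 0.0 for t in terms])

raw = [cosine(query, A[:, j]) for j in range(A.shape[1])]  # plain term matching
lsi = lsi_scores(A, query, k=2)                            # concept matching

print("raw:", np.round(raw, 3))
print("lsi:", np.round(lsi, 3))
```

On this toy data, plain term matching gives "automobile engine" a score of 0 for the query "car", while the two-dimensional LSI space scores it close to 1, because both words co-occur with "engine". That is the synonymy effect in miniature.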
All well and nice. So why is no one raving about it if it's the solution to end all our problems??
Because:
The singular value decomposition (SVD) technique is used, which is unstable and impractical for large, highly dynamic collections (the web). The reason is that it needs to be rerun every time something changes.
There is no consensus as to how many concept dimensions to use. Researchers like Deerwester and Dumais just proceeded by trial and error, which is not possible with a huge corpus of web data.
Past a certain number of dimensions, retrieval performance steadily decreases.
With LSI, you can't really use complicated keywords, because these will be filtered out at the SVD stage. They cannot lead to generalization, so they get discarded.
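To make the dimension and rerun points above a bit more concrete, here is a small sketch in Python with NumPy. The matrix, the extra document and the helper names (`rank_k_approx`, `sweep`) are hypothetical. Note that it measures reconstruction error, which is only a crude proxy: it always shrinks as k grows, whereas retrieval quality does not, so picking k still comes down to trial and error.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows: terms, columns: documents).
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

def rank_k_approx(A, k):
    """Best rank-k approximation of A, built from the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def sweep(A):
    """Try every possible k and record the Frobenius reconstruction error."""
    errors = {}
    for k in range(1, min(A.shape) + 1):
        errors[k] = float(np.linalg.norm(A - rank_k_approx(A, k)))
    return errors

errors = sweep(A)
for k, err in errors.items():
    print(f"k={k}: reconstruction error {err:.3f}")

# One new document ("car flower") changes the matrix, and with it the whole
# factorization. In this sketch the SVD is simply recomputed from scratch.
new_doc = np.array([[1], [0], [0], [1], [0]], dtype=float)
A2 = np.hstack([A, new_doc])
s_before = np.linalg.svd(A, compute_uv=False)
s_after = np.linalg.svd(A2, compute_uv=False)
print("singular values before:", np.round(s_before, 3))
print("singular values after: ", np.round(s_after, 3))
```

Even on a 5x4 matrix, one extra document shifts the singular values, so the old concept space no longer matches the data. On a corpus the size of the web, repeating that full decomposition on every change is the impractical part.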
The SVD method has been used by Excite for a long time, but not LSI itself.
Additionally, the patent is owned by Bellcore:
Telcordia Technologies (Bellcore) patent: Computer information retrieval using latent semantic structure (U.S. Patent No. 4,839,853, June 13, 1989). Check with them before initiating any commercial product development based on LSI.
hehe...even the Patent Cafe uses it: iamcafe.com/
Saying that Google uses LSI is like saying that a bike has a bolt on it.
You're talking about some of the brightest minds in computing. Do you think that they would use a technique shown to have real drawbacks for use on the web? I have read some solid research out of some of these institutions when I've had to implement some hard techniques, and I can tell you it would be lovely if it read like LSI! In a digital library environment it has worked OK, well enough in fact. So the core idea is retained and a different method is built on it, which in turn is just another cog in the machine.
A lot of SEO people believe it is used, which is fine. It can't hurt you or your site. Just don't treat it as the holy grail and the most important thing that Google does.
For me LSI is a rough and ready plug if I have to quickly get some estimates on a corpus.
Not a pretty site but has working examples for you to try: lsa.colorado.edu/
BTW, LSA and LSI are exactly the same thing; LSI is just the name used when the technique is applied to retrieval.
Happy working
