Why more data often beats better algorithms


Technology trends and news by Jeremy Liew
June 19, 2008 | last edited July 10, 2008


As web-based product development and game development both become more iterative, better datamining and analysis become more and more important. But the data generated by user behavior can be almost overwhelming. How should a startup think about getting the most insight and value from its data?

Anand Rajaraman is a co-founder of Kosmix, a Lightspeed portfolio company, and also teaches a datamining class at Stanford. He knows a thing or two about the subject, and he suggests that more data usually beats better algorithms:

Different student teams in my class adopted different approaches to the [Netflix challenge] problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?

Team B got much better results, close to the best results on the Netflix leaderboard!! I’m really happy for them, and they’re going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I’m often suprised [sic] that many people in the business, and even in academia, don’t realize this.

Another fine illustration of this principle comes from Google. Most people think Google’s success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, that really took search to the next level in 1998, were:

1. The recognition that hyperlinks were an important measure of popularity — a link to a webpage counts as a vote for it.
2. The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.

First generation search engines had used only the text of the web pages themselves. The addition of these two additional data sets — hyperlinks and anchortext — took Google’s search to the next level. The PageRank algorithm itself is a minor detail — any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.
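
To make the two extra data sets concrete, here is a minimal Python sketch that uses nothing but a list of links. It is not PageRank and not Google's method: it simply counts distinct linking pages as votes and indexes pages by the anchortext that points at them. The link data and the names `LINKS`, `inbound_votes`, `anchor_index` and `search` are all hypothetical, made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy link graph: (source_page, target_page, anchor_text).
LINKS = [
    ("a.example", "c.example", "cheap flights"),
    ("b.example", "c.example", "flight deals"),
    ("c.example", "d.example", "hotel reviews"),
    ("a.example", "d.example", "hotels"),
    ("b.example", "c.example", "cheap flights"),
]

def inbound_votes(links):
    """Count distinct linking pages per target: each linking page is one vote."""
    voters = defaultdict(set)
    for source, target, _ in links:
        voters[target].add(source)
    return {target: len(sources) for target, sources in voters.items()}

def anchor_index(links):
    """Map each anchortext word to the set of pages that links describe with it."""
    index = defaultdict(set)
    for _, target, anchor in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index

def search(query, links):
    """Return pages whose inbound anchortext has every query word, most votes first."""
    words = query.lower().split()
    if not words:
        return []
    index = anchor_index(links)
    votes = inbound_votes(links)
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(candidates, key=lambda page: (-votes.get(page, 0), page))

print(search("cheap flights", LINKS))
```

Note that the search never reads the text of the target pages at all. Everything it knows comes from the links, which is the point of the passage above: the signal is in the added data, and the ranking rule here is about as simple as one can get.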

In a followup post, he notes that:

1. More data is usually better than more complex algorithms because complex algorithms don’t scale as well (computationally) and
2. More independent data is better than more of the same data, but if data was originally sparse, then more of the same data can help a lot too.
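
As a rough illustration of what adding independent data to a simple algorithm can look like, in the spirit of Team B's approach, the Python sketch below keeps the same trivial averaging rule and only changes which data it can see. The ratings, the genre table and the function names are hypothetical toy values, not Netflix or IMDB data, and this is not the students' actual method.

```python
from statistics import mean

# Hypothetical toy data, not drawn from Netflix or IMDB.
RATINGS = [  # (user, movie, stars)
    ("ann", "Alien", 5), ("ann", "Heat", 2), ("ann", "Aliens", 5),
    ("bob", "Alien", 2), ("bob", "Heat", 5), ("bob", "Ronin", 4),
]
GENRES = {  # the extra, independent data set: movie -> genre
    "Alien": "sci-fi", "Aliens": "sci-fi", "Blade Runner": "sci-fi",
    "Heat": "crime", "Ronin": "crime",
}

def predict_ratings_only(user, ratings):
    """Baseline: the user's mean rating across everything they have rated."""
    return mean(stars for u, _, stars in ratings if u == user)

def predict_with_genres(user, movie, ratings, genres):
    """Same averaging, restricted to movies that share the target movie's genre."""
    target = genres.get(movie)
    same_genre = [
        stars for u, m, stars in ratings
        if u == user and target is not None and genres.get(m) == target
    ]
    # Fall back to the baseline when the genre data gives no evidence.
    if not same_genre:
        return predict_ratings_only(user, ratings)
    return mean(same_genre)

for user in ("ann", "bob"):
    base = predict_ratings_only(user, RATINGS)
    with_genre = predict_with_genres(user, "Blade Runner", RATINGS, GENRES)
    print(f"{user}: ratings only {base:.2f}, with genres {with_genre:.2f}")
```

With ratings alone, the averaging rule cannot tell an unseen sci-fi film from an unseen crime film. Once the genre table is joined in, the same rule separates the two users' tastes without getting any more sophisticated.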

Mayank Bawa of Aster Data chimes in to say that running simple analysis over complete datasets is better than running more complex analysis over sampled datasets, for two reasons:

1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.

In other words, since human behavior is complex and some behavior crossmatches are rare, using only a sample of the data will cause some important but rare correlations to be lost in the noise.
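
The effect of sampling on rare crossmatches can be seen in a small simulation. The Python sketch below builds synthetic sessions (all behavior names and sizes are assumptions chosen for illustration), plants one rare pair of behaviors in 20 of 10,000 sessions, and then counts that pair in the full data and in a 1% sample using the same simple co-occurrence count.

```python
import random
from collections import Counter
from itertools import combinations

def pair_counts(sessions):
    """Count how many sessions contain each unordered pair of behaviors."""
    counts = Counter()
    for behaviors in sessions:
        for pair in combinations(sorted(set(behaviors)), 2):
            counts[pair] += 1
    return counts

def make_sessions(n_sessions=10_000, n_rare=20, seed=0):
    """Synthetic data: common behaviors everywhere, one rare pair in n_rare sessions."""
    rng = random.Random(seed)
    common = ["browse", "search", "login", "share"]
    sessions = [rng.sample(common, 2) for _ in range(n_sessions)]
    for i in rng.sample(range(n_sessions), n_rare):
        sessions[i] = sessions[i] + ["gift", "refund"]
    return sessions

sessions = make_sessions()
rare = ("gift", "refund")

# Simple analysis over the complete data set.
full = pair_counts(sessions)[rare]

# The same analysis over a 1% random sample.
sample = random.Random(1).sample(sessions, len(sessions) // 100)
sampled = pair_counts(sample)[rare]

print(f"full data: rare pair seen in {full} sessions")
print(f"1% sample: rare pair seen in {sampled} sessions (about {full / 100:.1f} expected)")
```

A 1% sample is expected to contain about 0.2 of the 20 rare sessions, so in most runs the pair is missing from the sample entirely, while the full pass finds all 20.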

He also points out that Google takes a similar approach to datamining.

This is good stuff.

To read more from Jeremy, visit his blog

© 2008 Vator, Inc.