Monday, April 12, 2010

R Statistics and the Internet

Very nice article on how R is currently being used by some of the bigger internet players: "How Google and Facebook are using R". The main use is finding patterns in user behavior via log file analysis.

While I have limited experience with R, I have had the opportunity to generate a few performance testing histograms, extract key phrases from documents, and play a bit with the tm package to generate term-document matrices. R seems like a very nice entrée into statistical data processing for a non-statistics person, and I plan to make more use of it in the near future, hopefully exploiting some of its text-mining packages.

On a related note, I have been experimenting with Carrot2 to track industry news topics. So far, more tuning is needed to get acceptable results, but the plan is to generate a regular feed of hot topics in the news from a variety of industries (using cluster labels with some human intervention). I mention this because I am wondering whether I can do the same thing in R and how the results might compare to Carrot2. Should be an interesting experiment.

A little background blurb on the above:


R -- http://www.r-project.org/ "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS."


tm package -- http://cran.r-project.org/web/packages/tm/index.html "tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R."

1 comment:

Brien said...

Get your geek on Darin!