Very nice article on how R is currently being used by some of the bigger internet players: "How Google and Facebook are using R". The main use is finding patterns in user behavior via log file analysis.
While I have limited experience with R, I have had the opportunity to generate a few performance testing histograms, extract key phrases from documents, and play a bit with the tm package to generate term-document matrices. R seems like a very nice entrée into statistical data processing for a non-statistician, and I plan to make more use of it in the near future, hopefully exploiting some of its text mining packages.
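For anyone curious what a term-document matrix looks like in practice, here is a minimal sketch using tm. The sample documents are made up for illustration, and the cleaning steps (lowercasing, removing punctuation and English stop words) are just one reasonable choice, not necessarily what I used in my own experiments.

```r
# Requires the tm package: install.packages("tm")
library(tm)

# Hypothetical sample documents (invented for this example)
docs <- c(
  "R is used for statistical computing and graphics.",
  "Text mining in R builds on the tm package.",
  "Clustering news topics is a text mining task."
)

# Build a corpus and apply some common cleaning steps
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Rows are terms, columns are documents, cells are term counts
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
print(m)

# Terms ranked by total frequency across all documents
head(sort(rowSums(m), decreasing = TRUE), 5)
```

Note that with default settings `TermDocumentMatrix` drops terms shorter than three characters, so very short tokens such as "r" or "tm" will not show up as rows.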
On a related note, I have been experimenting with Carrot2 to track industry news topics. So far, more tuning is needed to get acceptable results, but the plan is to generate a regular feed of hot topics in the news from a variety of industries (using cluster labels with some human intervention). I mention this because I am wondering whether I can do the same thing in R and how the results might compare to Carrot2. Should be an interesting experiment.
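As a rough idea of what the R side of that experiment might look like, here is a sketch in base R that groups a few headlines with hierarchical clustering on cosine distance and labels each cluster with its most frequent terms. To be clear, this is not the algorithm Carrot2 uses, the headlines are invented, and the choice of two clusters is an assumption made only to keep the example small.

```r
# Hypothetical headlines standing in for a real news feed
headlines <- c(
  "Oil prices rise as supply tightens",
  "Crude oil supply cut lifts prices",
  "New smartphone launch boosts chip demand",
  "Chip makers see smartphone demand surge"
)

# Tokenize: lowercase, split on non-letters, drop empty strings
tokens <- lapply(strsplit(tolower(headlines), "[^a-z]+"),
                 function(x) x[nzchar(x)])
vocab <- sort(unique(unlist(tokens)))

# Document-by-term count matrix (one row per headline)
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Cosine similarity between headlines, converted to a distance
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
hc <- hclust(as.dist(1 - cos_sim), method = "average")

# Cut the tree into two clusters (k = 2 is an assumption for this toy data)
clusters <- cutree(hc, k = 2)

# Label each cluster with its three most frequent terms
labels <- lapply(split(seq_along(headlines), clusters), function(idx) {
  counts <- colSums(dtm[idx, , drop = FALSE])
  names(head(sort(counts, decreasing = TRUE), 3))
})
print(clusters)
print(labels)
```

On real feeds the labels would need stop-word removal and the human review mentioned above before they are usable as topic names.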
A little background blurb on the above:
R -- http://www.r-project.org/ "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS."
tm package -- http://cran.r-project.org/web/packages/tm/index.html "tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R."
Monday, April 12, 2010
R Statistics and the Internet
Posted by Darin at Monday, April 12, 2010
Labels: clustering, eCarrot, R, Statistics, text mining
1 comment:
Get your geek on Darin!