Thursday, May 20, 2010

Google Predict

New service from Google that looks like their attempt to get a foothold in the Predictive Analytics market and perhaps bring Predictive Analytics to the masses. Process is:

  1. Load your data
  2. Build your model (compile your data)
  3. Make predictions

This is worth playing with in order to get a sense of the basic kinds of things Predictive Analytics has to offer.
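The three steps above follow the familiar load, train, predict shape. The sketch below walks through that shape in plain Python with a toy word-count classifier. It is not Google's API; the data, the function names, and the scoring method are all made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def load_data() -> List[Tuple[str, str]]:
    """Step 1: labelled examples as (label, text). The data is invented."""
    return [
        ("english", "the cat sat on the mat"),
        ("english", "the dog ate the bone"),
        ("spanish", "el gato come el pescado"),
        ("spanish", "el perro come la carne"),
    ]


def build_model(rows: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Step 2: count how often each word appears under each label."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for label, text in rows:
        model[label].update(text.split())
    return dict(model)


def predict(model: Dict[str, Counter], text: str) -> str:
    """Step 3: pick the label whose word counts overlap the text the most."""
    words = text.split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


model = build_model(load_data())
print(predict(model, "el gato"))
print(predict(model, "the dog"))
```

A hosted service would presumably use something far more capable than word counts, but the workflow a user sees is the same three calls.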

Wednesday, May 05, 2010

GATE (General Architecture for Text Engineering) announces some very interesting new tools to be released this summer

GATE (General Architecture for Text Engineering) announces some very interesting new tools to be released this summer - http://gate.ac.uk/family/coming-soon/.

Of particular note is the 'GATE Mimir multiparadigm indexing', a search engine that indexes GATE's document annotations (e.g. entities, facts, parts of speech etc.) and enables structured search on these annotations such as:

  • {Determiner}{Adjective}{Noun}
  • {Person}, CEO of {Organization}, based in {Location}
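To give a feel for what a query like {Determiner}{Adjective}{Noun} is asking for, the sketch below matches a sequence of annotation types against a small hand-annotated token list. This is plain Python, not Mimir's query engine or API; the token list and the function name find_sequences are invented for illustration.

```python
from typing import List, Tuple

# Each token is (text, annotation type). The annotations are hand-written
# here; in GATE they would come from a processing pipeline.
Token = Tuple[str, str]


def find_sequences(tokens: List[Token], pattern: List[str]) -> List[str]:
    """Return the text of every run of tokens whose types match pattern."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok[1] == wanted for tok, wanted in zip(window, pattern)):
            matches.append(" ".join(tok[0] for tok in window))
    return matches


tokens = [
    ("the", "Determiner"), ("quick", "Adjective"), ("fox", "Noun"),
    ("saw", "Verb"),
    ("a", "Determiner"), ("lazy", "Adjective"), ("dog", "Noun"),
]
print(find_sequences(tokens, ["Determiner", "Adjective", "Noun"]))
```

A real index would answer this without scanning every token, but the matching idea is the same: the query constrains annotation types rather than literal words.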

Mimir can also store and utilize ontologies allowing it to augment queries via class types such as broadening a search or translating units of measurement etc. Seems like the query transformation via ontology possibilities could prove quite useful, especially within a restricted domain.

Mimir's Java API might also make it a convenient mechanism for on-the-fly knowledge extraction (e.g. querying an indexed collection of content for specific sets of entities or facts).

Monday, April 12, 2010

R Statistics and the Internet

Very nice article on how R is currently being used by some of the bigger internet players: "How Google and Facebook are using R". They use it mainly for finding patterns in user behavior via log file analysis.

While I have limited experience with R, I have had the opportunity to generate a few performance testing histograms, extract key phrases from documents, and play a bit with the TM package to generate term-document matrices. R seems like a very nice entrée into statistical data processing for a non-statistics person and I plan to make more use of it in the near future, hopefully exploiting some of its text mining related packages.
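For readers unfamiliar with the idea, a term-document matrix simply counts how often each term occurs in each document. The sketch below builds one in plain Python rather than with the TM package; the two sample documents and the function name are invented for illustration, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import Counter
from typing import Dict, List


def term_document_matrix(docs: List[str]) -> Dict[str, List[int]]:
    """Map each term to its per-document counts (one entry per document)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in terms}


docs = ["R is free software", "R is used for text mining"]
matrix = term_document_matrix(docs)
for term, row in matrix.items():
    print(term, row)
```

A real text mining package would also handle stop words, stemming, and sparse storage, all of which this sketch leaves out.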

On a related note, I have been playing with utilizing Carrot2 to track industry news topics. So far more tuning is needed to get acceptable results but the plan is to generate a regular feed of hot topics in the news from a variety of industries (using cluster labels with some human intervention). I mention this because I am wondering if I can do this in R and how the results might compare to Carrot2. Should be an interesting experiment.

A little background blurb on the above:


R -- http://www.r-project.org/ "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS."


TM package -- http://cran.r-project.org/web/packages/tm/index.html "tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R."

Wednesday, March 03, 2010

Large Scale Content Classification With Little Time and Few Resources

What would you do if faced with the daunting task of building multiple industry taxonomies for large scale automatic classification given very little time and even fewer resources? This is exactly the dilemma zibb.com came face-to-face with early in our inception. Of course, being an enthusiastic bunch, a co-worker and I said yes, of course we can do this, and set off to tackle this problem with full abandon and a prayer (an ecumenical prayer of course with a few expletive deletives thrown in for good measure).

Luckily, we had a few tricks up our sleeves plus the luxury of being able to focus almost exclusively on precision (accuracy) without too much regard for recall (getting it all) – this was largely thanks to our huge index of over 2 billion documents.

Our first trick was that we already had Teragram’s rules-based classifier available to us and the second trick was that our content was already organized into industries (sectorized). These two factors worked to greatly reduce the difficulty of our task.

Utilizing a rules-based classifier like Teragram eliminates the need to collect training documents for each topic, which can take an exceptional amount of time and, if not done carefully, lead to inaccurate results.

Secondly, sectorizing content significantly reduces the variety of meanings that words tend to have. For example, the term MMIC in the Teragram rule below (taken from our Electronics industry taxonomy) can mean all sorts of things outside of the Electronics industry context.

(OR,"mmics","mmic","monolithic microwave")

MMIC = Medical Marijuana Identification Card
MMIC = Medical Mutual Insurance Company
MMIC = Mobile Medical International Corporation
MMIC = Motorcycle & Moped Industry Council

However, within the electronics industry it means “Monolithic Microwave Integrated Circuits”. The bottom line is that having sectorized content allowed us to write very simple rules without having to worry about language ambiguity, which significantly reduced the effort required to write each rule.
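To make the idea of a simple OR rule concrete, here is a minimal sketch of how such a rule could be evaluated against a piece of text that is already known to come from the Electronics sector. This is plain Python and not Teragram's engine or rule syntax; the function name and the sample sentences are invented, and whole-word, case-insensitive matching is an assumption.

```python
import re
from typing import Iterable


def or_rule_matches(text: str, phrases: Iterable[str]) -> bool:
    """True if any phrase occurs in text as whole words, ignoring case."""
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


# The three phrases from the rule shown above.
MMIC_PHRASES = ["mmics", "mmic", "monolithic microwave"]

print(or_rule_matches("New GaN MMIC amplifier announced", MMIC_PHRASES))
print(or_rule_matches("Quarterly results beat expectations", MMIC_PHRASES))
```

Because the text is already sectorized, the rule does not need extra conditions to exclude the insurance or motorcycle senses of the acronym.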

The final piece of this puzzle involved implementing a ranking algorithm which allowed us to set a taxonomy-wide aboutness threshold. For this we worked closely with the Teragram team to implement a ranking algorithm which looks at density of term hits and hit position along with a number of other factors. This allowed us to set a ranking cutoff score under which we drop classifications. This ranking algorithm is called "zone ranking" and is now part of the standard Teragram classification tool (by the way, we really enjoyed working with the Teragram folks as they were eager to collaborate and really worked as a team with us).
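The sketch below shows the general shape of a score-plus-cutoff scheme using only the two factors named above, hit density and hit position. It is a toy, not the zone ranking algorithm: the formula, the equal weighting, the 0.5 cutoff, and the function names are all assumptions made for illustration, and it only handles single-word terms.

```python
from typing import List


def aboutness_score(words: List[str], terms: List[str]) -> float:
    """Toy score in [0, 1]: mean of hit density and earliness of first hit."""
    wanted = {t.lower() for t in terms}
    hits = [i for i, w in enumerate(words) if w.lower() in wanted]
    if not hits:
        return 0.0
    density = len(hits) / len(words)          # share of words that are hits
    earliness = 1.0 - hits[0] / len(words)    # 1.0 when the first word hits
    return (density + earliness) / 2


def keep_classification(words: List[str], terms: List[str],
                        cutoff: float = 0.5) -> bool:
    """Drop the classification when the score falls under the cutoff."""
    return aboutness_score(words, terms) >= cutoff


focused = "mmic amplifier uses mmic".split()
passing = "a long article about many other things then one mmic".split()
print(aboutness_score(focused, ["mmic", "mmics"]))
print(keep_classification(passing, ["mmic", "mmics"]))
```

A single taxonomy-wide cutoff like this trades recall for precision, which matches the priorities described earlier in the post.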

In the end, this combination of sectorized content + simple rules + ranking cutoff allowed us to build out almost forty industry taxonomies in English plus twenty of those in Dutch and do it on time and in budget. Some of the results of this work can be seen on http://www.zibb.com/ - e.g. MMIC

Incidentally, this formula allowed us to run automatic content classification for large entity files without having to write a single rule (auto-generated the rules from names). These entities include companies, airplanes, airports, chefs, Semiconductor equipment and more.
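As a rough sketch of what auto-generating rules from names could look like, the snippet below turns a list of name variants into a rule string shaped like the MMIC example earlier in the post. The output format simply mirrors that one example; whether it covers everything the classifier's rule syntax requires is an assumption, the entity names are placeholders, and names containing quote characters are not handled.

```python
from typing import Dict, List


def rule_from_names(names: List[str]) -> str:
    """Build an OR rule string from a list of name variants."""
    quoted = ",".join('"{}"'.format(name.lower()) for name in names)
    return "(OR," + quoted + ")"


# Placeholder entity file: entity -> known name variants.
entities: Dict[str, List[str]] = {
    "Example Airport": ["Example Airport", "Example Intl"],
    "Example Corp": ["Example Corp", "Example Corporation"],
}

for entity, variants in entities.items():
    print(entity, "->", rule_from_names(variants))
```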

So if you are ever faced with the task of implementing classification for a large number of topics and documents in a short period of time and with few resources, you may want to consider a rules-based classifier combined with some sort of overall content segmentation (like sectorization).

It should be noted that a great disadvantage of a rules-based classifier is that it is language dependent, meaning that the rules must be rewritten for each language. As such, if you need to classify content in a large variety of languages, a language-independent statistical approach might prove to be more scalable.

Commitments...

Well, looking back at some of our older blog posts is quite entertaining! Over and over we committed to posting frequently, talking about what we do, and we even seem to imply that we'd be fun and interesting while doing it! Well, I'll admit that was too much to sign up for; we've had too much work to do on behalf of our clients and our company and we just couldn't "spare a square" to sit down and jot out compelling posts. And to think, I once deleted one of our team member's posts about the top ten urinals because I thought it wasn't professional enough to post; you can actually find great suppliers and research about such fixtures should you need one for your project.

Another admission that I should formally make is to say, don't worry, our egos are in check, we know that no one is intentionally reading this blog to find the next scintillating tale of incredible technological accomplishment or to find a demonstration of superior business acumen.

But that's OK. We are a team of fewer than a dozen people (see, I can't even commit to our real team size!) who work incredibly hard every day and accomplish minor miracles for our organization. We're not curing cancer, we're not addressing global poverty, but we are passionate people contributing to an information age society that needs help finding things, getting things, and making things.

So every once in a while, we might come over here and post about something that we are interested in, or some product we've built; heck, we might even talk about something you did that was cooler than our current scope allows, without expecting anything in return. This post will cleanse our palate of our previous over-commitments and let us just post when we want to, work hard and make great products without having to be the smartest guys or gals in the room. Besides, whatever we post after this little ditty can only seem more brilliant for it.

Monday, November 02, 2009

B2B Search Terrain: Supporting all the Participants of NaNoWriMo 2009

Each November an amazing event (http://www.nanowrimo.org) rolls around that challenges any aspiring writer, casual writer, or insanely crazy person to write a 50,000 word novel in 30 days. In an effort to show my support to these participants during this grueling month I will be blogging "frequently" on topics related to B2B Search. While I know I won't keep pace (1,700 words per day) with the true writers in this world, I would like to show my allegiance by contributing to this blog several times a week.

The B2B Search Terrain will cover topics that relate to business-to-business search. At Zibb we have been doing search for over four years. But it's not just about search; so much of what we do is because we have a search environment that allows us this amazing opportunity to look at the innards of content and create powerful products for our B2B customers. Some of the topics that will be up for blogging are:


* Lead Generation
* Directory Search
* B2B Search
* Net Neutrality
* Website Registration
* B2B Editorial Content
* Cloud Computing in Search
* Other Search Engines
* B2B Industry News
* Building Sites on Search


For a few of these topics I will be asking my colleagues to weigh in with their thoughts and opinions. Many of these topics will be observations and live examples of how we at Zibb handle a problem and come up with solid solutions for our B2B customers.

Best of luck to all those NaNoWriMo novelists or as my boss would say, "bash on"!

Friday, October 09, 2009

Not Invented Here! Zibb invites OneRiot over for dinner and drinks

We do a lot of great search things here at Zibb. We don't have to be humble about that since 1) no one reads our blog and 2) we really do have a great team building great opportunities for our consumers!

But one thing we do particularly well is to recognize great work when we see it. We are going to start highlighting those great accomplishments of others and providing unasked for feedback and sparkling commentary. When it makes sense and we can work it out we'll also team up with those great achievers to bring their ideas and services into play for our own users and communities.

The first such team up is with our new friends over at OneRiot (http://www.oneriot.com). They launched about a year ago and pretty quickly caught our attention. They focus on "buzzed about" stories using their own PulseRank algorithm and this produces a unique perspective on any topic. Here at Zibb, using results from our own engine, we can tell you all about Google and Google Wave; oh boy, we've got news coverage going back to the first announcement of Google's latest gizmo; but what we don't tell you is what the world "thinks" of Google Wave as a whole. We believe that when you add OneRiot's social perspective to our own high-quality, semantically enhanced results, you get something synergistic; better together to benefit our B2B users who need to keep up on the companies, products and trends in their market.

It becomes massively interesting to combine the full coverage of a topic with the bits that drive the crowd wild. OneRiot+Zibb is just that. Starting today you will get OneRiot results on the same page with Zibb.com results for any keyword you search for; try http://www.zibb.com/all/search/all?q=apple+tablet. Oh, and we've added lots of OneRiot-y spice to our hot topic pages. Take a look at http://www.zibb.com/all/hot-topic/sidewiki. You say you didn't know we had hot topics? We periodically create these specialty pages around trending topics and now we are including the social Pulse of the universe by adding OneRiot results to those pages as well. Here's one we did for our buds at OneRiot: http://www.zibb.com/all/hot-topic/oneriot.

What now? Well, we're going to keep pushing OneRiot into more places on Zibb.com; we'll talk to the 150+ Zibb On Demand customers who use our search-based products on their own websites and let them in on the action; and as always we'll add a few new and innovative spins in just to keep it interesting. So welcome to Zibb OneRiot! Who's next? We've already integrated a few other search friends, so stay tuned as we continue to build and partner up to create the best B2B vertical search experience. And feel free to talk to us if you think you have something to bring to the table.


About OneRiot
OneRiot, a realtime search engine, helps users find the news, blogs and videos that people are buzzing about. OneRiot ranks its search results using PulseRank, a realtime ranking algorithm that sorts web content according to its current social significance. By indexing pages shared by millions of Digg, Twitter, and wider social web users - including the contributions of OneRiot’s own three million-strong panel - OneRiot delivers fresh, hyper-relevant search results that answer the question: what is happening right now? For more information: www.oneriot.com.

About Zibb.com
Zibb, a search engine, helps its B2B audience find high-quality relevant content within their industry to best address their needs for researching topics, companies and products. Zibb combines company white page information with relevant news, blogs, whitepapers and company websites all organized and annotated by industry. When you search for chips, you might want silicon, you might want potato crisps, or you might want an old TV series about the California Highway Patrol. Zibb knows the difference and will help you cut to the most relevant results. For more information: www.zibb.com.

About Reed Business
A member of the Reed Elsevier plc group, Reed Business is the largest business-to-business publisher in the world. We have a portfolio of more than 200 market leading publications, newsletters, directories and reference books, electronic products, online services, industry conferences and awards, covering over 25 markets in the USA, UK, and Asia. Publishing many of the leading names in business publishing, including Variety, EDN, Computer Weekly, Design News, RCD, totaljobs.com, Estates Gazette, New Scientist, Flight International, Kellysearch, The Bankers' Almanac, Mardev and DM2, Reed Business has annual sales of over £2 billion. Our market leadership stems from a focus on product quality, innovation, editorial excellence, staff training and development, and the customer. This ensures that we continue to offer the most creative marketing and effective business solutions. For more information: www.reedbusiness.com