Tuesday, August 12, 2008

Machine learning - and Apache Mahout

Isabel Drost recently contributed some enhancements to the Guided Editor (to allow nested facts, very handy) - quite a clever patch.

As if that isn't enough, she is also a contributor to the Apache Mahout project:
In the project's own words: "Mahout's goal is to build scalable, Apache licensed machine learning libraries." The project site is here.

Interestingly, my #1 book to read on the toilet at the moment is:

This book talks about (amongst many things) using machine learning to "learn" rules - the benefit of learning rules as opposed to some opaque representation is that a human has a fighting chance of understanding the rules, and improving the learning process. It would be interesting to one day see this stuff applied with projects like Mahout.

Anyway, here is Isabel's writeup on the subject:

The amount of digital data easily available for analysis, both in research and
in business, has increased tremendously during the last decade. One example of
such data is the event logs generated in health care during the patient handling
process. Another example is the event logs generated by standard workflow tools.
It is natural to ask whether it is possible to draw conclusions from these
logs, to generalize from what was observed, and to learn common process rules
from this data [http://wwwkramer.in.tum.de/ipm08/].

In recent years a rather large community of researchers has worked on the
problem of learning from example data. The goal of the new Apache project
Mahout [http://lucene.apache.org/mahout] is to create a commercially friendly,
stable, scalable suite of machine learning tools. The framework is designed
for high throughput and will be capable of handling massive datasets both
during training and during application, where that distinction exists. Our
focus is on scalability, and we intend to provide parallelized machine learning
algorithm implementations based on the Hadoop framework.
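To give a rough feel for what "parallelized on Hadoop" means, here is a minimal sketch of the map/reduce pattern in plain Python. It is not Mahout or Hadoop code, and it runs in a single process. The function names (`map_phase`, `shuffle`, `reduce_phase`, `mean_per_key`) are made up for illustration. The point is only that each record is mapped independently and each key is reduced independently, which is what lets a framework spread the work across machines.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record.

    Each record is handled on its own, so this step could be
    split across many workers.
    """
    for label, x in records:
        yield label, (x, 1)


def shuffle(pairs):
    """Group all emitted values by key (the framework's job in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result (here: the mean)."""
    total = sum(x for x, _ in values)
    count = sum(c for _, c in values)
    return key, total / count


def mean_per_key(records):
    """Run map, shuffle and reduce in sequence, in one process."""
    groups = shuffle(map_phase(records))
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Averaging numbers per key is the same shape of computation as, for example, recomputing a cluster centre from the points assigned to it.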

To date several basic algorithms and frameworks have been implemented and
integrated into Mahout. There are implementations for grouping data points
that are similar to each other (clustering). Based on a set of labeled
examples, it is possible to learn a classifier that is able to assign new data
points to existing categories (classification). Mahout has also integrated
Taste, a framework for learning which items to recommend to users given a log
of user interactions.
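As an illustration of the last idea, learning recommendations from a log of user interactions, here is a minimal sketch in plain Python. It does not use Taste's API and is not how Taste is implemented. It assumes the log is a list of hypothetical (user, item) pairs and uses simple co-occurrence counting: two items count as related when the same user interacted with both.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(log):
    """Build item-to-item co-occurrence counts from (user, item) pairs."""
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)

    # counts[a][b] = number of users who interacted with both a and b
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return items_by_user, counts


def recommend(log, user, top_n=3):
    """Suggest unseen items that co-occur most with the user's items."""
    items_by_user, counts = cooccurrence(log)
    seen = items_by_user.get(user, set())

    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count

    # Highest score first; ties broken by item name for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]
```

A real system would normalise the counts and avoid rebuilding the table on every call. The sketch only shows the shape of the problem: interaction log in, ranked items out.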

Currently the focus is mainly on recommendation mining and learning from
textual data, yet the community is open to new ideas and happily welcomes
contributions from people involved in other topics. So if you are
interested in machine learning or want to know what you could do with your
data, just drop by the dev or users mailing list and post your questions
and comments.



  1. Thought I'd mention these blogs on machine learning and NLP.


    I have a collection of machine learning PDFs, if you want more stuff after that book.

  2. woolfel: can you please post where/how to get hold of those PDFs?

    thank you,


  3. I've collected several AI and expert system related PDFs over the years. Many of them are readily available on the internet. I have several different lists of PDFs on my blog.

  4. I am unable to get itembasedrecommender working on Apache Mahout. Could you give an example of it? I am working on Ubuntu 12.04.

  5. Could you please share more of the machine learning PDFs with us?
