I finally managed to configure Pig and Hadoop on my computer. I used Pig 0.7 and Hadoop 0.20.2. It took me a while, but in the end I made it. Hadoop and Pig are updated constantly, so don't rely too much on tutorials written for older versions unless you are very experienced in the matter. Nevertheless, I should mention this tutorial because it helped me a great deal in understanding how to configure Hadoop. The only major misunderstanding was with the ssh configuration, so if you are a beginner like me, be careful when you mess with ssh.


  1. Read the Apache tutorials on Pig and Hadoop, but watch out for the mistakes they contain
  2. Use the tutorial that ships inside the Pig folder (the tutorial files mentioned in the Pig tutorial are inside Pig's own folder).
  3. Get the latest stable versions

1. Read the Apache tutorials on Pig and Hadoop, but watch out for the mistakes: they really do contain some. For example, in this part of the tutorial the id.pig script is:

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into ‘id.out’;

They forget to mention that you should use either dump or store… you may get errors if you use both. Second, if you copy and paste this code you will certainly get an error, because of the quote characters around id.out on the last line. Retype them as plain straight single quotes, like the ones around 'passwd' on the first line. I also received an error with the following mapreduce script

Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce

It cannot find the passwd file in the HDFS directory, and it has no output file to write the results to. Instead of figuring out the problem, I went ahead and ran another mapreduce job using a command from the next section of this tutorial (following the steps), and it worked!

$ java -cp $PIGDIR/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main

Since this command worked fine, something must be wrong with the previous one. Tomorrow I will test whether putting passwd on HDFS solves the problem.

2. Use the tutorial that ships inside the Pig folder (the tutorial files mentioned in the Pig tutorial are inside Pig's own folder) and 3. Get the latest stable versions

This is important because things change between versions. I made a really stupid mistake here. I did not know that the files used for testing in the Pig tutorial are actually inside a folder called "tutorial" inside my Pig folder. So I downloaded the tutorial for a previous version of Pig… and of course I kept getting errors, since I was running a later version of Pig. After I made the appropriate corrections, it worked!!

The errors I was getting were scary and hard to interpret. I got, for example: "INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///" and "ERROR mapReduceLayer.MapReduceLauncher: java.io.IOException: excite.log.bz2 does not exist" (posted here).

It was finally solved when I used the appropriate tutorial files. It was not easy to figure out.

Future work

Well, now that I have Pig and Hadoop running smoothly, I will start running a lot of experiments. My task is to give a "score" to tweets according to a list of words, with or without weights. For example, if my 8-word tweet is "Samsung Launching New Android Device on November http://on.mash.to/9wJbGC" and my list has three words (iPhone, BlackBerry and Android), then one of the eight words matches and the total weight of this tweet is 1/8. Things get more complicated when I have to filter content and use weights… I will first run my experiments on one large file containing a lot of tweets, and only THEN, once I have it right, will I run them on Yahoo's cluster… which has a huge amount of data.


  1. Should we consider numbers and URLs as words? I was told that URLs should count as words, but I am a little reluctant about it. (Of course RT, via, @… will not be counted.)
  2. I am worried about languages… how do I sort that out?
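To make the scoring idea concrete, here is a minimal sketch in Python (not Pig, and not my final method). It assumes a tweet is split on whitespace only, that matching ignores case, and that every listed word has weight 1.0 unless stated otherwise. The names `tokenize`, `score_tweet` and the `count_urls` switch are made up for this illustration.

```python
import re

# Tokens that should not count as words (retweet markers); @mentions are
# handled separately below.
IGNORED_TOKENS = {"rt", "via"}
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def tokenize(tweet, count_urls=True):
    """Split a tweet on whitespace and drop tokens that should not count."""
    words = []
    for token in tweet.split():
        lowered = token.lower()
        if lowered in IGNORED_TOKENS or lowered.startswith("@"):
            continue
        if not count_urls and URL_PATTERN.match(token):
            continue
        words.append(lowered)
    return words


def score_tweet(tweet, weights, count_urls=True):
    """Return the summed weight of matched words divided by the word count."""
    words = tokenize(tweet, count_urls)
    if not words:
        return 0.0
    matched = sum(weights.get(word, 0.0) for word in words)
    return matched / len(words)


tweet = "Samsung Launching New Android Device on November http://on.mash.to/9wJbGC"
# Assumption: unweighted lists are modelled as weight 1.0 per word.
keywords = {"iphone": 1.0, "blackberry": 1.0, "android": 1.0}

print(score_tweet(tweet, keywords))                    # 1 match out of 8 words
print(score_tweet(tweet, keywords, count_urls=False))  # 1 match out of 7 words
```

The `count_urls` switch is there only to compare the two options from question 1. Note that this sketch does not strip punctuation, so a token like "Android," would not match; numbers are always counted as words here.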


I will remotely attend an introductory two-hour class on Pig, given by Yahoo in Los Angeles, California 🙂

Posted in Research, Uncategorized

Going to Yahoo R&D

I started my PhD with a lot of work. After two weeks of getting things ready and beautifying my desk, I was moved to Yahoo R&D, where I am supposed to stay at least half of the day… but in reality I am staying almost the whole day because of the complexity of my tasks.

I am motivated, of course, but I have to do a lot of things I have never done before… lots of learning these days.

What I can say is that the project I am getting into is very interesting, because it will analyze the "diversity" of opinions and cultural differences… or at least try to capture that from what people say online.

Hadoop + Pig = me crazy.

How is that for my first post?

Posted in Research

Hello world!

Hello world, this is my new blog!

I have always liked to write about what I do and think. I have kept a diary since I was 11. I write in it much less often now, and I no longer keep it on paper: everything I write is digital. I think that is one of the reasons for my very bad handwriting.

How does this blog differ from the previous one? In this blog, I will focus more on my PhD and everything that I discover along the way. In other words, no drama, no love stories, just science, code and of course some observations about life in general.

I will write mostly in English but I may be tempted to write in Spanish a couple of times. English is not my native language but I take it as a challenge and a way to practice my writing skills.

Posted in Uncategorized