PIG AND HADOOP CONFIGURED!

Experience

I finally managed to configure Pig and Hadoop on my computer, using Pig 0.7 and Hadoop 0.20.2. It took me a while, but I made it. Hadoop and Pig are updated constantly, so don't rely too much on tutorials written for older versions unless you are very experienced with them. Nevertheless, I should mention this tutorial because it helped me a great deal in understanding how to configure Hadoop. My only major misunderstanding was with the ssh configuration, so if you are a beginner like me, be careful when changing your ssh settings.

Advice:

  1. Read the Apache tutorials on Pig and Hadoop, but watch out for some mistakes in the text.
  2. Use the tutorial files that ship inside the Pig folder (the files the Pig tutorial refers to are inside Pig's own folder).
  3. Get the latest stable versions.

1. Read the Apache tutorials on Pig and Hadoop, but watch out for some mistakes: the tutorials do contain a few errors. For example, in this part of the tutorial the id.pig script is:

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';

They forget to mention that you should use either dump or store; you may get errors if you use both. Also, if you copy and paste this code from the tutorial page, you will almost certainly get an error, because the quotes around id.out come through as typographic (curly) quotes. Retype them as plain single quotes, as in 'id.out' above. I also received an error with the following mapreduce script:

Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce

It cannot find the passwd file in the HDFS directory, and it has no output file to write the results to. Instead of figuring out the problem, I went ahead and ran another mapreduce job with a command from the next section of the tutorial (following the steps), and it worked!

$ java -cp $PIGDIR/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main script1-hadoop.pig

Since this script worked fine, something must be wrong with the previous one. Tomorrow I will test whether putting passwd on HDFS solves the problem.

2. Use the tutorial files that ship inside the Pig folder, and 3. Get the latest stable versions

This is important because things change between versions. I made a really stupid mistake here: I did not know that the files used in the Pig tutorial are actually inside a folder called "tutorial" inside my Pig folder. So I downloaded the tutorial files of a previous Pig version, and of course I kept getting errors, since I was running a later version of Pig. After I made the appropriate corrections, it worked!

The errors I was getting were scary and hard to interpret. For example, I got "INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///" and "ERROR mapReduceLayer.MapReduceLauncher: java.io.IOException: excite.log.bz2 does not exist" (posted here).

The problem was finally solved when I used the appropriate tutorial files. It was not easy to figure out.

Future work

Well, now that I have Pig and Hadoop running smoothly, I will start running a lot of experiments. My task is to give a "score" to tweets according to a list of words, with or without weights. For example, if my 8-word tweet is "Samsung Launching New Android Device on November   http://on.mash.to/9wJbGC" and my list has the three words Iphone, BlackBerry and Android, only one of the eight words (Android) is on the list, so the total weight of this tweet is 1/8. Things get more complicated when I have to filter content and use weights. I will first run my experiments on one large file containing a lot of tweets, and only after getting that right will I run them on the Yahoo cluster, which has a huge amount of data.
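To make the scoring idea concrete, here is a minimal sketch in plain Python (not Pig, and not the final implementation). The function name score_tweet, the optional weights argument, and the choices to split on whitespace and to match words case-insensitively are all my own assumptions for illustration. The URL is counted as a word here, which is how the tweet above gets to 8 words.

```python
def score_tweet(text, keywords, weights=None):
    """Return (sum of weights of matching words) / (total number of words).

    Hypothetical helper: words are split on whitespace and matched
    case-insensitively; a keyword without an explicit weight counts as 1.0.
    """
    words = text.split()
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    weight_of = {k.lower(): w for k, w in (weights or {}).items()}
    total = sum(weight_of.get(w.lower(), 1.0) for w in words if w.lower() in wanted)
    return total / len(words)


tweet = "Samsung Launching New Android Device on November   http://on.mash.to/9wJbGC"
keywords = ["Iphone", "BlackBerry", "Android"]

print(score_tweet(tweet, keywords))                    # 0.125, i.e. 1/8
print(score_tweet(tweet, keywords, {"Android": 2.0}))  # 0.25 with a made-up weight
```

The weight of 2.0 in the second call is invented just to show how weights would change the result.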

Questions

  1. Should we count numbers and URLs as words? I was told that URLs should be counted as words, but I am a little reluctant about it. (Of course RT, via, @… will not be counted.)
  2. I am worried about languages: how do I deal with tweets in different languages?
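To see how much the first question matters, here is a small Python sketch that counts the words of a tweet under either choice. Everything in it is an assumption for illustration: the helper name countable_words, the two flags, the simple regular expressions for URLs and numbers, and the sample tweet (the "RT @someuser" prefix is made up).

```python
import re

# Rough patterns, good enough for a sketch: a token that is entirely a URL or a number.
URL_RE = re.compile(r"^https?://\S+$")
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")
SKIP_TOKENS = {"rt", "via"}  # never counted as words


def countable_words(text, count_urls=True, count_numbers=True):
    """Return the tokens of a tweet that count as words under the given choices."""
    kept = []
    for token in text.split():
        if token.lower() in SKIP_TOKENS or token.startswith("@"):
            continue  # RT, via and @mentions are always dropped
        if not count_urls and URL_RE.match(token):
            continue
        if not count_numbers and NUMBER_RE.match(token):
            continue
        kept.append(token)
    return kept


tweet = "RT @someuser Samsung Launching New Android Device on November http://on.mash.to/9wJbGC"
print(len(countable_words(tweet)))                    # 8 words when URLs count
print(len(countable_words(tweet, count_urls=False)))  # 7 words when they do not
```

With one keyword match, the same tweet would score 1/8 or 1/7 depending on the answer, so the choice changes every score.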

Motivation

I will remotely attend a class on Pig in LA, California (an introductory two-hour course) given by Yahoo 🙂

