I finally managed to configure Pig and Hadoop on my computer. I used Pig 0.7 and Hadoop 0.20.2. It took me a while to configure, but I finally made it. Hadoop and Pig are constantly being updated, so don't rely too much on tutorials for older versions if you are not very experienced in the matter. Nevertheless, I should mention this tutorial because it helped me a great deal in understanding how to configure Hadoop. The only major misunderstanding was with the ssh configuration, so if you are a beginner like me, be careful when you get to the ssh part.
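In case it helps, what Hadoop actually needs (for its start/stop scripts in pseudo-distributed mode) is passwordless ssh to localhost; the usual recipe looks roughly like this, but double-check it against the Hadoop docs for your version:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa          # key with an empty passphrase
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize it for localhost
$ ssh localhost                                      # should now log in without asking for a password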
- Read the Apache tutorials on Pig and Hadoop, but be careful: they have some mistakes in the writing
- Use the tutorial that comes inside the Pig folder (the tutorial files mentioned in the Pig tutorial are inside Pig's own folder).
- Get the latest stable versions
1. Read the Apache tutorials on Pig and Hadoop but be careful with some mistakes: they really do have some mistakes; for example, in this part of the tutorial the id.pig script is:
A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id; dump B; store B into ‘id.out’;
They forget to mention that you should use either dump or store; you may get errors if you use both. Second, if you copy and paste this code you will definitely get an error, because the quotes around the last 'id.out' are typographic (curly) quotes, so retype that part with straight quotes as 'id.out' (not the same characters as above).
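With both fixes applied, the id.pig I ended up with looks like this (only the quote characters and the dump/store choice change; use dump B; instead of store if you just want to see the output on screen):

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';

I also received an error with the following mapreduce script: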
Unix: $ java -cp pig.jar:.:$HADOOPDIR idmapreduce
It could not find the passwd file in the HDFS directory and it did not have an output file to write the results to. Instead of figuring out the problem, I went ahead and ran another mapreduce job with another command from the next section of the tutorial (following the steps) and it worked!
$ java -cp $PIGDIR/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main script1-hadoop.pig
So if this script worked fine, the previous one must have had something wrong with it; I will test tomorrow whether putting passwd on HDFS solves the problem.
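If the missing file really is the issue, copying it into HDFS first should be enough; I expect something along these lines to do it (not tested yet, and the local path is just an example):
$ hadoop fs -put /etc/passwd passwd     # copy the local file into my HDFS home directory
$ hadoop fs -ls                          # check that passwd is now there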
2. Use the tutorial that comes inside the Pig folder (the tutorial files mentioned in the Pig tutorial are inside Pig's own folder) and 3. Get the latest stable versions
This is important because there are changes between versions. I made a really stupid mistake here: I did not know that the files used for testing in the Pig tutorial are actually inside a folder called "tutorial" within my Pig folder. So I downloaded the tutorial files of a previous Pig version… and of course I kept getting errors, since I was running a later version of Pig. After I made the appropriate corrections, it worked!!
The errors I was getting were scary and hard to interpret; for example: "INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///" and "ERROR mapReduceLayer.MapReduceLauncher: java.io.IOException: excite.log.bz2 does not exist" (posted here).
They were finally solved when I used the appropriate tutorial files. It was not easy to figure out.
Well, now that I have Pig and Hadoop running smoothly, I will start running a lot of experiments. My task is to give a "score" to tweets according to a list of words, with or without weights. For example, if my 8-word tweet is "Samsung Launching New Android Device on November http://on.mash.to/9wJbGC" and my list has the three words Iphone, BlackBerry and Android, the total weight of this tweet will be 1/8. Things get more complicated when I have to filter content and use weights… I will first run my experiments on one large file containing a lot of tweets, and THEN, after getting it right, I will run them on Yahoo's cluster… which has a huge amount of data.
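For the simple, unweighted case my first Pig sketch looks something like this (not tested yet; tweets.txt and keywords.txt are just placeholder names, with one tweet per line and one keyword per line):

-- unweighted score: (number of keyword hits) / (number of words in the tweet)
tweets   = load 'tweets.txt' as (tweet:chararray);
-- one row per (tweet, word)
words    = foreach tweets generate tweet, FLATTEN(TOKENIZE(tweet)) as word;
keywords = load 'keywords.txt' as (keyword:chararray);
-- keep only the words that appear in the keyword list
matches  = join words by word, keywords by keyword;
grouped  = group matches by words::tweet;
scores   = foreach grouped generate group as tweet,
           (double)COUNT(matches) / (double)SIZE(TOKENIZE(group)) as score;
store scores into 'scores.out';

Two things to keep in mind: tweets with no keyword match are dropped by the join, so they simply don't get a score in this version; and TOKENIZE splits more or less on whitespace, so the url would count as one of the 8 words here, which is exactly one of my open questions below.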
- Should we count numbers and urls as words? I was told that urls should count as words, but I am a little reluctant about it. (Of course RT, via, @… will not be counted.)
- I am also worried about the different languages… how do I sort that out?
I will remotely attend a class on Pig in LA, California (an introductory 2-hour course) given by Yahoo 🙂