[Intro to Hadoop and MapReduce] Lesson 3 HDFS and MapReduce

1. Quiz: HDFS

Is there a problem? > https://youtu.be/6F8-cCUbRU8

  • Network failure
  • Disk failure on DN(datanode)
  • Not all DN used
  • Block sizes differ
  • Disk failure on NN(namenode)

2. Quiz: Data Redundancy

Any problem now (when the NN fails)?

  • Data inaccessible > when there is a network failure on the NN
  • Data lost forever > when there is a disk failure on the NN
  • No problem

3. NameNode Standby

The active NameNode works as before, but a standby NameNode can be configured to take over if the active one fails.

4. HDFS Demo

  • hadoop fs commands work much like their Unix counterparts
  • You can read instructions on how to access and run the virtual machines here
hadoop fs -ls
hadoop fs -put purchases.txt
hadoop fs -ls
hadoop fs -tail purchases.txt
hadoop fs -mv purchases.txt newname.txt
hadoop fs -rm newname.txt
hadoop fs -mkdir myinput
hadoop fs -put purchases.txt myinput
hadoop fs -ls myinput

5. MapReduce

6. Real World Example

7. Quiz: Hashtables

Hashtables > Key -> Value problems?

  • It won’t work
  • Run out of memory
  • Long time
  • Wrong answer
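The single-machine approach behind this quiz can be sketched as follows. This is a minimal illustration, assuming (hypothetically) that each input line holds a store name and a sale amount separated by a tab; the function name `total_sales_by_store` is made up for this sketch. It shows why the idea is correct in principle, while the whole table lives in one process's memory and every line is read by a single loop.

```python
from collections import defaultdict


def total_sales_by_store(lines):
    """Sum the sale amount per store using an in-memory dict (hashtable)."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        store, cost = fields
        totals[store] += float(cost)  # key -> running total
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample records; a real input file would be far larger.
    sample = ["Miami\t12.50\n", "NYC\t3.00\n", "Miami\t7.50\n"]
    print(total_sales_by_store(sample))
```

With a handful of keys this is fine; the concern raised by the quiz options is what happens as the number of records and distinct keys grows.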

8. Distributed Work

9. Summary of MapReduce

Note: Hadoop takes care of the Shuffle and Sort phase. You do not have to sort the keys in your reducer code; they arrive already sorted.
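Because keys arrive sorted, a reducer only needs to notice when the key changes. The sketch below assumes tab-separated `key<TAB>value` lines with numeric values; that format and the name `reduce_sorted` are assumptions for illustration, not something these notes specify.

```python
def reduce_sorted(lines):
    """Yield (key, total) pairs from lines that are already sorted by key."""
    current_key = None
    total = 0.0
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        key, value = fields
        if current_key is not None and key != current_key:
            # Key changed: the previous key is complete, so emit it.
            yield current_key, total
            total = 0.0
        current_key = key
        total += float(value)
    if current_key is not None:
        yield current_key, total  # emit the last key


if __name__ == "__main__":
    # As a streaming reducer this loop would iterate over sys.stdin instead.
    sample = ["a\t1\n", "a\t2\n", "b\t5\n"]
    for key, total in reduce_sorted(sample):
        print(f"{key}\t{total}")
```

Only one running total is held at a time, which is what the sorted order buys: no dictionary of all keys is needed.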

10. Quiz: Sort Final Result

Final results in sorted order?

  • Impossible
  • Only one reducer
  • Extra step

11. Quiz: Multiple Reducers

There are 4 intermediate keys: Apple, Banana, Carrot, Grape. Which keys go to the first reducer?

  • Apple, Banana
  • Apple, Carrot
  • Carrot, Grape
  • Apple, Grape
  • Don’t Know; 2 Each
  • Don’t Know
One reducer might even get none of the keys. See a nice overview of partitioning in Hadoop.
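Why the assignment cannot be guessed from the key names can be sketched with a hash partitioner. Hadoop's default partitioner picks a reducer as (hash of key & Integer.MAX_VALUE) % number of reducers. The hash function below imitates Java's `String.hashCode()` for ASCII strings and is only a stand-in: the hash Hadoop actually computes depends on the key type, so the buckets printed here are not necessarily what a real job would produce. The names `java_string_hash`, `partition`, and `assign` are hypothetical.

```python
def java_string_hash(s):
    """32-bit hash in the style of Java's String.hashCode() (ASCII input)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h


def partition(key, num_reducers):
    """Pick a reducer index: (hash & Integer.MAX_VALUE) % num_reducers."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def assign(keys, num_reducers):
    """Group keys by the reducer index they hash to."""
    buckets = {i: [] for i in range(num_reducers)}
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return buckets


if __name__ == "__main__":
    print(assign(["Apple", "Banana", "Carrot", "Grape"], 2))
```

Every occurrence of a given key lands on the same reducer, but nothing forces the keys to split evenly, so a reducer can end up with all of them or none.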

12. Daemons of MapReduce

  • Job Tracker
  • Task Trackers

13. Running a Job

Running a MapReduce job with the VM alias:

hs {mapper script} {reducer script} {input_file} {output directory}

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input myinput -output joboutput

hadoop fs -get joboutput/part-00000 mylocalfile.txt
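These notes do not show the `mapper.py` passed to `-mapper`. A minimal sketch of what such a script might look like, assuming each line of purchases.txt has six tab-separated fields (date, time, store, item, cost, payment); that layout and the name `map_line` are assumptions for illustration. With Hadoop Streaming, a mapper reads lines from standard input and writes `key<TAB>value` lines to standard output.

```python
def map_line(line):
    """Return 'store<TAB>cost' for a well-formed record, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None  # defensively skip malformed records
    date, time, store, item, cost, payment = fields
    return f"{store}\t{cost}"


if __name__ == "__main__":
    # As mapper.py under Hadoop Streaming, iterate over sys.stdin instead.
    # The records below are made-up sample data.
    sample = [
        "2012-01-01\t09:00\tSan Jose\tMusic\t66.08\tCash\n",
        "bad line\n",
    ]
    for line in sample:
        out = map_line(line)
        if out is not None:
            print(out)
```

Skipping lines with the wrong field count keeps one bad record from failing the whole task.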

14. Simplifying Things

15. A Different Application

16. Other Problems

17. Virtual Machine Setup

You can read instructions on how to download and run the virtual machines here.

Information on how to transfer files back and forth to the virtual machine can be found here.

For step-by-step instructions on how to load data into HDFS, please re-watch HDFS Demo. For a reminder of how to run a MapReduce job, please re-watch Simplifying Things.

18. Conclusion

See more in the free Chapter 6 of Tom White’s essential text, Hadoop: The Definitive Guide