
[Intro to Hadoop and MapReduce] Lesson 3 HDFS and MapReduce

Sanghun Kang

Nov 1, 2017 • 2 min read

1. Quiz: HDFS

Is there a problem? > https://youtu.be/6F8-cCUbRU8

Network failure
Disk failure on DN(datanode)
Not all DN used
Block sizes differ
Disk failure on NN(namenode)

2. Quiz: Data Redundancy

Any problem now?(when NN failure)

Data inaccessible > when network failure on NN
Data lost forever > when disk failure on NN
No problem

3. NameNode Standby

The active NameNode works as before, but a standby NameNode can be configured to take over if the active one fails.

4. HDFS Demo

The hadoop fs commands work much like their Unix counterparts.
You can read instructions on how to access and run the virtual machines here

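hadoop fs -ls                             # list the HDFS home directory
hadoop fs -put purchases.txt              # copy the local file into HDFS
hadoop fs -ls                             # confirm the file is there
hadoop fs -tail purchases.txt             # show the end of the file
hadoop fs -mv purchases.txt newname.txt   # rename the file
hadoop fs -rm newname.txt                 # delete the file
hadoop fs -mkdir myinput                  # create a directory
hadoop fs -put purchases.txt myinput      # copy the local file into that directory
hadoop fs -ls myinput                     # list the directory contents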

5. MapReduce

6. Real World Example

7. Quiz: Hashtables

Hashtables > Key -> Value problems?

It won’t work
Run out of memory
Long time
Wrong answer
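
To make the question concrete, here is a minimal sketch of the serial hashtable approach the quiz asks about. It assumes each input line is tab-separated, with the store name in the third field and the sale amount in the fifth. That layout, the function name total_sales_by_store, and the sample lines are all made up for illustration and are not taken from these notes.

```python
from collections import defaultdict


def total_sales_by_store(lines):
    """Sum the sale amount per store using one in-memory dictionary."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed lines
        store, cost = fields[2], fields[4]  # assumed column positions
        totals[store] += float(cost)
    return dict(totals)


# Hypothetical sample input, not real course data.
sample = [
    "2012-01-01\t09:00\tMiami\tBooks\t10.50\tVisa",
    "2012-01-01\t09:01\tNYC\tToys\t20.00\tCash",
    "2012-01-01\t09:02\tMiami\tMusic\t4.50\tVisa",
]
print(total_sales_by_store(sample))
```

On a very large input, this single process reads every line in turn and the dictionary grows with the number of distinct keys, which is where running time and memory become concerns.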

8. Distributed Work

9. Summary of MapReduce

Note: Hadoop takes care of the Shuffle and Sort phase. You do not have to sort the keys in your reducer code; they arrive already sorted.
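
Because the keys arrive sorted, a reducer only needs to watch for the key changing. The following is a rough sketch of that pattern, assuming the reducer receives tab-separated key/value lines and sums the values. The function name reduce_sorted and the sample lines are hypothetical.

```python
def reduce_sorted(lines):
    """Yield (key, total) pairs from key/value lines already sorted by key."""
    current_key = None
    total = 0.0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        key, value = parts
        if current_key is not None and key != current_key:
            # The key changed, so the previous key's group is complete.
            yield current_key, total
            total = 0.0
        current_key = key
        total += float(value)
    if current_key is not None:
        yield current_key, total  # emit the final group


# Hypothetical input, already sorted by key as the reducer would see it.
sorted_input = ["Miami\t10.5", "Miami\t4.5", "NYC\t20.0"]
for key, total in reduce_sorted(sorted_input):
    print(f"{key}\t{total}")
```

Only one running total is held at a time, so memory use does not grow with the number of keys.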

10. Quiz: Sort Final Result

Final results in sorted order?

Impossible
Only one reducer
Extra step

11. Quiz: Multiple Reducers

There are four intermediate keys: Apple, Banana, Carrot, Grape. Which keys go to the first reducer?

Apple, Banana
Apple, Carrot
Carrot, Grape
Apple, Grape
Don’t Know; 2 Each
Don’t Know

One reducer might even get none. See a nice overview of partitioning in Hadoop.
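
To see why the assignment cannot be known in advance, here is a small sketch of hash partitioning: each key is hashed and the result is taken modulo the number of reducers. The hash used here (zlib.crc32) is a stand-in chosen for this sketch, and the reducer count of 2 is an assumption, so the printed assignment will not necessarily match what a real cluster does.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key from a deterministic hash."""
    # zlib.crc32 is a stand-in for this sketch; Hadoop's default
    # partitioner uses the key's own hash code instead.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["Apple", "Banana", "Carrot", "Grape"]
num_reducers = 2  # assumed for illustration

assignment = {r: [] for r in range(num_reducers)}
for key in keys:
    assignment[partition(key, num_reducers)].append(key)

for reducer, assigned in assignment.items():
    print(reducer, assigned)
```

The same key always lands on the same reducer, but nothing forces the keys to split evenly, so an uneven split or an empty reducer is possible.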

12. Daemons of MapReduce

Job Tracker
Task Trackers

13. Running a Job

Running a MapReduce job with the VM alias:

hs {mapper script} {reducer script} {input_file} {output directory}

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input myinput -output joboutput

hadoop fs -get joboutput/part-00000 mylocalfile.txt
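
Before submitting a job, it can help to check the map and reduce logic on a few lines locally. The sketch below imitates the map, sort, and reduce stages in one process. The mapper and reducer functions are hypothetical stand-ins (the notes do not show mapper.py or reducer.py), and the tab-separated layout with the store in the third field and the cost in the fifth is an assumption.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (store, cost) from one tab-separated line (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 5:
        yield fields[2], float(fields[4])


def reducer(key, values):
    """Sum all values collected for one key."""
    return key, sum(values)


def run_local(lines):
    """Map every line, sort the pairs by key, then reduce each key group."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))  # local stand-in for shuffle and sort
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical sample input, not real course data.
sample_lines = [
    "2012-01-01\t09:00\tNYC\tToys\t20.00\tCash",
    "2012-01-01\t09:01\tMiami\tBooks\t10.50\tVisa",
    "2012-01-01\t09:02\tMiami\tMusic\t4.50\tVisa",
]
for key, total in run_local(sample_lines):
    print(f"{key}\t{total}")
```

This only checks the logic on a single machine; it says nothing about how the job behaves when split across several mappers and reducers.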

14. Simplifying Things

15. A Different Application

16. Other Problems

17. Virtual Machine Setup

You can read instructions on how to download and run the virtual machines here.

Information on how to transfer files back and forth to the virtual machine can be found here.

For step-by-step instructions on how to load data into HDFS, please re-watch HDFS Demo. For a reminder of how to run a MapReduce job, please re-watch Simplifying Things.

18. Conclusion

See more in the free Chapter 6 of Tom White’s essential text, Hadoop: The Definitive Guide.