[Intro to Hadoop and MapReduce] Lesson 4 Problem set
1. Quiz: HDFS
Which of the following is true?
- HDFS uses a central SAN (storage area network) to hold its data
- HDFS stores a single copy of all data
- HDFS replicates all data for reliability
- To store 100TB of data in a Hadoop cluster you would need 300TB of raw disk space by default
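For scale, HDFS keeps three copies of every block by default (the `dfs.replication` setting), so the raw-capacity arithmetic is just a multiplication. A quick back-of-the-envelope check in Python:

```python
replication_factor = 3    # HDFS default (dfs.replication)
logical_data_tb = 100     # the data you actually want to store

raw_disk_tb = logical_data_tb * replication_factor
print(raw_disk_tb)        # 300 -- 100TB of data needs 300TB of raw disk
```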
2. Quiz: DataNode
Which of the following is true if one of the nodes running the DataNode daemon on the cluster fails?
- Data could be lost
- Hadoop will automatically re-replicate any blocks which were stored on that node
- Hadoop will automatically e-mail the system administrator warning of the problem
- Hadoop will continue, but from now on there will only be two copies of some blocks
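To make the re-replication idea concrete, here is a minimal sketch of how a NameNode could detect which blocks fell below the replication target after a DataNode failure. This is a toy model of the bookkeeping, not Hadoop's actual implementation:

```python
def blocks_to_rereplicate(block_locations, failed_node, target=3):
    """Return each under-replicated block and how many new copies it needs.

    block_locations maps a block ID to the set of DataNodes holding it;
    this is a toy model, not the NameNode's real data structure.
    """
    needed = {}
    for block, nodes in block_locations.items():
        live = nodes - {failed_node}
        if len(live) < target:
            needed[block] = target - len(live)
    return needed

# Example: node "dn2" dies; block "b1" drops to two live copies.
locations = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn1", "dn3", "dn4"}}
print(blocks_to_rereplicate(locations, "dn2"))  # {'b1': 1}
```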
3. Quiz: NameNode
What precautions can you take to reduce the likelihood of problems related to NameNode failure?
- Configure the NameNode to store its metadata in a second location using NFS
- Configure a standby NameNode
- Run a NameNode daemon on every node in the cluster
- Make sure the NameNode is running on high-end hardware
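For reference, the NFS option works by having the NameNode persist its metadata to more than one configured directory (Hadoop's `dfs.name.dir` accepts a comma-separated list), so a single disk failure cannot wipe out the filesystem metadata. A hypothetical sketch of that redundant-write idea; the paths are made up and this is not the NameNode's real code:

```python
import os

# Hypothetical stand-ins for two metadata directories
# (one local disk, one NFS mount); paths are illustrative only.
METADATA_DIRS = ["/tmp/nn-local", "/tmp/nn-nfs"]

def persist_edit(edit_record):
    """Append one metadata edit to every configured directory,
    mimicking (not reproducing) the dfs.name.dir behavior."""
    for d in METADATA_DIRS:
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "edits.log"), "a") as f:
            f.write(edit_record + "\n")

persist_edit("mkdir /user/training/input")
```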
4. Quiz: MapReduce
If you run a MapReduce job and specify an output directory in HDFS which already exists, which of the following happens?
- The previous directory will be deleted
- The previous directory will be renamed with __old after its name
- The job will refuse to run
- The job will run and new files will be put in the existing directory
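Hadoop refuses to start the job rather than risk silently overwriting or mixing in results from an earlier run. A minimal sketch of the same guard, written as plain Python rather than Hadoop's own check:

```python
import os
import sys

def check_output_dir(path):
    """Refuse to start if the output directory already exists, so
    earlier results are never silently clobbered. This mirrors
    Hadoop's behavior; it is not Hadoop's actual code."""
    if os.path.exists(path):
        sys.exit("Error: output directory '{}' already exists".format(path))

check_output_dir("/user/training/output")  # hypothetical HDFS-style path
```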
5. Quiz: Key
Think about the data set we used in Lesson 2. If we wanted to work out how many people had purchased goods using a particular credit card, what could we use as the key emitted by the Mappers?
- The store name
- The product description
- The purchase method
- The purchase price
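Counting purchases per card type is a word-count-shaped job: the Mapper emits the purchase method as the key and 1 as the value, and the Reducer sums the 1s per key. A sketch of such a Hadoop Streaming mapper in Python, assuming the Lesson 2 records are tab-separated with the payment method as the last of six fields (the exact field layout is an assumption):

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) == 6:               # assumed Lesson 2 record layout
        payment_method = fields[5]     # e.g. "Visa", "MasterCard"
        # Key = purchase method, value = 1; the Reducer sums the 1s.
        print("{0}\t{1}".format(payment_method, 1))
```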
6. Quiz: Block Size
Why is Hadoop’s block size set to 64MB by default, when most filesystems have block sizes of 16KB or less?
- If Hadoop’s block size were set to 16KB, there would be a huge number of blocks throughout the cluster, which would force the NameNode to manage an enormous amount of metadata
- Since we need a Mapper for each block that we want to process, there would be a lot of Mappers, each processing a tiny piece of data, which isn’t efficient
- Because you can only store one block per node on the cluster
- Because a 16KB block is too small for multiple Mappers to process simultaneously
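A quick back-of-the-envelope comparison makes the metadata argument concrete: here is how many blocks (and therefore NameNode metadata entries and Mappers) 100TB of data would generate at each block size:

```python
TB = 1024 ** 4                     # bytes in a terabyte
data = 100 * TB                    # a modest 100TB of stored data

blocks_at_16kb = data // (16 * 1024)
blocks_at_64mb = data // (64 * 1024 ** 2)

print(blocks_at_16kb)   # 6710886400 (~6.7 billion metadata entries)
print(blocks_at_64mb)   # 1638400   (~1.6 million -- far more manageable)
```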