[Intro to Hadoop and MapReduce] Lesson 4 Problem set

1. Quiz: HDFS

Which of the following is true?

  • HDFS uses a central SAN (storage area network) to hold its data
  • HDFS stores a single copy of all data
  • HDFS replicates all data for reliability
  • To store 100TB of data in a Hadoop cluster, you would need 300TB of raw disk space by default
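
For the last option above, a quick back-of-the-envelope calculation may help; it assumes HDFS's default replication factor of 3 (the dfs.replication setting):

    # Rough raw-storage estimate, assuming HDFS's default replication factor of 3
    data_tb = 100            # logical data to be stored, in TB
    replication_factor = 3   # HDFS default (dfs.replication)
    raw_tb = data_tb * replication_factor
    print(raw_tb)            # 300 -- TB of raw disk space required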

2. Quiz: DataNode

Which of the following is true if one of the nodes running the DataNode daemon on the cluster fails?

  • Data could be lost
  • Hadoop will automatically re-replicate any blocks which were stored on that node
  • Hadoop will automatically e-mail the system administrator warning of the problem
  • Hadoop will continue, but from now on there will only be two copies of some blocks

3. Quiz: NameNode

What precautions can you take to reduce the likelihood of problems related to NameNode failure?

  • Configure the NameNode to store its metadata in a second location using NFS
  • Configure a standby NameNode
  • Run a NameNode daemon on every node in the cluster
  • Make sure the NameNode is running on high-end hardware

4. Quiz: MapReduce

If you run a MapReduce job and specify an output directory in HDFS which already exists, which of the following happens?

  • The previous directory will be deleted
  • The previous directory will be renamed, with __old appended to its name
  • The job will refuse to run
  • The job will run and new files will be put in the existing directory

5. Quiz: Key

Think about the data set we used in Lesson 2. If we wanted to work out how many people had purchased goods using a particular credit card, what could we use as the key emitted by the Mappers?

  • The store name
  • The product description
  • The purchase method
  • The purchase price
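
As a refresher on what "the key emitted by the Mappers" means, here is a minimal Hadoop Streaming mapper sketch in the spirit of Lesson 2. The tab-separated field layout (date, time, store, item, cost, payment) is an assumption about the Lesson 2 purchases file; substitute whichever field you believe answers the question as the key.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper sketch.
    # Assumed field layout (Lesson 2 purchases data):
    #   date, time, store, item, cost, payment
    import sys

    for line in sys.stdin:
        fields = line.strip().split("\t")
        if len(fields) == 6:
            date, time, store, item, cost, payment = fields
            # Whatever is printed first becomes the key the Reducers group on;
            # "store" here is purely illustrative -- swap in your chosen field.
            print("{0}\t{1}".format(store, 1))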

6. Quiz: Block Size

Why is Hadoop’s block size set to 64MB by default, when most filesystems have block sizes of 16KB or less?

  • If Hadoop’s block size were set to 16KB, there would be a huge number of blocks throughout the cluster, which would force the NameNode to manage an enormous amount of metadata
  • Since we need a Mapper for each block that we want to process, there would be a lot of Mappers, each processing a tiny piece of data, which isn’t efficient
  • Because you can only store one block per node on the cluster
  • Because a 16KB block is too small for multiple Mappers to process simultaneously
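
To get a feel for the first two options, here is a rough block-count comparison for a single large file (the 1 TB figure is just an illustrative assumption):

    # Block-count comparison for one 1 TB file (illustrative size)
    file_size = 1 * 1024 ** 4                        # 1 TB in bytes
    blocks_at_64mb = file_size // (64 * 1024 ** 2)   # 16,384 blocks
    blocks_at_16kb = file_size // (16 * 1024)        # 67,108,864 blocks
    print(blocks_at_64mb, blocks_at_16kb)
    # Each block is an entry the NameNode must track in memory, and each block
    # a job reads corresponds to (roughly) one Mapper task.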