[Intro to Hadoop and MapReduce] Lesson 1 Big data
1. Introduction
You can read more about Big Data on Wikipedia, an organization that itself generates and processes huge amounts of data.
MapReduce and Apache Hadoop are the technologies we will be talking about more in this course.
2. Data Sources
According to IBM: “Every day, 2.5 billion gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered in sensors and medical devices, videos and transaction records”
3. Quiz: Big Data
What is BIG DATA?
- Order details for a store
- All orders across 100s of stores
- A person’s stock portfolio
- All stock transactions for the NYSE
4. Definition of Big Data
A reasonable definition of big data might be: it’s data that’s too big to be processed on a single machine.
Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software. (International Journal of Internet Science, 2012, 7 (1), 1–5)
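To make "too big to be processed on a single machine" concrete, here is a rough back-of-the-envelope sketch. The data set size, the per-machine read throughput, and the cluster size are all assumed figures chosen for illustration, not measurements, and the helper `scan_time_seconds` is a hypothetical name. It also assumes the work splits evenly across machines, which real systems only approximate.

```python
def scan_time_seconds(dataset_bytes, throughput_bytes_per_s, machines=1):
    """Time to read the whole data set once, assuming an even split across machines."""
    return dataset_bytes / (throughput_bytes_per_s * machines)

TB = 10**12
MB = 10**6

# Assumed figures, for illustration only.
dataset = 100 * TB          # hypothetical 100 TB data set
disk_throughput = 100 * MB  # hypothetical 100 MB/s sequential read per machine

single = scan_time_seconds(dataset, disk_throughput)
cluster = scan_time_seconds(dataset, disk_throughput, machines=1000)

print(f"one machine:   {single / 86400:.1f} days")
print(f"1000 machines: {cluster / 60:.1f} minutes")
```

Under these assumptions, simply reading the data once takes days on one machine but minutes when spread across many, which is the motivation for distributed processing.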
5. Quiz: Challenges
Challenges with big data
- Most data is worthless
- Data is created fast
- Data from different sources in various formats
6. The 3 Vs - Volume
The 3 V’s were first defined in a research report by Douglas Laney in 2001 titled “3D Data Management: Controlling Data Volume, Velocity and Variety”.
In 2012 he updated the definition as follows “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”.
7. Quiz: Worthwhile Data
Data Worth Storing?
- Transactions
- Logs
- Business
- User
- Sensor
- Medical
- Social
8. Variety
The problem is that to store data in systems like that (traditional databases), the data needs to fit into pre-defined tables. A lot of the data we deal with these days tends to be what we call unstructured or semi-structured data.
9. Data Formats
The nice thing about Hadoop is that it doesn’t care what format your data comes in. Unlike with a traditional database, you can store the data in its raw format and manipulate and reformat it later.
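The idea of storing raw data first and imposing structure later can be sketched in plain Python. This is not Hadoop code: the sample records, the three formats, and the `parse` helper are all hypothetical, chosen only to show raw lines being kept as they arrived and interpreted at read time.

```python
import csv
import io
import json

# Hypothetical raw records, kept exactly as they arrived.
raw_lines = [
    '{"store": "A", "amount": 12.5}',  # JSON
    'B,7.25',                          # CSV: store,amount
    'store=C amount=3.0',              # key=value log line
]

def parse(line):
    """Interpret one raw line at read time and return a uniform record."""
    line = line.strip()
    if line.startswith("{"):
        rec = json.loads(line)
        return {"store": rec["store"], "amount": float(rec["amount"])}
    if "=" in line:
        fields = dict(pair.split("=", 1) for pair in line.split())
        return {"store": fields["store"], "amount": float(fields["amount"])}
    store, amount = next(csv.reader(io.StringIO(line)))
    return {"store": store, "amount": float(amount)}

records = [parse(line) for line in raw_lines]
total = sum(r["amount"] for r in records)
print(records)
print(total)
```

Because the raw lines are never rewritten, a different `parse` function could later extract other fields from the same stored data.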
10. Quiz: Using Variety
Variety
- Current GPS
- Current Plan
- Traffic Data
- Current Load
- Fuel Efficiency
11. Velocity
TB/day
12. Quiz: Your Interests
What data interests you? > Survey question; there is no right answer.
- Science
- E-commerce
- Financial
- Medical
- Sports
- Social
- Utilities
13. Doug Intro
14. Doug Cutting: The Origins of Hadoop
Doug Cutting, Creator of Hadoop
Here are the papers Google published about their distributed file system (GFS) and their processing framework, MapReduce.
15. Hadoop Logo Intro
16. Doug Cutting: The Name of Hadoop
The name came from his son’s toy.
17. Core Hadoop
Cloudera provides a free download of Chapter 2 of Tom White’s essential text, Hadoop: The Definitive Guide.
18. Hadoop Ecosystem
See more information about Pig, Hive, HBase, Impala, Mahout, Sqoop, Flume, Hue, Oozie.
- CDH
19. Congratulations
See more in the free Chapter 2 of Tom White’s essential text, Hadoop: The Definitive Guide