Curated by Mathew Anthony for those who want to get, keep and grow their customers ... and some trending issues
Tuesday, July 1, 2014
Learn Hadoop, MapReduce and BigData from Scratch
A Complete Guide to Learn and Master the Popular Big Data Technologies
The growth of data, both structured and unstructured, is a big technological challenge and thus a great opportunity for IT and technology professionals worldwide. There is simply too much data and too few professionals to manage and analyze it. This comprehensive course will help you master the concepts, technologies and processes involved in Big Data.
In this course we will primarily cover MapReduce and its most popular implementation, Apache Hadoop. We will also cover the Hadoop ecosystem and the practical concepts involved in handling very large data sets.
The MapReduce algorithm is used in Big Data to scale computations. Running in parallel, MapReduce programs load a manageable chunk of data into RAM, perform some intermediate calculations, load the next chunk, and keep going until all of the data has been processed. In its simplest representation, the algorithm can be broken down into a Map step, which typically takes a data set we can think of as ‘unstructured’, followed by a Reduce step, which outputs a ‘structured’ and often smaller data set.
In its simplest sense Hadoop is an implementation of the MapReduce Algorithm.
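The Map and Reduce steps described above can be sketched in plain Java with no Hadoop dependencies; the class and method names below are illustrative only, not part of the Hadoop API:

```java
import java.util.*;

// Plain-Java sketch of the MapReduce pattern applied to word counting.
// No Hadoop dependencies; names here are illustrative, not a real API.
public class WordCountSketch {

    // Map step: turn an "unstructured" line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts for each key, producing the smaller
    // "structured" data set the text describes.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Drive both phases over a chunk of input lines; Hadoop does the same
    // thing, but distributed across many machines in parallel.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        return reduce(intermediate);
    }
}
```

In real Hadoop the intermediate pairs are shuffled and sorted across the cluster between the two phases, which is what lets the same pattern scale far beyond one machine's RAM.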
The term Hadoop is a convenient shorthand. There is the Hadoop project at a high level; then there is a core selection of tools that Hadoop refers to, such as the Hadoop Distributed File System (HDFS), the HDFS shell and the HDFS protocol ‘hdfs://’. Then there is a bigger stack of tools that are becoming central to the use of Hadoop, often referred to as the ‘Hadoop Ecosystem’. These tools include, but are not limited to, HBase, Pig, Hive, Crunch, Mahout and Avro. Finally there is the new Hadoop 2.2.x version, which implements a new architecture for MapReduce and allows for efficient workflows using a ‘DAG’ of jobs, a significant evolution of the classic MapReduce job.
Finally, Hadoop is written in Java. In Hadoop we see Java’s significant contribution to the evolution of distributed computing, as represented by Hadoop 2.2 and the Hadoop Ecosystem.
What are the requirements?
1. Familiarity with programming in Java. You can take our Java course for free if you want to brush up your Java skills here.
2. Familiarity with Linux.
3. Access to an Amazon EMR account.
4. Oracle VirtualBox or VMware installed and functioning.
What Will I Learn?
In this course you will learn key concepts in Hadoop and how to write your own Hadoop jobs and MapReduce programs.
The course will specifically facilitate the following high-level outcomes:
1. Become literate in Big Data terminology and Hadoop.
2. Given a big data scenario, understand the role of Hadoop in overcoming the challenges posed by the scenario.
3. Understand how Hadoop functions in both storing and processing Big Data.
4. Understand the difference between MapReduce version 1 in Hadoop version 1.x.x and MapReduce version 2 in Hadoop version 2.2.x.
5. Understand the Distributed File System architecture and implementations such as the Hadoop Distributed File System or the Google File System.
6. Analyze and implement a MapReduce workflow, and design Java classes for ETL (extract, transform and load) and UDFs (user-defined functions) for this workflow.
7. Perform data mining and filtering.
The course will specifically facilitate the following practical outcomes:
1. Use the HDFS shell.
2. Use the Cloudera, Hortonworks and Apache Bigtop virtual machines for Hadoop code development and testing.
3. Configure, execute and monitor a Hadoop job.
4. Use Hadoop data types, readers, writers and splitters.
5. Write ETL and UDF classes for Hadoop workflows with Pig and Hive.
6. Write filters for data mining and processing with Mahout, Crunch and Avro.
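As a taste of the filtering idea in the last outcome, here is a minimal plain-Java sketch of a record filter over a term-frequency table. The library-specific filter classes (such as those in Crunch or Mahout) are what the course covers; the names below are purely illustrative:

```java
import java.util.*;
import java.util.function.Predicate;

// Plain-Java sketch of a data-mining filter: keep only frequent terms
// that are not stop words. Names are illustrative, not a library API.
public class TermFilterSketch {

    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "and", "or", "of", "to"));

    // Build a predicate over (term, count) entries: drop stop words and
    // terms that occur fewer than minCount times.
    static Predicate<Map.Entry<String, Integer>> frequentTerms(int minCount) {
        return e -> !STOP_WORDS.contains(e.getKey()) && e.getValue() >= minCount;
    }

    // Apply the filter to a term-frequency table, preserving entry order.
    static Map<String, Integer> filter(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        Predicate<Map.Entry<String, Integer>> keep = frequentTerms(minCount);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (keep.test(e)) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

In a distributed setting the same predicate logic would run independently on each node's slice of the data, which is why filters fit the MapReduce model so naturally.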