Amazon Elastic MapReduce (Amazon EMR)

Amazon Elastic MapReduce (Amazon EMR) provides you with a fully managed, on-demand Hadoop framework. Amazon EMR reduces the complexity and up-front costs of setting up Hadoop and, combined with the scale of AWS, lets you launch large Hadoop clusters on demand and start processing within minutes.

AWS EMR: Overview
  • Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses Hadoop processing combined with several AWS products to do tasks such as:
    • web indexing
    • data mining
    • log file analysis
    • machine learning
    • scientific simulation
    • data warehousing
  • When you launch an Amazon EMR cluster, you specify several options, the most important being:
    • The instance type of the nodes in your cluster
    • The number of nodes in your cluster
    • The version of Hadoop you want to run (Amazon EMR supports several recent versions of Apache Hadoop, and also several versions of MapR Hadoop.)
    • Additional tools or applications like Hive, Pig, Spark, or Presto
  • There are two types of storage that can be used with Amazon EMR:
    • Hadoop Distributed File System (HDFS)
      • HDFS is the standard file system that comes with Hadoop.
      • All data is replicated across multiple instances to ensure durability.
      • Amazon EMR can use Amazon EC2 instance storage or Amazon EBS for HDFS.
    • EMR File System (EMRFS)
      • EMRFS is an implementation of the Hadoop file system interface that lets clusters read and write data directly on Amazon S3 instead of on cluster-local HDFS.
      • EMRFS allows you to get the durability and low cost of Amazon S3 while preserving your data even if the cluster is shut down.
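
The launch options listed above (instance type, node count, Hadoop version, and additional applications) can be sketched as a request for the EMR RunJobFlow API. This is a minimal sketch, not a definitive setup: it assumes the boto3 SDK, the default IAM role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole`, and a release label (which determines the bundled Hadoop version). The cluster name, instance type, and release label in the usage line are illustrative values only.

```python
def build_cluster_request(name, instance_type, node_count, release_label, applications):
    """Assemble keyword arguments for the EMR RunJobFlow API call."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "Name": name,
        # The release label selects the Hadoop version shipped with the cluster.
        "ReleaseLabel": release_label,
        # Additional tools such as Hive, Pig, Spark, or Presto.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account uses the default EMR role names.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request to EMR and return the new cluster ID."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Illustrative values; nothing is sent to AWS until launch_cluster is called.
request = build_cluster_request("demo-cluster", "m5.xlarge", 3, "emr-6.15.0", ["Hive", "Spark"])
```

Keeping the request builder separate from the API call makes it possible to inspect or validate the options before any cluster is started.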

Because Amazon EMR runs Apache Hadoop, you can use the extensive ecosystem of tools that work on top of Hadoop, such as Hive, Pig, and Spark. Many of these tools are natively supported and can be included automatically when you launch your cluster, while others can be installed through bootstrap actions.
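
The split between natively supported tools and tools installed through bootstrap actions might be expressed as follows. This is a hedged sketch: the `Applications` and `BootstrapActions` structures follow the shape used by the EMR RunJobFlow API, but the helper name `plan_tools`, the default list of native tools, and the S3 script location are all hypothetical and chosen for illustration.

```python
def plan_tools(wanted,
               native=("Hive", "Pig", "Spark", "Presto"),
               script_prefix="s3://example-bucket/bootstrap"):
    """Split requested tools into native applications and bootstrap actions."""
    native_lookup = {tool.lower() for tool in native}
    applications, bootstrap_actions = [], []
    for tool in wanted:
        if tool.lower() in native_lookup:
            # Natively supported: included automatically at launch.
            applications.append({"Name": tool})
        else:
            # Not native: run an install script on each node at startup.
            # Assumption: a script named install-<tool>.sh exists at script_prefix.
            bootstrap_actions.append({
                "Name": f"Install {tool}",
                "ScriptBootstrapAction": {
                    "Path": f"{script_prefix}/install-{tool.lower()}.sh",
                    "Args": [],
                },
            })
    return {"Applications": applications, "BootstrapActions": bootstrap_actions}


plan = plan_tools(["Spark", "MyTool"])
```

The resulting dictionary can be merged into a cluster launch request, so both kinds of tools are set up in a single launch.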

AWS EMR: Use Cases

  • Log Processing: Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of unstructured or semistructured data into useful insights about their applications or users.
  • Clickstream Analysis: Amazon EMR can be used to analyze clickstream data in order to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.
  • Genomics and Life Sciences: Amazon EMR can be used to process vast amounts of genomic data and other large scientific datasets quickly and efficiently. Processes that require years of compute can be completed in a day when scaled across large clusters.
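
To make the log-processing use case concrete, here is a small map/reduce sketch that counts HTTP status codes in web server logs. It runs locally in plain Python and only illustrates the programming model that Hadoop distributes across a cluster; it is not how any particular EMR job is implemented. It assumes logs in Common Log Format, where the status code is the second-to-last field, and the sample lines are made up.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (status_code, 1) for one log line; skip malformed lines."""
    fields = line.split()
    # Assumption: Common Log Format, status code is the second-to-last field.
    if len(fields) >= 2 and fields[-2].isdigit():
        yield fields[-2], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_status_codes(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


sample_logs = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:55:40 -0700] "GET /missing HTTP/1.0" 404 512',
    '10.0.0.3 - - [10/Oct/2000:13:55:41 -0700] "GET /index.html HTTP/1.0" 200 2326',
    "malformed",
]
status_counts = count_status_codes(sample_logs)
```

On a real cluster the map step would run in parallel over many log files, and the framework would group the emitted pairs by key before the reduce step.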