AWS Data Pipeline

AWS Data Pipeline manages and streamlines data-driven workflows, including the scheduling of data movement and processing. The service is useful for customers who want to move data along a defined pipeline of sources, destinations, and data-processing activities.

  • Using a Data Pipeline template, an IT pro can access information from a data source, process it and then automatically transfer results to another system or service.
  • Access to Data Pipeline is available through the AWS Management Console, the command-line interface, or the service APIs.
  • With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR.

  • Everything in AWS Data Pipeline starts with the pipeline itself. A pipeline schedules and runs tasks according to the pipeline definition. Scheduling is flexible: tasks can run every 15 minutes, every day, every week, and so forth.

  • Three main components of AWS Data Pipeline work together to manage your data:

    • Pipeline definition specifies the business logic of your data management.
    • AWS Data Pipeline web service interprets the pipeline definition and assigns tasks to workers to move and transform data.
    • Task Runners poll the AWS Data Pipeline web service for tasks and then perform those tasks.

You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline.
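
The interaction between the three components can be illustrated with a small local model. This is only a sketch under stated assumptions: the class, function, and task names below are hypothetical stand-ins and are not part of the AWS Data Pipeline API. It shows a definition held as plain data, a service that hands out tasks, and a runner that polls for work and reports back.

```python
from collections import deque

# Hypothetical pipeline definition: the business logic expressed as plain data.
PIPELINE_DEFINITION = {
    "name": "toy-pipeline",
    "tasks": [
        {"id": "copy-logs", "action": "copy"},
        {"id": "build-report", "action": "report"},
    ],
}


class ToyPipelineService:
    """Stand-in for the web service: reads a definition and hands out tasks."""

    def __init__(self, definition):
        self._queue = deque(definition["tasks"])
        self.completed = []

    def poll_for_task(self):
        # Return the next pending task, or None when nothing is left.
        return self._queue.popleft() if self._queue else None

    def report_success(self, task_id):
        self.completed.append(task_id)


def run_task_runner(service, handlers):
    """Stand-in for a task runner: poll, perform, report, until no work remains."""
    while True:
        task = service.poll_for_task()
        if task is None:
            break
        handlers[task["action"]](task)
        service.report_success(task["id"])


performed = []
handlers = {
    "copy": lambda task: performed.append(f"copied for {task['id']}"),
    "report": lambda task: performed.append(f"reported for {task['id']}"),
}

service = ToyPipelineService(PIPELINE_DEFINITION)
run_task_runner(service, handlers)
print(service.completed)
```

A custom task runner would follow the same poll, perform, report loop against the real service instead of this in-memory queue.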

AWS Data Pipeline: Example

You can use AWS Data Pipeline to archive your web server’s logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon Elastic MapReduce (Amazon EMR) job flow over those logs to generate traffic reports.


In this example, AWS Data Pipeline would schedule the daily tasks to copy data and the weekly task to launch the Amazon EMR job flow. AWS Data Pipeline would also ensure that Amazon EMR waits for the final day’s data to be uploaded to Amazon S3 before it begins its analysis, even if there is an unforeseen delay in uploading the logs.
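
The "wait for the final day's data" rule can be sketched as a simple readiness check. The function names and dates below are illustrative assumptions; the real service tracks these dependencies itself, so this only models the logic.

```python
from datetime import date, timedelta


def days_in_week(week_start):
    """Return the seven dates covered by the week starting at week_start."""
    return [week_start + timedelta(days=offset) for offset in range(7)]


def weekly_job_ready(week_start, uploaded_days):
    """The weekly job may start only when every day's logs have been uploaded."""
    return all(day in uploaded_days for day in days_in_week(week_start))


week_start = date(2024, 1, 1)
uploaded = set(days_in_week(week_start)[:6])   # the final day's upload is delayed

print(weekly_job_ready(week_start, uploaded))  # False: still waiting on day 7

uploaded.add(week_start + timedelta(days=6))   # the late upload arrives
print(weekly_job_ready(week_start, uploaded))  # True: the report can run
```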

AWS Data Pipeline: Standard Activities

  • CopyActivity: This activity can copy data between Amazon S3 and JDBC data sources, or run a SQL query and copy its output into Amazon S3.
  • HiveActivity: This activity allows you to execute Hive queries easily.
  • EmrActivity: This activity allows you to run arbitrary Amazon EMR jobs.
  • ShellCommandActivity: This activity allows you to run arbitrary Linux shell commands or programs.
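
As a rough illustration of how an activity might appear in a pipeline definition, the sketch below builds a definition as a Python dictionary in the JSON "objects" layout and checks that every reference points at a defined object. The object ids, the command, the instance settings, and the unresolved_refs helper are all assumptions made for this example. It also omits role and logging settings, so treat it as a shape to compare against the official object reference rather than a deployable definition.

```python
import json

# Sketch of a pipeline definition; ids, command, and instance settings are
# illustrative assumptions, and required role/logging fields are omitted.
definition = {
    "objects": [
        {
            "id": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startDateTime": "2024-01-01T00:00:00",
        },
        {
            "id": "WorkerInstance",
            "type": "Ec2Resource",
            "schedule": {"ref": "DailySchedule"},
            "instanceType": "t1.micro",
            "terminateAfter": "1 Hour",
        },
        {
            "id": "ArchiveLogs",
            "type": "ShellCommandActivity",
            "schedule": {"ref": "DailySchedule"},
            "runsOn": {"ref": "WorkerInstance"},
            "command": "echo archiving logs",
        },
    ]
}


def unresolved_refs(defn):
    """Hypothetical helper: list (object id, field, target) for dangling refs."""
    ids = {obj["id"] for obj in defn["objects"]}
    missing = []
    for obj in defn["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append((obj["id"], field, value["ref"]))
    return missing


print(json.dumps(definition, indent=2))
print(unresolved_refs(definition))
```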

AWS Data Pipeline: Standard Preconditions

  • DynamoDBDataExists: This precondition checks for the existence of data inside a DynamoDB table.
  • DynamoDBTableExists: This precondition checks for the existence of a DynamoDB table.
  • S3KeyExists: This precondition checks for the existence of a specific Amazon S3 path.
  • S3PrefixExists: This precondition checks that at least one file exists within a specific path.
  • ShellCommandPrecondition: This precondition runs an arbitrary script on your resources and checks that the script succeeds.
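
The difference between the S3KeyExists and S3PrefixExists checks can be modelled locally with a set of key strings. This is a sketch of the semantics described above, not the service's implementation; the function names and keys are hypothetical.

```python
def key_exists(keys, key):
    """Models S3KeyExists: true only when this exact key is present."""
    return key in keys


def prefix_exists(keys, prefix):
    """Models S3PrefixExists: true when at least one key starts with the prefix."""
    return any(k.startswith(prefix) for k in keys)


# Hypothetical bucket contents, represented as a set of object keys.
bucket_keys = {
    "logs/2024-01-01/access.log",
    "logs/2024-01-02/access.log",
}

print(key_exists(bucket_keys, "logs/2024-01-01/access.log"))  # True
print(key_exists(bucket_keys, "logs/2024-01-01/"))            # False
print(prefix_exists(bucket_keys, "logs/2024-01-01/"))         # True
print(prefix_exists(bucket_keys, "logs/2024-01-03/"))         # False
```

An activity guarded by either check would simply not start until the corresponding function returns True.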

AWS Data Pipeline: Lab

AWS Data Pipeline: Quiz

  • An international company has deployed a multi-tier web application that relies on DynamoDB in a single region. For regulatory reasons, they need disaster recovery capability in a separate region with a Recovery Time Objective of 2 hours and a Recovery Point Objective of 24 hours. They should synchronize their data on a regular basis and be able to provision the web application rapidly using CloudFormation. The objective is to minimize changes to the existing web application, control the throughput of DynamoDB used for the synchronization of data, and synchronize only the modified elements. Which design would you choose to meet these requirements?
    • Use AWS Data Pipeline to schedule a DynamoDB cross-region copy once a day. Create a ‘Lastupdated’ attribute in your DynamoDB table that would represent the timestamp of the last update and use it as a filter.
    • Use EMR and write a custom script to retrieve data from DynamoDB in the current region using a SCAN operation and push it to DynamoDB in the second region. (No scheduling and no throughput control)
    • Use AWS Data Pipeline to schedule an export of the DynamoDB table to S3 in the current region once a day, then schedule another task immediately after it that will import data from S3 to DynamoDB in the other region. (With AWS Data Pipeline, the data can be copied directly to the other DynamoDB table)
    • Send each item into an SQS queue in the second region; use an auto-scaling group behind the SQS queue to replay the write in the second region. (Replaying the write is not automated)
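
The idea of filtering on a ‘Lastupdated’ timestamp so that only modified items are synchronized can be sketched locally as follows. The item layout, the modified_since helper, and the dates are assumptions made for illustration; a real copy would read from and write to DynamoDB tables.

```python
from datetime import datetime, timedelta


def modified_since(items, cutoff):
    """Keep only items whose Lastupdated timestamp is newer than the cutoff."""
    return [item for item in items if item["Lastupdated"] > cutoff]


# Hypothetical table contents at the time of the daily run.
now = datetime(2024, 1, 2, 0, 0)
items = [
    {"id": "a", "Lastupdated": datetime(2023, 12, 31, 8, 0)},
    {"id": "b", "Lastupdated": datetime(2024, 1, 1, 9, 30)},
]

# A daily run only needs items changed in the last 24 hours.
changed = modified_since(items, now - timedelta(hours=24))
print([item["id"] for item in changed])  # ['b']
```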