Another Octopress blog about programming and infrastructure.

Deploying From CodePipeline to OpsWorks Using a Custom Action and Lambda

Update: CodePipeline now has built-in support for both Lambda and OpsWorks. I’d now recommend using the built-in functionality rather than the method described in this post.

Original post:

AWS recently released CodePipeline after announcing it last year. After doing the setup walkthrough I was surprised to see only the following deployment options!

I’m sure other integrations are in the works, but fortunately CodePipeline supports custom actions, allowing you to build your own integrations in the meantime.

If you were hoping to see this:

Then read on!

I’ve implemented a custom action to deploy to OpsWorks using Lambda; you can find the full source of the Lambda function on GitHub. It leverages the fact that CodePipeline uses S3 to store artifacts between stages to trigger the Lambda function, and uses SNS retry behaviour for polling.

This blog post explains how to configure your pipeline and the Lambda function to deploy to OpsWorks using CodePipeline.
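The core of the approach can be sketched as a Lambda handler that claims a job queued for the custom action and kicks off an OpsWorks deployment. This is a minimal, illustrative sketch using boto3 rather than the actual function from the GitHub repository; the action type IDs, stack ID, and app ID are placeholders.

```python
def extract_artifact_location(job):
    """Return the (bucket, key) of a CodePipeline job's first input artifact."""
    loc = job["data"]["inputArtifacts"][0]["location"]["s3Location"]
    return loc["bucketName"], loc["objectKey"]

def handler(event, context):
    import boto3  # provided in the Lambda runtime
    cp = boto3.client("codepipeline")
    ow = boto3.client("opsworks", region_name="us-east-1")
    # Claim any jobs queued against our custom action type (IDs are placeholders).
    jobs = cp.poll_for_jobs(
        actionTypeId={"category": "Deploy", "owner": "Custom",
                      "provider": "OpsWorks", "version": "1"},
        maxBatchSize=1)["jobs"]
    for job in jobs:
        cp.acknowledge_job(jobId=job["id"], nonce=job["nonce"])
        bucket, key = extract_artifact_location(job)
        # Start the OpsWorks deployment of the app built from the artifact.
        ow.create_deployment(StackId="YOUR_STACK_ID", AppId="YOUR_APP_ID",
                             Command={"Name": "deploy"})
        cp.put_job_success_result(jobId=job["id"])
```

In the real setup the S3 artifact upload (via SNS) is what invokes this handler, giving the polling loop its retry behaviour.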

Analysing DynamoDB Index Usage in Hive Queries

Elastic MapReduce allows you to conveniently run SQL-like queries against DynamoDB using Hive. This overcomes many of the limitations of the built-in DynamoDB query functionality and makes DynamoDB significantly more useful for storing raw analytical data.

While the abstraction provided by the DynamoDB storage handler is pretty good, it is still subject to the same underlying throughput and indexing limitations faced when accessing data through the DynamoDB API directly. In particular, access efficiency is extremely sensitive to the use of appropriate indexes: full table scans are both slow and expensive.

The documentation provides some guidance on performance optimisation; however, it does not explain how the handler maps a Hive query to a DynamoDB scan or query, nor under what circumstances indexes will be used to avoid scanning the entire table.

In this blog post you’ll find several Hive queries run against an example DynamoDB table, along with the resulting DynamoDB request to observe which indexes are used.
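To set the scene, an external Hive table backed by the storage handler looks something like the following; the table and attribute names here are invented for illustration.

```sql
CREATE EXTERNAL TABLE events (user_id string, ts bigint, payload string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Events",
  "dynamodb.column.mapping" = "user_id:UserId,ts:Ts,payload:Payload"
);

-- Whether a key predicate like this becomes a targeted DynamoDB query
-- or a full table scan is exactly what the experiments below examine.
SELECT payload FROM events WHERE user_id = 'user-123';
```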

Building ZeroC Ice 3.5 Projects With Gradle and IntelliJ

ZeroC Ice is a distributed computing platform supporting many languages, including Java. Building an Ice project requires compiling “slice” data structure definitions into a compatible Java interface. The usual recommendation is to use the Eclipse plugin, but I prefer IntelliJ and a build tool that is IDE agnostic. Official Gradle support is coming in Ice 3.6, but that’s still in beta. Fortunately it’s quite easy to invoke the slice2java tool from Gradle, and by extension to develop Ice 3.5 projects with Gradle and IntelliJ.
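A minimal build.gradle sketch of that approach might look like this; the slice file layout and output directory are assumptions, and slice2java is assumed to be on the PATH.

```groovy
// Compile .ice slice definitions to Java before the main Java compile.
task compileSlice(type: Exec) {
    def sliceDir = file('src/main/slice')
    def outDir   = file("$buildDir/generated-src/slice")
    inputs.dir sliceDir
    outputs.dir outDir
    doFirst { outDir.mkdirs() }
    commandLine(['slice2java', '--output-dir', outDir.path] +
                fileTree(dir: sliceDir, include: '*.ice').files*.path)
}

// Let the generated sources take part in the normal Java compilation.
sourceSets.main.java.srcDir "$buildDir/generated-src/slice"
compileJava.dependsOn compileSlice
```

Declaring the inputs and outputs means Gradle will skip the task when no slice files have changed.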

Part 2: Exporting and Analysing CloudWatch Logs With Data Pipeline and EMR

If you followed Part 1 you’ll now have your CloudWatch logs sitting conveniently in S3 to be analysed. You could now download them and search each file individually using grep or a similar tool, but it would be much nicer to be able to search by field and construct complex queries with multiple conditions.

Thankfully you have Elastic MapReduce (EMR) at your disposal, which can help you analyse your logs straight from S3 using a nice UI (Hue) and an SQL-like query language you’re already familiar with (Hive). EMR is typically employed to process terabytes of data, but it works well on relatively small datasets too and will easily scale up if you happen to have a huge amount of logs to process. Running an on-demand EMR cluster for 6 hours costs less than $2.

This blog post will cover setting up an EMR cluster, logging into Hue, then using Hive to format and query the Apache HTTP access logs exported from CloudWatch in Part 1.
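As a preview, the Hive side largely amounts to an external table over the S3 logs using a regex SerDe; the bucket path is made up, and the regex shown is for the common log format rather than necessarily matching your exported logs exactly.

```sql
CREATE EXTERNAL TABLE access_logs (
  host string, identity string, remote_user string, request_time string,
  request string, status string, bytes_sent string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*)"
)
LOCATION 's3://your-bucket/exported-logs/';

-- Count responses by status code across every exported log file.
SELECT status, count(*) FROM access_logs GROUP BY status;
```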

Part 1: Exporting and Analysing CloudWatch Logs With Data Pipeline and EMR

You’ve just discovered one of your instances has been hacked! A new instance is being launched to replace it, but you have no idea how the attacker got access in the first place, and you need to stop it happening again. The clues are hidden somewhere in your HTTP access logs, which are conveniently sitting in CloudWatch Logs. Unfortunately accessing and analysing those logs from CloudWatch isn’t as simple as you thought: the only refinement available is by ingestion time, and there’s no way you can trawl through days of logs by hand. You’ll need to analyse the logs externally, but that’s a challenge too: there’s no automated export to S3, and the GetLogEvents API action is limited to pages of 1MB and 10 requests per second. Once you get the logs out you have to figure out how to analyse them; what you’re looking for is too complex for simple text searches, and loading tens of GB of logs into an RDBMS would be tedious.

Fortunately you found this blog post! Elastic MapReduce (EMR) allows you to quickly and conveniently create a Hadoop cluster running Hive. It might seem like overkill to use Hadoop for a once-off job over just a few GB of logs, but Hive provides a convenient SQL-like interface and works perfectly well at small scale. Plus, considering you pay by the hour, the cost is almost negligible.

The only question is how to get your logs out of CloudWatch and into S3 for EMR to process, so I recently wrote a small tool called cwlogs-s3 to help. Part 1 of this blog post will cover how to export your logs to S3 using cwlogs-s3 and Data Pipeline, then Part 2 will cover how to analyse those logs with Hive on EMR.
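To give a flavour of the API limitation, paginating GetLogEvents by hand looks roughly like this. This is an illustrative boto3 sketch, not the cwlogs-s3 implementation; the end of a stream is signalled by nextForwardToken repeating.

```python
def export_stream(logs_client, group, stream):
    """Yield every message from a CloudWatch Logs stream, following
    nextForwardToken until it stops changing (the documented end signal)."""
    token = None
    while True:
        kwargs = {"logGroupName": group, "logStreamName": stream,
                  "startFromHead": True}
        if token is not None:
            kwargs["nextToken"] = token
        resp = logs_client.get_log_events(**kwargs)
        for event in resp["events"]:
            yield event["message"]
        if resp["nextForwardToken"] == token:
            break
        token = resp["nextForwardToken"]
```

In practice you would pass `boto3.client("logs")` and write the messages out to S3 in batches, staying under the 10 requests per second limit.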

Automated HAProxy Failover on OpsWorks

Without a doubt, ELB is the simplest load balancing solution on AWS; however, it may not be suitable for all users given it doesn’t support features such as a static IP. Fortunately OpsWorks makes it only marginally more complicated to set up HAProxy as an alternative.

The AWS ecosystem encourages you to implement redundancy across availability zones and to avoid a single point of failure (SPOF). HAProxy gives you many additional features over ELB, however it is difficult to achieve the cross-zone redundancy and automated failover that ELB supports natively. DNS round-robin can help balance load across multiple HAProxy instances to achieve scalability, however it does not provide high availability.

This blog post will demonstrate how to implement automated failover using a self-monitoring pair of HAProxy instances in an active/standby configuration. When a failure is detected the healthy standby will automatically take control of the elastic IP (EIP) assigned to the pair and ensure the service can continue to function. A notification will also be triggered via SNS to alert you that a failover has taken place.
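The failover step itself boils down to two API calls, sketched here with boto3; the allocation ID, instance ID, and topic ARN are placeholders.

```python
def take_over_eip(ec2, sns, allocation_id, instance_id, topic_arn):
    """Claim the shared EIP for this instance and raise an SNS notification."""
    # AllowReassociation lets the healthy standby steal the address from
    # the failed active instance without detaching it first.
    ec2.associate_address(AllocationId=allocation_id,
                          InstanceId=instance_id,
                          AllowReassociation=True)
    sns.publish(TopicArn=topic_arn,
                Message="HAProxy failover: EIP moved to " + instance_id)
```

The monitoring side (each instance health-checking its peer and deciding when to call this) is what the post walks through in detail.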

Revisited: Retrieving Files From S3 Using Chef on OpsWorks

One of my earliest and most popular posts is Retrieving Files From S3 Using Chef on OpsWorks. That post uses the Opscode AWS cookbook, which in turn uses the right_aws gem. While this method is fine - particularly if you’re not using OpsWorks - there are some situations where it’s not ideal.

Recently I’ve started directly using the aws-sdk gem that is bundled with the OpsWorks agent. The bundled version at the time of writing is 1.53.0.

The advantages of this are:

  • Support for IAM instance roles, meaning you don’t have to pass AWS credentials via your custom JSON.
  • No dependencies on external cookbooks.
  • Will ordinarily be run at the compile stage, so you could download a JSON file, parse it, then use it to generate resources if you wanted.

The disadvantages are:

  • It’s not entirely clear, but my feeling is that the gems included with the OpsWorks agent aren’t necessarily part of the API “contract” provided by OpsWorks for cookbook developers. Therefore there is no guarantee that AWS won’t change the version or even remove it entirely without notice. I think it’s unlikely that they’ll remove the aws-sdk gem or move to a version with compatibility breaking changes any time soon, but it’s possible.
  • It’s a less “chef-like” solution, although you could write your own Chef resource to wrap it.
  • If you’re not using OpsWorks then the aws-sdk gem will create another dependency.

This blog post provides an example of how to use the bundled aws-sdk gem to download a file from S3 using IAM instance roles on OpsWorks.
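The shape of the approach is roughly the following; this is a hedged sketch rather than the example from the post, and the bucket, key, and destination path are made up for illustration.

```ruby
# Download a file from S3 using the aws-sdk v1 gem bundled with the
# OpsWorks agent, relying on the IAM instance role for credentials.
def fetch_from_s3(bucket, key, destination)
  require 'aws-sdk'  # the bundled v1.x gem; no Gemfile entry needed
  s3 = AWS::S3.new   # no explicit credentials: the instance role is used
  File.open(destination, 'wb') do |file|
    file.write(s3.buckets[bucket].objects[key].read)
  end
  destination
end
```

Called as, say, `fetch_from_s3('my-config-bucket', 'app/config.json', '/tmp/config.json')` inside a recipe, this runs at compile time, so the downloaded file can drive later resource definitions.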

My 2015 AWS Wish List

As a new year dawns, it occurred to me how much of the AWS functionality I now use heavily wasn’t available only a year ago. Almost every day I check the AWS blog to find some new feature is available. This got me thinking about the functionality I’d like to see in 2015, so I put together a list of my top 5.

I’m sure the engineers at AWS are already working on some (if not most) of these, but if not then hopefully someone sees this post and gets a great idea!

Monitoring Per Application Metrics With CloudWatch Logs and OpsWorks

CloudWatch Logs is a cheap and easy-to-set-up centralised logging solution. At the moment it lacks several valuable features, such as a convenient way to search logs, but it does an excellent job of providing graphing and alerting on aggregated metrics pulled from ingested log data. An obvious application for this is to monitor HTTP server statistics to provide graphs of overall request rates, response sizes, and error rates.

OpsWorks makes it easy to orchestrate a fleet of EC2 instances serving multiple applications (as opposed to Elastic Beanstalk, which only hosts a single application). Apache is the default HTTP server for most OpsWorks layer types.

This post demonstrates how to set up CloudWatch Logs for Apache access logs on OpsWorks, then create custom CloudWatch metrics for an individual OpsWorks application to graph the HTTP request rate.
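The metric extraction step boils down to a single put-metric-filter call along these lines; the log group, filter, and namespace names here are invented, and the space-delimited pattern names one field per column of the common log format.

```shell
aws logs put-metric-filter \
  --log-group-name my-app-access-log \
  --filter-name RequestCount \
  --filter-pattern '[ip, identity, user, timestamp, request, status, bytes]' \
  --metric-transformations \
      metricName=RequestCount,metricNamespace=MyApp,metricValue=1
```

Once the filter is in place, every matching log line increments the metric, which you can then graph or alarm on like any other CloudWatch metric.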

How Far Can You Go With HAProxy and a t2.micro?

Load balancing is critical to any scalable and highly available cloud application. The obvious choice for load balancing on AWS is ELB, but unfortunately if you require features such as a static IP or URL-based request routing then ELB isn’t an option.

HAProxy is a great solution that performs extremely well even on small EC2 instance types. It is also a supported layer type in OpsWorks which makes it the obvious choice for OpsWorks users.

It’s well known that a single small HAProxy server can front several large application servers, but what counts as small? How about the smallest EC2 instance on offer - the t2.micro? This blog post puts HAProxy on a t2.micro to the test using loader.io to determine just how many requests per second it can handle and whether CPU or network is the limiting factor.