07 Mar 2017

Mp4s To Gif From Twitter

Visit the tweet's link and press Cmd-Option-J in Chrome to open the DevTools console.

Execute this and capture the result: $('#playerContainer').data().config.video_url

Take that link to https://cloudconvert.com/, open the “Select Files” dropdown, and enter the URL.

Or, use a commandline tool:

Install pre-requisites

pip install cloudconvert requests

Download script: https://gist.github.com/c83c3e91ee3f9df21686bb50b4fbf904

Make it executable: chmod +x twitter-gif

Run it: twitter-gif TWEET-LINK outputfilename-optional

12 Feb 2017

Solving Infinite Loop In NPM With Dtruss

Last week one of the engineering juniors that I mentor ran into a strange environmental issue.

When he ran npm run karma, it would run for ~8 minutes and then suddenly spit out an out-of-memory error. He tried debugging it himself for a while and then reached out to me for help.

We ran through the normal set of troubleshooting steps:

  • Verify NPM and Node are on versions appropriately matched to production. (They were newer so we re-installed the ones used in prod)
  • rm -rf node_modules/ followed by npm install. (This semi-frequently resolves issues when old dependencies are not cleared out)

And when we tried running the offending command again, we suffered the error once more.

Which was when I reached into my bag of tricks and thought back on articles by @b0rk and @brendangregg. I remembered tutorials about using DTrace to track down system calls from particular process identifiers. And I remembered a similar tool called DTruss that can attach to a PID and show the system calls it makes. For more info on DTruss, check it out here: http://www.brendangregg.com/DTrace/dtruss or read the script itself with vim $(which dtruss).

So I explained the barebones that I knew about how DTruss operates and we fired up dtruss npm run karma.

We had time to talk a bit about system calls and the meaning of the readout. After 2 minutes we noticed that the log continued to fly by but the same folder was being accessed. Over and over and over. We had a recursive dependency caused by an outdated library that was stored inside the project tree.

Thanks to DTruss, we realized the issue, wiped out the offending folder and tried again with success!
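Spotting the repeated folder by eye worked for us, but the same check can be scripted. Here is a minimal Python sketch, assuming the trace was saved to a text file and that each line looks roughly like open("/some/path\0", 0x0, 0x0) = 3 0. The function name, the regex, and the sample paths are all made up for illustration.

```python
import re
from collections import Counter
from os.path import dirname

# Finds the first quoted argument of a syscall, e.g. open("/some/path\0", ...)
PATH_ARG = re.compile(r'\w+\("([^"]*)"')


def hot_directories(trace_lines, top=3):
    """Count how often each directory shows up as a syscall path argument."""
    counts = Counter()
    for line in trace_lines:
        match = PATH_ARG.search(line)
        if not match:
            continue  # syscall without a string argument, e.g. close(0x5)
        # The trace prints a literal "\0" terminator inside strings; drop it
        path = match.group(1).replace("\\0", "")
        counts[dirname(path)] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    sample = [
        'stat64("/proj/vendor/old/a.js\\0", 0x7FFF, 0x0)\t\t = 0 0',
        'open("/proj/vendor/old/b.js\\0", 0x0, 0x0)\t\t = 5 0',
        'open("/proj/src/app.js\\0", 0x0, 0x0)\t\t = 6 0',
    ]
    for directory, hits in hot_directories(sample):
        print(hits, directory)
```

A directory whose count keeps climbing while the others stay flat is a good candidate for the kind of loop we hit.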

PS - While writing this article I learned that Brendan Gregg wrote DTruss. Many thanks both for DTruss and for writing articles about how to use these tools! I also owe a thanks to Julia Evans who exposed me to these tools through her blogging and Zines :).

01 Feb 2017

Thoughts On Gitlab Data Incident

Background

On Feb 1st, Gitlab suffered an irrecoverable loss of 6 hours of data.

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

(In case that link goes stale, here’s a copy: https://gist.github.com/8b9449ec4260583d0e644c7cdc94f3be)

My first thought is that it’s a horrible experience both for the users who lost data and for the engineers involved in the process at Gitlab. The feelings of anger, self doubt and frustration are hard to bear. I wish them all the best in recovering and getting back to work. My heart goes out to them for this experience.

After being floored by the possibility of permanent data loss, my thoughts went next to consider how their experience could inform my team’s decisions with regard to our own processes.

None of this is intended as backseat driving of the situation Gitlab suffered. It is intended as constructive discussion of how systems fail to discourage human error, to which we are all susceptible.

Summary of Events

The tl;dr: a PG replica got behind. Engineer1 started debugging after their shift was over. They believed they were SSH’d into the replica, but were really SSH’d into the primary. Engineer1 then tried to run a command to start replication, had trouble with it, and assumed they needed to fully wipe the data directory where Postgres stores its databases. They ran a variant of “rm -rf” against the 300GB of data. Engineer1 realized the mistake and stopped the deletion when only a few gigabytes remained. The data was unrecoverable from the data directory. At this point Engineer1, already heavily fatigued, handed off the baton.

Their 5 backup systems all failed them. Their latest mostly complete backup was 6 hrs out of sync. Their webhooks data is lost or 24 hrs out of sync.

Repeating that… all 5 backups failed! That is a very very worst case.

That said, their data from 24 hrs ago seemed like a valid backup and their backup from 6 hours before was valid. That means backup systems 6 and 7 were working decently.

Ways to Limit Risk in Future

My takeaways from their incident:

  • Check that your backup system works the way you think it does. Ideally this means regular automated and manual drills in which backups are loaded into a system and verified.
  • Use the buddy system when doing potentially dangerous things on production.
    • This would lessen the likelihood of executing commands while SSH’d into the wrong box.
    • Talk through actions before doing them on production. Have a teammate confirm each step.
  • Take an airline-pilot checklist approach to these situations to fend off some of the avoidable mistakes.
  • Do not make big decisions under a time crunch. The engineer was trying to leave at the end of their shift with a hard stop. They were rushed and stressed. Letting replication lag for much longer and handing off to another person could have averted the much worse disaster that followed. Twelve hours of partially degraded service might be a worthwhile trade against a complete loss of 6 hrs of data.
  • Tiredness leads to mistakes. Tap out and hand off the baton.
  • Take a manual backup before operating like this on production systems. A 5-minute operation that streams a pg_dump export to AWS S3 would narrow the window from a 6 hr loss to minutes or zero (assuming the app was in full maintenance mode during database replication). I take advantage of this technique before doing potentially destructive database actions. Create a full db snapshot if it’s a db-level change, or a table-level snapshot if the change is limited to a single table. Commit your action, validate findings, and then wipe out the snapshots if space is precious.
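Two of these takeaways are easy to put into a script: refuse to run on the wrong box, and stream a snapshot before touching anything. Below is a minimal Python sketch under stated assumptions. The function names, the host names, the bucket, and the pre-change/ key layout are hypothetical; the only real tools it relies on are pg_dump with --format=custom and aws s3 cp reading from stdin via "-".

```python
import shlex
import socket
from datetime import datetime, timezone


def assert_expected_host(expected, actual):
    """Refuse to continue when we are not on the box we think we are on."""
    if expected != actual:
        raise RuntimeError(f"expected host {expected!r} but running on {actual!r}")


def snapshot_pipeline(database, bucket, now=None):
    """Build a shell pipeline that streams a pg_dump to S3 without a local file."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = f"s3://{bucket}/pre-change/{database}-{stamp}.dump"  # hypothetical layout
    # --format=custom writes an archive that pg_restore can read back;
    # "aws s3 cp - <target>" uploads whatever arrives on stdin
    return (
        f"pg_dump --format=custom {shlex.quote(database)}"
        f" | aws s3 cp - {shlex.quote(target)}"
    )


if __name__ == "__main__":
    # "db-replica-1" is a placeholder for the host you intend to be on
    assert_expected_host("db-replica-1", socket.gethostname())
    print(snapshot_pipeline("app_db", "example-backups"))
```

The sketch only prints the pipeline rather than running it, so a buddy can read the exact command back before anyone presses enter.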

Conclusions

Humans make mistakes when working with complicated systems. Well-designed systems and policies put safeguards in place to reduce the likelihood of irrecoverable and disastrous events.

I anticipate that the engineering team is working on a clear blameless post-mortem to bring closure to this event. If you’re unfamiliar with blameless post-mortems, check out this article by John Allspaw: https://codeascraft.com/2012/05/22/blameless-postmortems/. During the post-mortem they’ll identify the actions taken and circumstances of the incident along with systems and protocols that can be improved to make these circumstances less likely to recur.

PS - I went and checked our various backups for production systems after this event. The hourly, daily, weekly, monthly backups are in good order for Mongo, Postgres and Redis. The automated backups of Redshift look good, as do the manual checkpoints from before major changes. The S3 copies that are permanently stored for varying durations for Mongo are in good shape as well. The realtime replication of Mongo to Postgres is in good shape and has preserved us from data loss when an incident occurred. I’ll be ever nervous about data loss, but I think we’re in generally good shape.

31 Jan 2017

Implementing Bayeux Client In Golang

Announcing a golang client for Bayeux (long polling): https://github.com/zph/bayeux.

I recently needed to integrate Salesforce data into a production system, which gave me the opportunity to implement a client for the Bayeux protocol based on the Salesforce docs, undocumented features gleaned from Stack Overflow, a rough Python implementation from a GitHub Gist, and the Faye Ruby gem.

The protocol enables a client to subscribe to realtime updates based on a predetermined query written in Salesforce’s SQL-like query language (SOQL).

For the small number of realtime queries supported by Salesforce API, this works wonderfully.
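For readers who have not seen Bayeux on the wire, here is a rough Python sketch of the three JSON messages a long-polling client sends: handshake, subscribe, connect. The field names follow the Bayeux meta channels and the /topic/ prefix follows Salesforce's PushTopic convention. The function names and the sample client id are invented for illustration, and this is not a description of how zph/bayeux is implemented internally.

```python
import json


def handshake_message():
    """First message: negotiate protocol version and transport."""
    return {
        "channel": "/meta/handshake",
        "version": "1.0",
        "supportedConnectionTypes": ["long-polling"],
    }


def subscribe_message(client_id, topic):
    """Ask the server to deliver events for one topic to this client."""
    return {
        "channel": "/meta/subscribe",
        "clientId": client_id,
        "subscription": f"/topic/{topic}",
    }


def connect_message(client_id):
    """Sent repeatedly: each request is held open until events arrive."""
    return {
        "channel": "/meta/connect",
        "clientId": client_id,
        "connectionType": "long-polling",
    }


if __name__ == "__main__":
    # In a real exchange the clientId comes back in the handshake response
    for message in (
        handshake_message(),
        subscribe_message("hypothetical-client-id", "topicName"),
        connect_message("hypothetical-client-id"),
    ):
        print(json.dumps([message]))
```

The client loops on the connect message; each response either carries events for the subscribed channel or simply times out and gets re-sent.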

Usage example:

package main

import (
	"fmt"

	bay "github.com/zph/bayeux"
)

func main() {
	b := bay.Bayeux{}
	creds := bay.GetSalesforceCredentials()
	c := b.TopicToChannel(creds, "topicName")
	for {
		select {
		case e := <-c:
			// Trailing newline keeps successive events on separate lines
			fmt.Printf("TriggerEvent Received: %+v\n", e)
		}
	}
}

Check out the library here: https://github.com/zph/bayeux

12 Sep 2016

Good Dull Best Practices in Operations

Tonight I read this paper: pdf or archive.

And I was impressed by the boring and solid guidelines therein.

My takeaways were:

  • Immutable/recreatable servers and infrastructure. Poignant b/c of an EC2 hardware failure today and needing to recover by snapshotting the root drive and reattaching to new server.
  • Instrument all the things.
  • Spread testing across unit, integration and multi-service tests
  • Gradual deployment of new services/updates. As a colleague and friend would say, “bake it in production for a little while”.
  • “Proven technology is almost always better than operating on the bleeding edge. Stable software is better than an early copy, no matter how valuable the new feature seems.” Bleeding edge is named that way for a reason. I find this tension between proven technology and bleeding edge to be on my mind lately.

07 Sep 2016

Using Clojure on AWS Lambda

AWS Lambda is great for ad hoc services without needing to manage additional infrastructure. I’ve used it on a couple tasks for syncing S3 buckets.

The workflow goes like this:

  • Register a lambda function
  • Set up appropriate role and ARN permissions
  • Set up a trigger, i.e. a circumstance that should invoke this function
  • Build code to respond to the trigger
  • Upload, debug, etc

So this weekend I built an AWS Lambda in python to transform some textfiles that were stored in EDN format into JSON and then partition them according to one key. EDN is a json-ish format from the Clojure world (https://en.wikipedia.org/wiki/Extensible_Data_Notation). These EDN files were on S3 and gzip compressed.

I built the lambda in python, used boto3, and edn_format for freeing the data from EDN. I packaged those dependencies up into a zipfile and shipped it to staging environment.
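The partitioning step itself is small. Here is a minimal sketch of it, assuming the EDN has already been parsed into a list of Python dicts and that each partition is written out as newline-delimited JSON. The function name, the "account" key, and the output layout are illustrative assumptions, not the code I shipped.

```python
import json
from collections import defaultdict


def partition_by_key(records, key):
    """Group records by one key and render each group as JSON lines."""
    groups = defaultdict(list)
    for record in records:
        # sort_keys keeps the output stable between runs
        groups[record[key]].append(json.dumps(record, sort_keys=True))
    return {value: "\n".join(lines) for value, lines in groups.items()}


if __name__ == "__main__":
    parsed = [
        {"account": "a", "n": 1},
        {"account": "b", "n": 2},
        {"account": "a", "n": 3},
    ]
    for value, body in partition_by_key(parsed, "account").items():
        print(f"--- partition {value}")
        print(body)
```

Each value in the returned dict could then be uploaded as its own S3 object.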

It worked marvelously on files that were up to 1MB in size. Then larger files started timing out… because AWS Lambda has an upper time limit of 300 seconds per execution. I found the culprit files, mostly ~ 7MB of gzipped EDN, tried them locally, performance profiled it, and realized the issue was in deserializing EDN data in Python. Woops! As you might expect, EDN libraries are few and far between compared to JSON. And they tend to be less robust and don’t delegate to C extensions.

Now Clojure is the logical choice for this EDN -> JSON partitioning task. But AWS only officially supports Java, Python and Node.js.

But Clojure is really just Java under the hood… so I found an article with the basic guidelines and set to work. (Article: https://aws.amazon.com/blogs/compute/clojure/).

The trick to using Clojure is to expose a handler method with the signature AWS Lambda expects, plus a few project.clj configurations.

project.clj - Note the uberjar profile with :aot :all and the aws lambda clojar. Include [com.amazonaws/aws-lambda-java-core "1.0.0"] as a dependency and set :profiles {:uberjar {:aot :all}}

Then to help with the aws-lambda protocol, I followed instructions from the original article, along with a secondary source of information from @kobmic on Github. I’m particularly happy with their implementation of the deflambda macro, copied to here:

;; convenience macro for generating gen-class and handleRequest
(defmacro deflambda [name args & body]
  (let [class-name (->> (clojure.string/split (str name) #"-")
                     (mapcat clojure.string/capitalize)
                     (apply str))
        fn-name (symbol (str "handle-" name "-event"))]
    `(do (gen-class
           :name ~(symbol class-name)
           :prefix ~(symbol (str class-name "-"))
           :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler])

         (defn ~(symbol (str class-name "-handleRequest")) [this# is# os# context#]
           (let [~fn-name (fn ~args ~@body)
                 w# (io/writer os#)]
             (-> (json/read (io/reader is#) :key-fn keyword)
               (~fn-name)
               (json/write w#))
             (.flush w#))))))

Used like

(deflambda s3-split [event]
  (example.core/handler event))

And in the AWS Lambda dashboard, the handler name is S3Split::handleRequest.
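If the jump from s3-split to S3Split looks like magic, it is just the string work the macro does on the name: split on hyphens, capitalize each part, join. Here is the same transformation as a small Python sketch. The function names are mine and exist only to illustrate the naming rule.

```python
def lambda_class_name(name):
    """Mirror the macro's naming: split on '-', capitalize each part, join."""
    return "".join(part.capitalize() for part in name.split("-"))


def handler_name(name):
    """The handler string AWS Lambda expects: <class>::<method>."""
    return f"{lambda_class_name(name)}::handleRequest"


if __name__ == "__main__":
    print(handler_name("s3-split"))
```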

So where the Python version of this code was timing out at 300 seconds without completing the task, my clojure lambda burns through it in 20-70 seconds and has been working well.

Additional Code for Deploying/Updating/Building

Create lambda function

#!/usr/bin/env bash

aws lambda create-function --function-name example-lambda --handler S3Split::handleRequest --runtime java8 --memory-size 512 --timeout 120 --role arn:aws:iam::<ID>:role/example-role-lambda --zip-file fileb://./target/example-0.1.0-SNAPSHOT-standalone.jar

Update lambda function

#!/usr/bin/env bash

aws lambda update-function-code \
  --function-name example-lambda \
  --zip-file fileb://./target/example-1.0.0-SNAPSHOT-standalone.jar

Build

lein uberjar

24 Jul 2016

Reflections on Migrating Redis and PG

I had the task of migrating three production databases with minimal downtime. Here are the takeaways.

Moving Redis with persistent data

Redis needed to move off a couple of providers and onto another provider. This needed to happen inside a 30 min maintenance window for one application (which performs critical writes), but losing some low-value writes elsewhere was an acceptable tradeoff for having 0 downtime of other services.

One db was easily imported using the DB host’s Import tool. Another db was not able to use that mechanism and was transferred with redis-transfer. I enjoyed extending the tool to make it work well for this purpose.

Postgres

Simplest of all, it was a matter of generating a Heroku backup, downloading it from its public URL, and importing it into the other db.

#!/usr/bin/env bash

# References https://devcenter.heroku.com/articles/heroku-postgres-import-export
#
# Requires heroku commandline tool.
# The following ENV are required
# HEROKU_API_KEY=

# The following envs are required for the destination DB and are automatically
# used by PG.
# PGPASSWORD=
# PGUSER=
# PGHOST=
# PGPORT=

# Set this for simpler scripting
# PGDATABASE=

# Install heroku toolkit: curl https://zph.xargs.io/heroku-toolkit-install.sh | bash
# sudo apt-get install postgresql
OUTPUT_FILE="latest.dump"
APP_NAME="$HEROKU_APP"
heroku="$HEROKU_BIN"
# Capture a fresh backup, download it, then restore it into the destination DB
"$heroku" pg:backups -a "$APP_NAME" capture && \
  curl -o "$OUTPUT_FILE" "$("$heroku" pg:backups -a "$APP_NAME" public-url)" && \
  pg_restore --verbose --clean --no-acl --no-owner -d "$PGDATABASE" "$OUTPUT_FILE"

The Day Of

I ran through all the steps, outlined them, then set up working scripts for each portion of the process. Those were then arranged as commands in a command-station style tmux session.

Each Tmux tab was a phase of the process: maintenance_mode:on, redis_migrations, maintenance_mode:off, pg_migrations, logging

Each tab held the commands I would need to run, one per section of the window:

|-----------------|------------------|
| redis1_migration| redis_migration2 |
|-----------------|------------------|
| point to new r1 | point to new r2  |
|-----------------|------------------|

Performing the Migration

  • Notified stakeholders in advance
  • Prepared steps, conducted trials against staging
  • Set up migration scripts
  • Walked through the checklist 15 min before the window
  • Set one heroku app to maintenance mode
  • Import 2 redis dbs
    • Verify result
    • Run script to point to those new endpoints
  • Maintenance mode off
  • PG migrate
    • Verify results
    • Run script to point to new endpoints

Conclusion

Glad redis-transfer was available to help with a recalcitrant server. And I’m glad to be preparing postgres for more active duty in our stack.

My takeaway from accomplishing this migration was that careful planning leads to quick and uneventful maintenance windows. Also, I’d rather migrate pg than redis.

And have a migration buddy :). Makes it far more enjoyable and extra hands in case things go wrong.

24 Jul 2016

Added Shortlinks To Hugo Blog

I got a bee in my bonnet today about adding unobtrusive Twitter share links to this blog.

It involved the following steps:

  • Finding out how to do it without using Twitter’s SDK on page
  • Wiring that into a Hugo template
  • Adding fragment to share links
  • Adding mechanism for shortlinks on blog

Twitter Shares without their SDK

I prefer not to include Third Party JS on pages for security and purity reasons.

I searched around on NPM and found something simple that reflected this attitude: SocialMediaLinks and then built off of there for just the functionality I needed.

Wiring that into Hugo

I embed a few data attributes on .twitter-share using a Hugo partial.

<a href="#"
   target="_blank"
   class="twitter-share in-headline"
   data-url="{{.Permalink}}"
   data-via="_ZPH"
   data-title="{{.Title}}"
   {{ if .IsPage }}
     data-aliases="{{ .Aliases | jsonify }}"
   {{ end }}
   ><i class="fa fa-2x fa-twitter"></i></a>

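When the page loads, the link’s href is filled in using this fn:

document.addEventListener("DOMContentLoaded", function() {
  _.each(document.querySelectorAll('.twitter-share'), function(el) {
    const { via, title, aliases } = el.dataset
    // Default to the full permalink; prefer the shortest alias when one exists
    let url = el.dataset.url
    try {
      // Sort by the 'length' property so the shortest alias comes first
      const shortest = _.sortBy(JSON.parse(aliases), 'length')[0]
      if (shortest) {
        url = shortest
      }
    } catch (e) {
      // data-aliases is missing or not valid JSON: keep the permalink
    }
    const href = SocialMediaLinks.create({account: 'twitter', url: url, title: title, via: via})
    el.href = href
  })
});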

Parsing/Stringifying Urls

This is my happiest implementation of URL parsing so far in JavaScript. The concept comes from https://gist.github.com/jlong/2428561, adapted to suit ES6. The clever trick is getting the browser to do the parsing by handing the URL to an a element.

import * as _ from 'lodash'

export default class Link {
  constructor(u) {
    this.url = this.parseURL(u);
  }

  parseURL(url) {
    // Credit: https://www.abeautifulsite.net/parsing-urls-in-javascript
    // And Originally: https://gist.github.com/jlong/2428561
    var parser = document.createElement('a')
    // Let the browser do the work
    parser.href = url;
    // Available on parser:
    //   protocol
    //   host
    //   hostname
    //   port
    //   pathname
    //   search aka queryParams
    //   hash
    return parser;
  }

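  getQueryParams() {
    const kvs = this.url.search.replace(/^\?/, '').split('&');
    return _.reduce(kvs, function(acc, kv) {
      // Destructure so both the key and the value are assigned
      const [k, v] = kv.split('=');
      if (!_.isEmpty(k)) {
        acc[k] = v
      }
      // Always return the accumulator so reduce keeps building the object
      return acc
    }, {})
  }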

  setQueryParam(k, obj) {
    const qp = this.getQueryParams()
    qp[k] = obj;
    // Keep Parser in Sync so we can use href
    this.url.search = this.queryParamsToString(qp)
    return qp
  }

  emptyOr(v, ifEmpty, notEmpty) {
    if (_.isEmpty(v)) {
      return ifEmpty
    } else {
      return notEmpty
    }
  }

  queryParamsToString(qp) {
    return _.map(qp, function(v, k) {
      return [k, v].join("=")
    }).join("&")
  }

  toString() {
    return this.url.href;
  }
}

The SocialMediaLinks.create() function appends a query param that’s a hashed value so that retweets and content pathways can be tracked for analytics.

When building the Twitter link, we check for a shortcode in the Aliases portion of the page metadata and fall back to the full link. Because the shortcodes live in the aliases frontmatter, Hugo autogenerates a redirect page for each of these entries.
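To make the tracking idea concrete, here is a hedged Python sketch of appending a hashed query param to a URL. What actually gets hashed and what the param is called are not spelled out above, so the "ref" name, the source-plus-path input, and the 8 character length are all assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_tracking_param(url, source, param="ref"):
    """Return url with an extra query param holding a short hash."""
    parts = urlsplit(url)
    # Keep existing params (and their order) intact
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Assumed hash input: where the share came from plus the page path
    digest = hashlib.md5(f"{source}:{parts.path}".encode("utf-8")).hexdigest()
    query.append((param, digest[:8]))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_tracking_param("https://example.com/post/?a=1#top", "twitter"))
```

Because the token is derived from its inputs, the same share source and page always produce the same link, which is what makes the pathways countable later.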

The redirects work by generating an html document at that alias location like so (from the Hugo docs):

<html>
	<head>
		<link rel="canonical" href="http://mysite.tld/posts/my-original-url"/>
		<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
		<meta http-equiv="refresh" content="0;url=http://mysite.tld/posts/my-original-url"/>
	</head>
</html>
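The redirect page above is simple enough to template by hand. As an illustration only (Hugo generates these itself, and this is not Hugo's code), here is a Python sketch that renders the same document for a given target URL; the function name is made up.

```python
from html import escape


def alias_page(target_url):
    """Render a meta-refresh redirect page pointing at target_url."""
    href = escape(target_url, quote=True)  # keep & and quotes attribute-safe
    return "\n".join([
        "<html>",
        "\t<head>",
        f'\t\t<link rel="canonical" href="{href}"/>',
        '\t\t<meta http-equiv="content-type" content="text/html; charset=utf-8"/>',
        f'\t\t<meta http-equiv="refresh" content="0;url={href}"/>',
        "\t</head>",
        "</html>",
    ])


if __name__ == "__main__":
    print(alias_page("http://mysite.tld/posts/my-original-url"))
```

The content="0;url=..." value tells the browser to navigate immediately, and the canonical link points crawlers at the original URL.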

And Finally

My post-new script for creating new posts on blog has a function in it to take the filename of the post, md5 hash it, and take the first 6 chars. That value’s inserted into the page frontmatter.
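The shortcode step is a one-liner in most languages. A minimal Python sketch of it follows; the function name and the sample filename are invented, and my post-new script may differ in its details.

```python
import hashlib


def short_code(filename, length=6):
    """md5 the post's filename and keep the first few hex characters."""
    return hashlib.md5(filename.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    # Hypothetical filename; the printed value would go into the aliases frontmatter
    print(short_code("2016-07-24-added-shortlinks-to-hugo-blog.md"))
```

Six hex characters give about 16.7 million possible codes, which is plenty for a personal blog, though collisions are possible in principle.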

Try it out ;-) aliased link

Full code

24 Jul 2016

Using Hugo Static Site Generator

I reworked this blog to use Hugo static site generator because my Octopress site was a bit long in the tooth.

It’s now using the following:

The tooling for compiling and releasing is here:

23 Jul 2016

On Being a 10x Engineer

Wise words: