Writings

Technology, open source, personal essays, and everything that isn't climate.

OpenStack as Layers

Last week at LinuxCon I gave a presentation on DevStack, which gave me the proper excuse to turn an idea Dean Troyer floated a year ago about OpenStack layers into pictures (I highly recommend reading that for background; I won't re-justify every part of it here). This abstraction is something that has actually served us well as we think about projects coming into DevStack.

[Figure: the OpenStack layers diagram]

Some assumptions are made here about which services are essential as we build up the model.

Layer 1: Base Compute Infrastructure

We assume that compute infrastructure is the common starting point for the minimum functional OpenStack that people deploy. The output of the last OpenStack User Survey shows that the top 3 deployed services, regardless of type of cloud (Dev/Test, POC, or Production), are Nova / Glance / Keystone, so I don't think this is a huge stretch. There are definitely users that take other slices (like Swift only), but compute seems to be what the majority of people coming to OpenStack are focused on. Basic compute needs 3 services to get running: Nova, Glance, and Keystone. That gives you a stateless compute cloud, which is a starting point for many people getting into the space for the first time.
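For concreteness, here's a sketch of what a minimal Layer 1 DevStack config might look like. The service abbreviations (key, g-api, n-cpu, ...) are DevStack's shorthand for the Keystone, Glance, and Nova pieces, but the exact list varies by release, so treat this as illustrative rather than canonical:

# local.conf sketch: just enough for a stateless compute cloud
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
# just Keystone, Glance, and Nova, plus the database and queue they need
ENABLED_SERVICES=key,g-api,g-reg,n-api,n-cpu,n-net,n-cond,n-sch,mysql,rabbit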

Layer 2: Extended Infrastructure

Once you have a basic bit of compute infrastructure in place, there are some quite common features that you really need before you can do more interesting work. These are basically enhancements to the storage, networking, or compute aspects of OpenStack. Looking at the User Survey, these are all deployed by people, in various ways, at a pretty high rate. This is also the first place we see new projects integrating into OpenStack: Ironic extends the compute infrastructure to baremetal, and Designate adds a missing piece of the networking side with DNS management as part of your compute creation. Hopefully nothing all that controversial here.

Layer 3: Optional Enhancements

Now we get a set of currently integrated services that integrate both northbound and southbound. Horizon integrates on the northbound APIs of all the services, so it requires services further down in the layers (today it also integrates with pieces further up that are integrated). Ceilometer consumes southbound parts of OpenStack (notifications) and polls northbound interfaces. From the user survey, Horizon is deployed a ton; Ceilometer, not nearly as much. Part of this is due to how long things have been integrated, but even if you do analysis like taking the Cinder / Neutron numbers and deleting all the Folsom deploys (the first release where those projects were integrated), you still see a picture where Ceilometer is behind on adoption. Recent mailing list discussions have hinted at why, including some of the scaling issues and a number of alternative communities in this space. Let's punt on Barbican because, honestly, it's new since we came up with this map, and maybe it's really a layer 2 service.

Layer 4: Consumption Services

I actually don't like this name, but I failed to come up with something better. Layer 4 in Dean's post was "Turtles all the way down", which isn't great at describing things either. This is a set of services that consume other OpenStack services to create new services. Trove is the canonical example: create database-as-a-service by orchestrating Nova compute instances with MySQL installed in them. The rest of the layer 4 services fit the same pattern, even Heat. Heat really is about taking the rest of the components in OpenStack and building a super API for their creation; it also includes auto-scaling functionality based on this. All of the integrated layer 4 services need a guest agent to do a piece of their function, which means that when testing them in OpenStack we don't get very far with the CirrOS minimal guest we use for layer 3 and below. Again, as we look at the user survey, deployment of all of these layer 4 services is lighter still, and this is what you'd expect as you go up the layers: these are all useful services to a set of users, but they aren't all useful to all users. I'd argue that the confusion around Marconi's place in the OpenStack ecosystem comes from the fact that by analogy it looks and feels like a layer 4 service like Trove (where a starting point would be allocating compute instances), but it is implemented like a layer 2 one (a straight-up raw service expected to be deployed on bare metal out of band). And yet it's not consumable as the queue service for the other layer 1 & 2 services.

Leaky Taxonomy

This is not the be-all and end-all way to look at OpenStack. However, this layered view of the world confuses people a lot less than the normal view we show them -- the giant spider diagram (aka the mandatory architecture slide for all OpenStack presentations):

[Figure: the OpenStack spider diagram]

This picture is in every deep dive on OpenStack, and scares the crap out of people who think they might want to deploy it. There is no starting point and no end point. How do you bite that off in a manageable chunk as the spider grows? I had one person come up to me after my DevStack talk to give a big thank you. He'd previously seen a presentation on CloudStack and OpenStack, and OpenStack's complexity from the outside so confused him that he'd run away from our community. Explaining this with the layer framing, and showing how you could experiment with it quickly with DevStack, cleared away a ton of confusion and fear. And he's going to go dive in now.

Tents and Ecosystems

Today the OpenStack Technical Committee is in charge of deciding the size of the "tent" that is OpenStack. The approach to date has been a big tent philosophy, where anything that's related and has a REST API is free to apply to the TC for incubation. But a big tent is often detrimental to the ecosystem. A new project's first goal often seems to be becoming incubated, to get the gold star of TC legitimacy that they believe is required to build a successful project. But as we've seen recently, a TC star doesn't guarantee success, and honestly, the constraints on being inside the tent are actually pretty high.

And then there is a language question, because OpenStack's stance on everything being in Python is pretty clear. An ecosystem that only exists to spawn incubated projects, with incubated projects only being allowed to be in Python, basically means an ecosystem devoid of interesting software in other languages. That's a situation I don't think any of us want.

So what if OpenStack were a smaller tent, and not all the layers that are in OpenStack today were part of the integrated release in the future? Projects could be considered a success based on their users and usage out in the ecosystem, and not on whether they have a TC gold star. Stackforge wouldn't carry the stigma of "not cool enough"; it would be the normal place to exist as part of the OpenStack ecosystem. Mesos is an interesting cloud community that functions like that today: a small core framework and a big ecosystem. The smaller core actually helps grow the ecosystem by not making ecosystem projects second-class citizens.

I think that everyone who works on OpenStack itself, and on all the ecosystem projects, wants this whole thing to be successful. We want a future with interoperable, stable, open source cloud fabric as a given. There are lots of thoughts on how we get there, and no one has ever created a universal open source cloud fabric that lets users move freely between providers, public and private, so it's no surprise that as a community we haven't figured everything out yet. But here's another idea for the pool, under the assumption that we are all smarter together, with all the ideas on the table, than any of us are on our own.

Splitting up Git Commits

Human review of code takes a bunch of time. It takes even longer if the proposed code has a bunch of unrelated things going on in it. A very common piece of review commentary is "this is unrelated, please put it in a different patch". You may be thinking to yourself "gah, so much work", but it turns out git has built-in tools to do this. Let me introduce you to git add -p.

Let's look at this Grenade review - https://review.openstack.org/#/c/109122/1. This was the result of a day's worth of hacking to get some things in order. Joe correctly pointed out there was at least 1 unrelated change in that patch (I think he was being nice; there were probably at least 4 things going on that should have been separate). Those things are:

  • The quiesce time for shutdown, which actually fixes bug 1285323 all on its own.
  • The reordering of the directory creation so it works on a system without /opt/stack
  • The conditional upgrade function
  • The removal of the stop short circuits (which probably shouldn't have been done)

So how do I turn this 1 patch, which is at the bottom of a patch series, into 3 patches, plus drop out the bit that I did wrong? (A condensed sketch of the whole command flow follows at the end of this post.)

Step 1: rebase -i master

Start by running git rebase -i master on your tree to get into interactive rebase mode. In this case I want to mark the first commit for editing so I can split it out.

Step 2: reset the changes

git reset ##### will unstage all the changes back to the referenced commit, so I'll be working from a blank slate to add the changes back in. In this case I need to figure out the last commit before the one I want to change, and do a git reset to that hash.

Step 3: commit in whole files

Unrelated change #1 was fully isolated in a whole file (stop-base), so that's easy enough: a git add stop-base followed by git commit builds a new commit with those changes. When splitting commits, always do the easiest stuff first to get it out of the way of the tricky things later.

Step 4: git add -p

In this change grenade.sh needs to be split up all by itself, so I ran git add -p to start the interactive git add process. You are presented with a series of patch hunks and a prompt about what to do with each one: y = yes, add it; n = no, don't; and lots of other options for being trickier.

In my particular case the first hunk was actually 2 different pieces of function, so y/n wasn't going to cut it. In that case I can type 'e' (edit), and I'm dropped into my editor staring at the patch, which I can interactively modify to be the patch I want. I can then delete the pieces I don't want in this commit. Those deleted pieces still exist in the uncommitted work, so I'm not losing anything; I'm just not dealing with it yet.

Ok, that looks like just the part I want, as I'll come back to the upgrade_service function in patch #3. So save it, then find all the other hunks in the file that are related to that change and add them to this patch as well. Yes to both of those, as well as one other towards the end, and this commit is ready to be 'git commit'ed.

Now what's left is basically just the upgrade_service function changes, which means I can git add grenade.sh as a whole. I actually decided to fix up the stop calls before doing that, just by editing grenade.sh before adding the final changes. After it's done, git rebase --continue rebases the rest of the changes on top of this, giving me a shiny new 5 patch series that's a lot clearer than the 3 patch one I had before.

Step 5: Don't forget the idempotent ID

One last important thing: this was a patch in gerrit before, which means that when I started, I had an idempotent ID on every change. In splitting 1 change into 3, I added that id back to patch #3 so that reviewers would understand it was an update to something they had reviewed before.

It's almost magic

As a git user, git add -p is one of those things, like git rebase -i, that you really need in your toolkit to work with anything more than trivial patches. It takes practice to build the right intuition here, but once you do, you can slice up patches in a way that's much easier for reviewers to work with, even if that wasn't how the code was written the first time. Code that is easier to review wins you lots of points, and will help with landing your patches in OpenStack faster. So taking the time upfront to get used to this is well worth it.
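For reference, here's the condensed sketch of the whole flow promised above. The file names are from this particular Grenade review, and the exact commands are illustrative:

git rebase -i master     # mark the bottom commit as 'edit'
git reset HEAD^          # unstage everything back to the parent commit
git add stop-base        # change #1 was isolated in a whole file
git commit
git add -p               # pick out just the hunks for change #2 (y/n/e)
git commit
git add grenade.sh       # what's left is change #3 (upgrade_service)
git commit               # re-add the original Change-Id here
git rebase --continue    # replay the rest of the series on top

The subtle step is the reset: interactive rebase stops with the target commit applied, so resetting to its parent dumps all of its changes back into the working tree as uncommitted work.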

OpenStack Failures

Last week we had the bulk of the brain power of the OpenStack QA and Infra teams all in one room, which gave us a great opportunity to spend a bunch of time diving deep into the current state of the Gate, figuring out what's going on, and how we might make things better. Over the course of 45 minutes we came up with this picture of the world.

We have a system that's designed to merge good code and keep bugs out. The problem is that while it's doing a great job of keeping big bugs out, subtle bugs, ones that are low percentage (like showing up in only 1% of test runs), can slip through. These bugs don't go away; they instead just build up inside of OpenStack. As OpenStack expands in scope and function, these bugs increase as well. They might grow or shrink based on seemingly unrelated changes, dependency changes (which we don't gate on), or timing impacts from anything in the underlying OS.

As OpenStack has grown, no one has a full view of the system any more, so even identifying whether a bug might be related to their patch is something most developers can't do. The focus of an individual developer is typically just wanting to land their code, not diving into the system as a whole. This might be because they are on a schedule, or just that landing code feels more fun and productive than digging into existing bugs.

From a social aspect, we seem to have found that there is some threshold failure rate in the gate that we always return to. Everyone ignores base races until we get to that failure rate, and once we're above it for long periods of time, everyone assumes fixing it is someone else's responsibility. We had an interesting experiment recently where we dropped 300 Tempest tests by turning off Nova v3 by default, which gave us a short-term failure drop, but within a couple of months we were back up to our unpleasant failure rate in the gate.

Part of the visibility question is also that most developers in OpenStack don't actually understand how the CI system works today, so when it fails, they feel powerless. It's just a big black box blocking their code, and they don't know why. That's incredibly demotivating.

Towards Solutions

Every time the gate fail rates get high, debates show up in IRC channels and on the mailing list with ideas to fix it. Many of these ideas are actually features that were added to the system years ago. Some are ideas that are provably wrong, like auto-recheck, which would just increase the rate of bug accumulation in the OpenStack code base. A lot of good ideas were brought up in the room; over the next week Jim Blair and I are going to try to turn these into something a little more coherent to bring to the community.

The OpenStack CI system tries to be the living and evolving embodiment of community values at any point in time. One of the important things to remember is that those values aren't fixed points either. The gate doesn't exist to serve itself; it exists because before OpenStack had one, back in the Diablo days, OpenStack simply did not work. HP Cloud had 1000 patches to Diablo to be able to put it into production, and took 2 years to migrate from it to another version of OpenStack.

Helpful Gerrit Queries (Gerrit 2.8 edition)

Gerrit got a very nice upgrade recently which brings in a whole host of new features that are really interesting. Here are some of the things you should know to make use of them. You might want to read up on the basics of gerrit searches in Gerrit queries to avoid review overload before getting started.

Labels

Gone are the days of -CodeReview-1; we now have a more generic mechanism called labels. Labels are a lot more powerful because they can specify ranges as well as specific users! For instance, to select everything without negative code reviews:

status:open NOT label:Code-Review<=-1

Because we now have operators, we can select for a range of values, so any negative vote (-1, -2, or any larger negative value should one be implemented in the future) matches. Negation is done with the 'NOT' keyword, and notably, CodeReview becomes label:Code-Review in the new system. Labels exist for all three columns: Verified is what the CI bots vote in, and Workflow is a combination of the Work in Progress (Workflow=-1) and Approved (Workflow=1) states that we used to have.

Labels with Users

Labels get really powerful when you start adding users to them. Now that we have a ton of CI bots voting, with regular issues in their systems, you might want to filter to changes that Jenkins currently has a positive vote on.

status:open label:Verified>=1,jenkins

This means that changes which do not yet have a Jenkins +1 or +2 won't be shown in your list, hiding patches which are currently blocked by Jenkins or which it hasn't reported on yet. If you also want to see not-yet-voted changes, you could change that to >=0.

Labels with Self

This is where it gets really fun. There is a special user, self, which means your logged-in id.

status:open NOT label:Code-Review>=0,self label:Verified>=1,jenkins NOT label:Code-Review<=-1

This is a list of all changes that you have not yet commented on, that don't have negative code reviews, and that have passing Jenkins results. That means this query becomes a todo list: as you comment on changes, positive, negative, or otherwise, they drop out of this query. If you also drop all the work in progress patches:

status:open NOT label:Code-Review>=0,self label:Verified>=1,jenkins NOT label:Code-Review<=-1 
  NOT label:Workflow<=-1

then I consider this a basic "Inbox zero" review query. You can apply this to specific projects with "project:openstack/nova", for instance. Out of this basic chunk I've built a bunch of local links to work through reviews.
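If you want to build those links yourself, here's a sketch of one way to turn a query into a bookmarkable URL. It assumes Gerrit 2.8's /#/q/ URL format, and the urlencode step is just one way to do it:

QUERY='status:open project:openstack/nova NOT label:Code-Review>=0,self label:Verified>=1,jenkins NOT label:Code-Review<=-1 NOT label:Workflow<=-1'
ENCODED=$(python -c 'import urllib, sys; print urllib.quote(sys.argv[1])' "$QUERY")
echo "https://review.openstack.org/#/q/${ENCODED},n,z"

One bookmark per project gives you a set of one-click review inboxes.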

File Matching

With this version of gerrit we get a thing called secondary indexes, which means we get some more interesting searching capabilities -- basically a search engine for certain other types of queries. This includes matching changes against file paths.

status:open file:^.*/db/.*/versions/.* project:^openstack.*

is a query that shows all the outstanding changes in OpenStack that touch a database migration. It's currently showing glance, heat, nova, neutron, trove, and storyboard changes. Very helpful if, as a reviewer, you want to keep an eye on a cross section of changes regardless of project.

Learning more

There are also plenty of other new parts of this query language. You can learn all the details in the gerrit documentation. We're also going to work on making some of these "inbox zero" queries available in the gerrit review system as custom dashboards, making it easy to use them on any project in the system without building local bookmarks to queries. Happy reviewing!

Bash trick of the week - call stacks

For someone that used to be very vocal about hating shell scripting, I seem to be building more and more tools related to it every day. The latest is caller (from "man bash"):

caller [expr]
    Returns the context of any active subroutine call (a shell function or a script executed with the . or source builtins). Without expr, caller displays the line number and source filename of the current subroutine call. If a non-negative integer is supplied as expr, caller displays the line number, subroutine name, and source file corresponding to that position in the current execution call stack. This extra information may be used, for example, to print a stack trace. The current frame is frame 0. The return value is 0 unless the shell is not executing a subroutine call or expr does not correspond to a valid position in the call stack.
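To see what that means in practice, here's a minimal sketch (the function and file names are made up for illustration):

# stack-demo.sh: walk the whole call stack with caller
function print_stack {
    local frame=0
    # caller returns non-zero once we walk off the top of the stack
    while caller $frame; do
        ((frame++))
    done
}
function inner { print_stack; }
function outer { inner; }
outer

Each line of output is one frame -- line number, subroutine name, and source file -- from the innermost call outward.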

This means that if your bash code makes heavy use of functions, you can get the call stack back out. This turns out to be really handy for things like writing testing scripts. I recently added some more unit testing to devstack-gate, and used this to make it easy to see what was going on:

# Utility function for tests
function assert_list_equal {
    local source=$(echo $1 | awk 'BEGIN{RS=",";} {print $1}' | sort -V | xargs echo)
    local target=$(echo $2 | awk 'BEGIN{RS=",";} {print $1}' | sort -V | xargs echo)
    if [[ "$target" != "$source" ]]; then
        echo -n `caller 0 | awk '{print $2}'`
        echo -e " - ERRORn $target n != $source"
        ERRORS=1
    else
    # simple backtrace progress detector
        echo -n `caller 0 | awk '{print $2}'`
        echo " - ok"
    fi
}

The output ends up looking like this:

ribos:~/code/openstack/devstack-gate(master)> ./test-features.sh 
test_full_master - ok
test_full_feature_ec - ok
test_neutron_master - ok
test_heat_slow_master - ok
test_grenade_new_master - ok
test_grenade_old_master - ok
test_full_havana - ok

I never thought I'd know this much bash, and I still think data structure manipulation in bash is craziness, but for imperative programming that's largely a lot of command calls, this works pretty well.

We live on fragile hopes and dreams

OpenSSL isn't formally verified!? No, neither is any part of your browser, your kernel, your hardware, the image rendering libraries that your browser uses, the web servers you talk to, or basically any other part of the system you use. The closest to formally verified in your day-to-day life that you're going to get may well be the brakes on your car, or the control systems on a jet engine. I shit you not. We live on fragile hopes and dreams. via My Heart Bleeds for OpenSSL | Coder in a World of Code.

A lot of the internet is learning a lot more about how software in the wild functions after Heartbleed. I found that statement to be one of the best summaries.

Why you should be reviewing more OpenStack code

“Read, read, read. Read everything—trash, classics, good and bad, and see how they do it. Just like a carpenter who works as an apprentice and studies the master. Read! You’ll absorb it.” – William Faulkner

Icehouse 3 is upon us, and as someone on a bunch of core review teams, that means a steady drumbeat of everyone asking how they get core reviewers to review their code. My standard response has been "make sure you are also reviewing code". Why?

Understanding implicit style

While most projects use the hacking program to check for trivial style issues (wrong formatting), there are a lot of other parts of style that exist inside a project. What does a good function look like? When does a project handle exceptions vs. doing checks up front? What does spacing inside functions look like? What "feels" like Nova code, and what feels foreign and odd? This is like when you are invited to a party at someone's house for the first time. You walk in the door, and the first thing you do is look to the host and the guests and figure out whether people are wearing shoes in the house or not, and follow suit if there looks like there is a pattern. It's about being polite and adapting to the local environment. Unless you read other people's code, you'll never understand these things. There are lots of patches I look at briefly, realize that they are in some wacky style so foreign to the code at hand that I don't have the energy to figure out what the author means, and move on.

Taking load off review teams

As a core reviewer, I currently have about 800 patches that I could +2 or -2. Given the rate of code coming in, that might as well be infinite, and it grows all the time. By reviewing code, even when you don't have approval authority, you'll be helping the review teams weed out patches which aren't in any way ready to move forward. That's a huge time savings, and one that I appreciate. Even something as simple as making sure authors provide good commit messages is huge, because I'll completely skip over reviews with commit messages that I can't understand. That's your opportunity to sell me on why I should spend the next 30 minutes looking at your code. A good commit message, being really clear about what problem this code hits, what the solution is, and why this implementation is the right approach, will make me dive in. A terrible or unclear commit message will probably just make me ignore the code, because if the author didn't care enough to explain that to me in the commit message, there are probably lots of issues in the code itself. Even if you and I had a conversation about this code last week, don't assume I remember all of that. That was probably 50 code reviews ago for me, which means the context of that conversation has long since flushed from my brain. If you review a bunch of code, you'll understand how these things impact your ability to review code, and you'll naturally adapt how you write commits (including the message) to make the life of a reviewer easier.

Seeing the bigger picture

People tend to start contributing in just one corner of OpenStack, but OpenStack is a big, interconnected project. What you do in one corner can affect the rest of the project. If you aren't reviewing code and changes happening at other layers of the project, it's really hard to know how your piece fits into the larger picture. Changes can look fine in the small but have a negative impact on the wider project. If you are proactive in reviewing code more broadly, you can see some of that coming, and you won't be surprised when a core reviewer -2s you because you were going in a different direction than the rest of the project.
Becoming a better programmer

When I started on the OpenStack project 2 years ago, I hadn't done python in a real way for years; my python was very rusty. One of the first things I did was start reviewing a bunch of code, especially by some of the top people in the project. There are some really smart and really skilled people in the OpenStack project: people who have been part of the python community for 15+ years, people who live and breathe python. Just reading their code makes you realize some of what can be done with the language, and what the "pythonic" way of doing things is. Nothing is better training for becoming a better python developer than learning from these folks. Sometimes you'll find a real issue, because no one is perfect. Sometimes you'll find something you don't understand and can leave a comment as a question, which you'll probably get an answer to. But all of it will be learning, and you will become a better developer.

It does make a difference

I'll be 100% honest: with 800+ reviews I should be looking at, I play favorites. People that I see contributing a lot on the review side (not just volume, but real quality reviews that save me time) are people whose code I want to review, because they are contributing to the whole of the project, not just their little corner. So that's why you should review more code in OpenStack. It will really contribute to the project, make you a better developer, and through all of this you'll find your code naturally aligning better with OpenStack and getting reviewed more often. Realize this is not an overnight fix, but a long-term strategy for aligning with the community and becoming part of it.

Thinkpad X1 Carbon - awesome Linux laptop

Note: this is in reference to the Generation 1 X1 Carbon which was available in 2012/2013. The new Gen 2 X1 Carbon has enough different hardware that this may not apply.

The Thinkpad X1 Carbon is Lenovo's stab at the ultrabook market. Made of carbon fiber (hence the name), it's very light. The last 6 months with my Samsung chromebook have made me appreciate lightness when it comes to laptops.

I'd been lusting after one of these for the last year, and I decided with the new job it was time to treat myself to one as my personal laptop.

A great Linux machine

First off, this is about the best experience I've had with a Linux machine. Everything works and is completely rock solid. It showed up right before Christmas; I did an Ubuntu install with full disk encryption, then went about restoring my 50G home directory onto it (which takes a few hours even over ethernet). The hardware is an i5 with Intel graphics. Over the years I had gotten used to nvidia graphics, which are fast, but fail to suspend on about day 4. That was typically fine, because they were work machines and Lotus Notes would crash Unity around day 3 anyway, so a restart was in order. But with Intel graphics this has been rock solid. I'm on my 3rd boot since I got it (I did actually decide to take a kernel update the other day), with me suspending / resuming on average 6 times a day. Never an issue. Oh right, this is how a laptop is supposed to work. :) Everything I've tried so far has worked fine. DisplayPort is fine; the fingerprint reader has a pam module, which I used for about a week, then found it was requiring a few more swipes than I liked, so I uninstalled it.

Battery

Battery life is consistently 5 - 6 hours, so my charger stays in my office; the laptop rarely does. What's even better is that it's a new kind of battery tech which does a fast charge to 80+% in < 45 minutes. So when I'm actually down to less than an hour of battery I'll take it up to the office and call it a break (or jump on my desktop).

Keyboard

The X1 Carbon has the new style Thinkpad keyboard, which you'll also find in things like the T430. While it isn't the old reliable Thinkpad keyboard, I'm actually very happy with it. It has a slightly different feel, but you get used to it over time. It has nearly the same throw as the old Thinkpad keyboards -- not quite the same, but close. I find the feel of the individual keys is actually nicer than the old Thinkpad keyboards; the surface just feels nice. It's still the generation that has real mouse buttons, which are now a thing of the past, and a contributing reason for getting this versus a newer Thinkpad. Realize, I'm about as invested in Thinkpad keyboards as anyone. My desktop keyboard is the USB Thinkpad keyboard, and I just ordered 2 more of them as backups given that it's a discontinued item.

Screen

1600x900 at 14" is respectable. Importantly, it's a matte screen, which means it's usable around bright lights. It's not a great screen, especially compared to what I'm using on my desktop, but it's a comfortable one to work on. There are versions with touch screens, which would add weight and gloss, neither of which I was interested in.

Slightly older hardware

The X1 Carbon came out about a year ago, so it's an i5 2-core processor. Ram is 8G max (it's soldered on, so you want to get the max). The SSD maxes out at 256G. It's field replaceable, but not in a standard package, so max that out as well.

In an ideal world

This would have a better screen, and I could get it with a speed and feed bump in the underlying hardware. That being said, I've got a nice new Haswell desktop with a ton of memory and SSD. This laptop is a joy to use, so I'm ok with slightly less speed on it. I did build a powerful workstation 6 months ago for a reason.

And then Lenovo went all crazy pants

At CES they announced a new X1 Carbon. Faster processor, better screen... and a completely scrambled keyboard.
No more function keys; instead a capacitive "touch region". Caps lock removed and turned into a split Home / End key. The tilde key moved over between the right Alt & Ctrl keys. It also removes the mouse buttons, which makes the touchpad impossible to disable. Complete crazy pants.

Which blows my mind. When Lenovo got the Thinkpad franchise, they got a keyboard design which was loved by millions. There were reasons why they needed to touch the keyboard once, because the old one wouldn't fit in an ultrabook -- it takes up too much depth. However, the level of scrambling they are doing to it now is just out of control. It makes me sad.

But back to Linux...

As a Linux laptop, this is a joy. This generation of X1 Carbon is going to disappear soon with the big Windows 8 push on the new version. So if you were ever thinking about it, now is the time to act.

IPython Notebook Experiments

A week of vacation at home means some organizing, physical and logical, some spending time with friends, and some just letting your brain wander on things it wants to. One of the problems that I've been scratching my head over is having a sane basis for doing data analysis on elastic recheck, our tooling for automatically categorizing races in the OpenStack code base. Right before I went on vacation I discovered Pandas, the python data analysis library.

The online docs are quite good. However, on something this dense, having a great O'Reilly book is even better. It has a bunch of really good examples working with public data sets, like census naming data. It also has very detailed background on the iPython notebook framework, which is used for the whole book, and which is frankly quite amazing. It brought back the best of my physics days using Mathematica.

With the notebook server, iPython isn't just a better interactive python shell. It's also a powerful webui, including python autocomplete. There is even good emacs integration, which includes support for the inline graphing toolkit. Anything that's created in a cell will be available to future cells, and cells are individually executable. Looking at the example above, I'm setting up the basic json return from elastic search, which I only need to do once after starting the notebook.

Pandas is all about data series. It's really a mere-mortals interface on top of numpy, with a bunch of statistics and timeseries convenience functions added in. You'll find yourself doing data transforms a lot in it. Like my college physics professors used to say: all problems are trivial in the right coordinate space. Getting there is the hard part.

With the elastic search data, a bit of massaging is needed to get a list of dictionaries that is easily convertible into a Pandas data set. In order to do interesting time series things I also needed to create a new column that was a datetime conversion of @timestamp, and pivot it out into an index.

You also get a good glimpse of the output facilities. By default the last line of an In[] block is output to the screen. There is a nice convenience method called head() to give you a summary view (useful for sanity checking). Also, this data actually has about 20 columns, so before sanity checking I sliced it down to the 2 relevant ones just to make the output easier to grok.

It took a couple of days to get this far. Again, this is about data transforms, and figuring out how to get from point a to point z. That might include building an index, doing a transform on it (to reduce the resolution to day level), then resetting the index, building some computed columns, rolling everything back up in groupby clauses to compute the total number of successes and runs for each job on a certain day, and building another computed column from those. Here I'm also slicing out only the jobs that didn't have a 100% success rate.

And nothing would be quite complete without being able to visualize data inline. These are the same graphs that John Dickinson was creating from graphite, except at day resolution. The data here comes from Elastic Search, so we do miss a class of failures where the console logs never make it; that difference should be small at this point.

Overall this has been a pretty fruitful experiment. Once I'm back in the saddle I'll be porting a bunch of these ideas back into Elastic Recheck itself.
I actually think this will make asking the interesting follow-on question of "why does a particular job fail 40% of the time?" easier, because we can compare it to known ER bugs, as well as figure out what our unclassified percentages look like. For anyone that wants to play, the original iPython Notebook is no longer hosted.
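Since the original notebook is gone, here's a rough Python sketch of the core transform described above. The field names (build_name, build_status) are assumptions about the elastic-recheck data rather than the real schema:

import pandas as pd

# stand-in for the massaged list of dicts from the Elastic Search response
hits = [
    {'@timestamp': '2014-07-01T12:00:00', 'build_name': 'gate-tempest-full',
     'build_status': 'SUCCESS'},
    {'@timestamp': '2014-07-01T13:00:00', 'build_name': 'gate-tempest-full',
     'build_status': 'FAILURE'},
]

df = pd.DataFrame(hits)
# datetime-convert @timestamp, then reduce resolution to day level
df['day'] = pd.to_datetime(df['@timestamp']).dt.date

# roll up total runs and successes per job per day with groupby
runs = df.groupby(['day', 'build_name']).size()
wins = df[df['build_status'] == 'SUCCESS'].groupby(['day', 'build_name']).size()

# computed column: success rate, sliced to jobs below 100%
rate = (wins / runs).fillna(0)
print(rate[rate < 1.0])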