Writings

November 14, 2013 · Personal · personal · ibm

New Adventures...

Friday will be my last day at IBM. It's been an incredible run here, but, as they say, all good things.... I've had a pretty amazing run at IBM. From the Sydney Olympics to OpenStack, with tons of really cool projects in between. I got to call the Linux Technology Center my home over the last 13 years, an organization that led the charge in getting IBM involved in Open Source projects far and wide. A place where I had the chance to work with many of my Open Source heroes. And an organization that dramatically changed the culture of development at IBM, very much for the better.

And starting Monday, I'm going to do much of that again, in my new role at Samsung Research America's Open Source Group. Over the past year Samsung has been building up the OSG very much along the lines of IBM's Linux Technology Center or Intel's Open Source Technology Center: a center of competency for Open Source skills within the company, and a pool of extremely talented upstream developers across a range of projects. I'm extremely excited to be a part of that.

I've been telling a few folks over the last couple of weeks, and because they are FAQs, here are the first few questions everyone jumps to:

Are you still going to work on OpenStack? Yes. My role in the OpenStack community, and the amount of time I'm spending upstream won't change in any real way. Honestly, I'll probably have more time upstream. So I'll be in Atlanta and Paris, have plenty of time for my TC and QA PTL duties, and be slinging code, ideas, and debugging the gate just like I am today.

Are you moving? No. We really like where we live. The Hudson Valley is definitely home, and moving is something that was never on the table. So I'll become a full time work from home person. I've got a great home office, will be running the woodstove during the cold months, and will go for long midday bike rides over the Walkway during the warm ones.

There are tons of unknowns as part of this new adventure, but all my interactions with the new team so far have made me extremely confident and excited that it will be a great fit. I'll be out in San Jose all next week doing initial orientation and team sync, so if you happen to be in the area, let me know.

October 28, 2013 · Technology · systems · infrastructure

Saving the subway during Sandy

A pretty compelling read in the NY Times Magazine about the efforts to save the Subway system during Sandy last year. The thing I found most interesting was the institutional memory of the system, and how critical that was to the success of the effort:

The most important thing you won’t be able to see in the next hurricane is the experience that a disaster brings with it. It is not as eye-catching as an experimental tunnel balloon, but Sandy showed that experience is the system’s most vital asset, the experience of people who knew how to deal with the giant pumping and breathing and excreting 660-mile-long passenger-rail system that shuttles a fleet of underground trains a cumulative 341 million miles a year. If a storm bigger than Sandy comes, nobody knows for sure what will happen, even with Slosh maps. But the M.T.A. will draw on the experience of those who have been through some version of it before. ...

Longtime transit employees like to say that in the ’70s, when transit had very little financing, the Stock Exchange didn’t open some days because the subway couldn’t get the workers to the floor. Back then, as after Sandy, what kept the system going was the passion of beleaguered employees — train nerds, really.

“There is a hypothetical R.C.I. in the field today,” Calandrella says, referring to a road car inspector, “who remembers having to fix five trains a day at each station in one location. There is not somebody with five years of experience who knows how to do that.”

Institutional memory, tribal knowledge, and shared culture are huge factors in overall success, and it's nice to see that called out on something as big as saving the NYC subway system.

October 27, 2013 · Technology · openstack

OpenStack CI by the numbers

For the week of Monday Oct 20th to Sunday Oct 27th (partial, it's still Sunday morning).

34894 - # of test jobs run by the CI system
25316 - # of devstack clouds created by the CI system
- 8254 - # of large ops runs (devstack + fake virt driver + limited tempest test to drive it)
- 940 - # of swift functional runs (devstack + swift functional tests)
- 16122 - # of runs that do some level of devstack + tempest with libvirt qemu
508536 - # of qemu guests started (successfully) in the devstack clouds
128 - Max # of libvirt qemu guests started (successfully) in a single test run

Things that you figure out when you are playing with elastic search on a Sunday morning. One of the surprises for me was how much we use devstack outside of the base tempest run use case, and that our max # of guests spawned in a run is now over 100.

Update: Clark Boylan correctly pointed out that our guests are 2nd level, on platforms that don't support nested kvm, and thus are libvirt/qemu guests not libvirt/kvm guests.
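The numbers above hang together, which is easy to check. The sketch below confirms that the three kinds of devstack run add up to the devstack total, and works out a rough average of guests per run. The average rests on an assumption that isn't stated above: that every qemu guest was started in one of the devstack + tempest runs (large ops uses a fake virt driver, and the swift functional runs are assumed not to boot guests). The variable names are made up for illustration.

```python
# Figures copied from the list above.
large_ops_runs = 8254
swift_functional_runs = 940
tempest_qemu_runs = 16122
devstack_clouds = 25316
qemu_guests = 508536

# The three run types should account for every devstack cloud created.
breakdown_total = large_ops_runs + swift_functional_runs + tempest_qemu_runs
print(breakdown_total == devstack_clouds)

# ASSUMPTION: all qemu guests come from the devstack + tempest runs.
guests_per_run = qemu_guests / tempest_qemu_runs
print(f"average guests per tempest run: {guests_per_run:.1f}")
```

Under that assumption the average comes out at a bit over 31 guests per run, against a max of 128 in a single run.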

October 26, 2013 · Technology · hudson-valley · history

Silent Film Night

One of my favorite things about Poughkeepsie is the Bardavon Theatre, which houses one of the few operational Wurlitzer theatre organs. We are regulars at the Bardavon film series, which typically would have organ music played before the event. For the 2001 screening it was a medley of sci fi themes, which was just incredible.

A couple of years ago they stepped up their game, when they did Nosferatu (1922) for Halloween, with live organ accompaniment. This has apparently become a Halloween tradition. Last year we had Jekyll and Hyde (1920), and this year (last night), the Hunchback of Notre Dame (1923).

I really enjoy the silent films of this era, as they are a whole other form of art that we really don't have any more. And sitting there, in a theatre that was rebuilt to have that organ in the 1920s, you get glimpses of a time long past. Of the people that would have gone there nearly a hundred years ago to watch these movies when they first came out.

Here's looking forward to next year's silent film, whatever that might be.

October 21, 2013 · OpenStack · openstack · software · longform

OpenStack Havana - the Quality Perspective

Like a lot of others, I'm currently trying to catch my breath after the incredible OpenStack Havana release. One of the key reasons that OpenStack is able to evolve as fast as it does, and the whole thing not fall apart, is because of the incredible preemptive integration gate that we have (think continuous integration++).

In Havana, beyond just increasing the number of tests we run, we made some changes in the nature of what we do in the gate. These changes are easy to overlook, so I wanted to highlight some of my favorites, and give a perspective on everything that's going on behind the scenes when you try to land code in OpenStack.

Parallel Test Runner

Every proposed commit to an OpenStack project needs to survive being integrated into a single node devstack install, and hit with 1300 API & integration tests from Tempest, but until Havana, these were run serially. Right before the Havana 3 milestone we merged parallel tempest testing for most of our jobs. This cut their run time in half, but more importantly it meant all our testing was defaulting to 4 simultaneous requests, as well as running every test under tenant isolation, where a separate tenant is created for every test group. Every time you ratchet up testing like this you expose new race conditions, which is exactly what we saw. That made for a rough RC phase (the gate was a sad panda for many days), but everyone buckled down to get these new issues fixed, which were previously only visible to large OpenStack installations. The result: everyone wins.

This work was a long time coming, and had been started in the Grizzly cycle by Chris Yeoh, and spearheaded to completion by Matt Treinish.

Large Ops Testing

A really clever idea was spawned this summer by Joe Gordon: could we actually manage to run Tempest tests on a devstack with a fake virt driver that would always "succeed" and do so instantaneously? In doing so we could turn the pressure up on the control plane in OpenStack without the overhead of real virt drivers slowing down control plane execution enough that bugs could hide. Again, the first time we cranked this to 11, lots of interesting results fell out, including some timeout and deadlock situations. All hands went on deck, the issues were addressed, and now Large Ops Testing is part of our arsenal, run on every single proposed commit.

Upgrade Testing

Most people familiar with OpenStack are familiar with Devstack, the opinionated installer for OpenStack from upstream git. Devstack actually forms the base of our QA system, because it can build a single node environment from git trees. Lesser known is its sister tool, Grenade. Grenade uses 2 devstack trees (the last stable and master) to build an OpenStack at the previous version, inject some data, then shut down everything, and try to restart it with the latest version of OpenStack. This ensures config files roll forward smoothly (or have specific minimal upgrade scripts in Grenade), database schemas roll forward smoothly, and that we don't violate certain deprecation guarantees.

Grenade was created by Dean Troyer. I did a lot of work towards the end of Grizzly to get it in the gate, and Adalberto Medeiros took it the final mile in Havana and got this to be something running on every proposed commit.

New Tools for an Asynchronous World

September was the 30th anniversary of the GNU project. I remember some time in the late 90s reading or watching something about Richard Stallman and GNU Hurd. The biggest challenge of building a system with dozens of daemons sending asynchronous messages is having any idea what broke when something goes wrong. They just didn't have the tools or methods to make consistent forward progress. Linux emerged with a simpler model which could make progress, and the rest is history.

If you zoom back on OpenStack, this is exactly what we are building. A data center OS micro kernel. And as I can attest, debugging is often "interesting". Without the preemptive integration system, we'd never be able to keep up our rate of change. However, as the number of integrated projects has increased we've definitely seen emergent behavior that is not straightforward to track down.

Jobs in our gate will fail, seemingly at random. People unfamiliar with the situation will complain about "flakey tests" or a "flakey gate", and just recheck their patch and see it pass on the second attempt. Most of the time neither the gate nor the tests are to blame, but the core of OpenStack itself. We managed to trigger a race condition, that maybe shows up 1% of the time in our configuration. We have moved to a world where test results aren't binary, pass or fail, but better classified with a race percentage.
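To make the "race percentage" idea concrete, here is a minimal sketch of the arithmetic. It assumes each race fires independently on each job, which is a simplification, and the function names are invented for this example rather than taken from any OpenStack tool.

```python
import math


def race_percentage(failures, runs):
    """Observed share of runs, as a percentage, that hit a given race."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 100.0 * failures / runs


def pass_probability(race_rates, jobs=1):
    """Chance that every job passes, ASSUMING races fire independently per job."""
    per_job = math.prod(1.0 - rate for rate in race_rates)
    return per_job ** jobs


# One race that shows up 1% of the time barely registers on a single job.
print(pass_probability([0.01]))

# A hypothetical 24 such races stacked together is a different story.
print(pass_probability([0.01] * 24))
```

With a single 1% race a job passes 99% of the time. With a couple of dozen of them the same job passes only about 79% of the time, which is why a recheck so often "fixes" things and why the underlying races still matter.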

This is a problem we've been mulling over for nearly a year, and the solution which has been created is ElasticRecheck, a toolchain that uses Elastic Search on our test logs to check new failures against known failures. While finding a "fingerprint" for a failure is still a manual step, it was still of dramatic benefit for the release process. It got us out of thinking that there were only a couple of race conditions we were hitting, and realizing there were dozens of very specific races, each with their own fix. It also gave us a systematic way of determining which race conditions were most impacting us, so they could be prioritized and fixed.
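The fingerprint idea can be sketched in a few lines. This is not ElasticRecheck's actual implementation: it uses plain regular expressions over in-memory log text instead of Elastic Search, and the bug labels and log messages are made up for illustration.

```python
import re
from collections import Counter

# HYPOTHETICAL fingerprints: a made-up bug label mapped to a pattern that
# identifies that specific failure in a test log.
FINGERPRINTS = {
    "bug-A": re.compile(r"Timed out waiting for server \S+ to become ACTIVE"),
    "bug-B": re.compile(r"Lock wait timeout exceeded"),
}


def classify(log_text):
    """Return the label of every known fingerprint found in one failed run's log."""
    return [label for label, pattern in FINGERPRINTS.items()
            if pattern.search(log_text)]


def rank_known_failures(failed_logs):
    """Count hits per fingerprint, most frequent first, plus unclassified runs."""
    counts = Counter()
    unclassified = 0
    for log_text in failed_logs:
        labels = classify(log_text)
        if labels:
            counts.update(labels)
        else:
            unclassified += 1
    return counts.most_common(), unclassified
```

Ranking by hit count is what gives the prioritization described above, and the unclassified count shows how many failures still need someone to work out a fingerprint by hand.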

This work was spearheaded by Joe Gordon and Matt Treinish, and leveraged some background work that Clark Boylan and I had done early in the cycle. ElasticRecheck is exciting enough technology all by itself that it deserves its own detailed dive. But that is for another day.

And many more...

These are just some of the sexiest highlights from the Havana release on the quality front.

The number of tests in Tempest that we run on every proposed patch has risen from 800 to 1300 during the cycle. This included new scenarios and a massive enhancement on coverage in all our services. 100 different developers contributed to Tempest during the Havana release (up from 60 in the Grizzly release), enhancing our integration suite. We've got a new stress framework which can provide load generation to burn in your cloud, which I expect will make an appearance in our gate during Icehouse.

The point being, lots of people, from lots of places, contributed heavily to make the Havana release the most solid release we've ever had from OpenStack. They did this not just with new features that make for good press releases, they also did this with contributions to the overall system that validates our software not once a day, not even once an hour, but on every single proposed patch.

So to everyone that contributed in this extraordinary effort: THANK YOU!

And I look forward, excitedly, to what we'll create for the Icehouse release.

September 27, 2013 · Technology · openstack · software

Gerrit queries to avoid OpenStack review overload

As with many OpenStack core reviewers, my review queue can be completely overwhelming, often 300 - 400 active reviews that I have +2 / -2 authority on. It's really easy to get discouraged on a list that big. Fortunately there are ways to trim that down.

Gerrit provides a simple query language to select which reviews you see, using the query bar in the top right of the page.

The way this works is by adding criteria into the search box, which by default are ANDed together to get the final results. In the process these queries change the URL for Gerrit, so you can bookmark the resultant queries for easy access later.

Restricting to Single Project (and pulling your own stuff)

This query is basically what you get when you click on a project link:

status:open project:openstack/tempest

Nothing special, but you can go one step further by removing yourself from the list of reviews:

status:open project:openstack/tempest -owner:sdague@linux.vnet.ibm.com

This also demonstrates that we can have both positive criteria and negative criteria.

Little Lost Projects (don't lose the little ones)

In addition to having +2 on nova, devstack, tempest, I've got it on a bunch of smaller projects, which I often forget I need to go review. You can build a single query that has all your little lost projects in a single list:

status:open (project:openstack-dev/hacking OR project:openstack-dev/grenade)

No Objections

You can also filter based on votes in the various columns. It's not nearly as detailed as I'd like, but it is still useful. I have a basic query for No Objections on most projects that I review which looks something like this:

status:open project:openstack/tempest -Verified-1 -CodeReview-1 -CodeReview-2

This removes all reviews that have a current -1 in Verified column, and a -1 or -2 in the CodeReview column. So patches with negative feedback are dropped from view. The top of your review list may contain patches that haven't cleared CI yet, but that's easy to see. There might also be Jenkins -2 reviews in this list, but gate failed merges can usually use extra eyes.

I consider this a base list of patches that I have no reason not to be reviewing.

Potential Merges

I'm typically up and at my computer at 7am EST, which is often a very slow time for zuul. So one of the things I look for is code that only requires one more +2 to go to merge on projects like Nova. Many of these are easy to review fixes, and clear the decks before the queue gets busy in the afternoon.

status:open -Verified-1 CodeReview+2 -CodeReview-1 -CodeReview-2 (project:openstack/nova OR project:openstack/python-novaclient)

Like the last one, we are filtering out all patches with negative feedback, but also requiring that there is an active +2 on the patch. I also make sure to do this for both nova and python-novaclient, which often gets lost in the noise.

Lost Patches

Especially in Nova it's easy for a patch to get lost, as there are so many of them. I define lost as a patch that's passed CI, but has no feedback in code review.

status:open -Verified-1 Verified+1 -CodeReview+2 -CodeReview+1 -CodeReview-1 -CodeReview-2 (project:openstack/nova OR project:openstack/python-novaclient) branch:master

These patches are often from newer folks on the project, and as such often need more time, so I typically only go after lost patches if I know I can set aside a solid hour on them. However I try hard to get to this query at least once a week, to make sure things don't get fully lost, as a -1 will give the patch originator feedback to work on, and a +2 will make it far more likely to get the attention of other core reviewers when they are looking for mergeable code.

Experimenting with your own

The gerrit query language is somewhat limited (full docs are online), so it can't do everything I'd like, but even just these few slices make it easier to be able to get into a certain mindset for reviewing different slices of code. I have a toolbar folder full of bookmarks for these slices on different projects to do just that.
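If you keep the same slices for several projects, the query strings can be generated rather than hand-edited. The sketch below only assembles strings. It does not talk to Gerrit, and `project_filter` and `build_query` are helper names invented for this example, not part of any Gerrit client.

```python
def project_filter(*projects):
    """OR one or more projects together, parenthesized when there are several."""
    if not projects:
        raise ValueError("at least one project is required")
    terms = [f"project:{name}" for name in projects]
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


def build_query(*criteria):
    """Join criteria with spaces, which Gerrit ANDs together by default."""
    return " ".join(criteria)


# Vote filters copied from the queries above.
NO_NEGATIVE_VOTES = ("-Verified-1", "-CodeReview-1", "-CodeReview-2")

nova = project_filter("openstack/nova", "openstack/python-novaclient")

no_objections = build_query(
    "status:open", project_filter("openstack/tempest"), *NO_NEGATIVE_VOTES
)
potential_merges = build_query(
    "status:open", "-Verified-1", "CodeReview+2",
    "-CodeReview-1", "-CodeReview-2", nova,
)

print(no_objections)
print(potential_merges)
```

The two printed strings match the No Objections and Potential Merges queries above, so swapping in another project name gives the same slice for a different review queue.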

If you have other gerrit queries you regularly use, please leave a comment. Would love to see the ways other folks optimize gerrit for their workload.

September 9, 2013 · Technology · software

Un-DRMing the old fashioned way

... with robots... wait, what?

Peter Purgathofer, an associate professor at Vienna University of Technology, built a Lego Mindstorms robot that presses "next page" on his Kindle repeatedly while it faces his laptop's webcam. The cam snaps a picture of each screen and saves it to a folder that is automatically processed through an online optical character recognition program. The result is an automated means of redigitizing DRM-crippled ebooks in a clear digital format. It's clunky compared to simply removing the DRM using common software, but unlike those DRM-circumvention tools, this setup does not violate the law.

You can read more about it over at Boingboing.

September 4, 2013 · Technology · hudson-valley

Harvest time

I love this time of the year from the Poughkeepsie Farm Project.

September 3, 2013 · Technology · travel · personal

Coyote Instructions

The most memorable sign from our vacation, presented without additional comment.

September 2, 2013 · Personal · travel · personal

Vacation Toolbox: Eyefi Card

[Eyefi](http://www.amazon.com/gp/product/B002UT42UI/ref=as_li_ss_il?ie=UTF8&camp=1789&creative=390957&creativeASIN=B002UT42UI&linkCode=as2&tag=seasmenwal-20) is a really cool idea: add a wifi chip into a standard SD card, so that when you walk into your house and turn on your camera, all your photos are synced to the cloud. During this trip I finally also got the mobile link mode working, where your phone acts as a relay, which meant from Halifax on I was posting pictures up to friends on facebook during the trip. It does this by starting up a wifi hotspot after the camera has been on for 30 seconds (and hasn't found a wifi network it knows). You set up that hotspot on your phone, and because the devices are so close, it's the dominant signal, and your phone auto jumps to it. It pulls down all the photos not yet synced, then the AP goes silent, and your phone goes back to normal. Next time you wander onto a wifi network, your phone then relays those up to your chosen cloud sync point (many are supported, including google, facebook, smugmug, flickr, or your own gallery 3 installation).

If you have a camera, you should get an eyefi card. I really can't imagine going back to having to sync photos through a PC any more.