Saturday 11 June 2011

Enter the Pipeline .... exploring Scala and Akka


Some time ago I did some interesting work with Akka. These days I don't code every day; most of what I do is in my own time, and I wanted to play with the latest Akka (1.1) and see if it was still as much fun as I remembered.

I wanted something simple to connect to Twitter, read the stream, chop it up message by message, and pass the messages to a JSON parsing actor, which in turn would pass them on to further actors for more processing. The concept can be extended in many ways.
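As a rough sketch of that shape (the message and actor names here are illustrative, not necessarily those used in the repository), each stage is just an actor that does its piece of work and forwards the result to the next:

    import akka.actor.{Actor, ActorRef}

    // Illustrative message types for the pipeline stages.
    case class RawTweet(json: String)
    case class ParsedTweet(json: String)

    // A stage processes the message and forwards the result downstream.
    class JsonParsingActor(next: ActorRef) extends Actor {
      def receive = {
        case RawTweet(json) =>
          // the real code parses the JSON (via SJSON) before forwarding
          next ! ParsedTweet(json)
      }
    }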

The codebase as it stands is quite small, which is really thanks to standing on the shoulders of giants:

 Scala 2.9.0
 Akka 1.1
 SJSON
 Signpost OAuth library
 Apache HTTP Client 4.1
 ScalaTest
 Mockito
 SBT

A few comments about the code...

Firstly, it's not production code; it is an exploration of various things. I will outline some of what I think are the most interesting parts of the code. See the bottom of this post for where to find it.

Using Either and Option

One of the things that has always created confusion is the use of exceptions for things that are not really exceptional cases. Too often we see exceptions used for alternate path selection, often the less travelled path but still a valid path.

If you expect failure, which in a networked environment we should, then you should not be using exceptions: it's not an exceptional case, it's an expected one. But how do you represent this in code? Thankfully Scala gives you Either. Your methods can use Either to indicate that they will return either an error or a result.
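A minimal sketch of what that looks like (the trait and error names here are illustrative rather than the exact ones in the code):

    import java.io.BufferedReader

    // An application-level error type for failed connections.
    case class Error(reason: String)

    // connect either fails with an Error or hands back a reader on the open stream.
    trait StreamConnector {
      def connect(): Either[Error, BufferedReader]
    }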



In the above we have defined a trait with a connect method that can return either an Error if something goes wrong, or a BufferedReader if all goes well.

Similarly you can use Option to indicate that a method may or may not return a value. This is much more meaningful than returning nulls, as you cannot be sure what a null really means, whereas an Option explicitly states that the value is optional.

And it's possible to combine the two...
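For example, a read operation might be declared like this (a sketch: ReadError is the error type, the trait and method names are illustrative):

    // Reading can fail (ReadError), return a line, or return None when
    // there is nothing more to read.
    case class ReadError(reason: String)

    trait StreamReader {
      def readLine(): Either[ReadError, Option[String]]
    }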



So in the above we are saying: expect either a ReadError if there is a problem, or 'optionally' a string.

Using 'become' to manage connection state

Akka provides the ability to change the receive method of an Actor using 'become'. This can be very useful for managing the state of an actor based on the messages it receives. In this case we have two "states", active and inActive, and the messages that are valid in each state are different. The actor moves between these states using 'become'.
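Here is a sketch of the shape (the message names are my own, and I've written the hot-swap as a context.become call; the exact call differs a little between Akka versions, but the pattern is the same):

    import akka.actor.Actor

    // Illustrative messages for the two states.
    case object Connect
    case object Disconnect
    case class Line(text: String)

    class ConnectionActor extends Actor {

      // Start inactive: the only message we react to is Connect.
      def receive = inActive

      def inActive: Receive = {
        case Connect =>
          // ... open the stream here ...
          context.become(active)
      }

      def active: Receive = {
        case Line(text) =>
          // ... hand the line on to the next stage ...
        case Disconnect =>
          // ... close the stream here ...
          context.become(inActive)
      }
    }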




Testing

I have written the tests with ScalaTest, in a Given-When-Then style. One of the challenges of testing an integration point is verifying expected behaviour without actually having to connect to a real service. Using the Akka TestKit and Mockito I have attempted to test as close to the integration as possible without actually building a mock service.

Testing logging events

Akka provides an eventing service which is an alternative to traditional logging. Eventing decouples the action of logging from the source of the event, which is very useful in many respects, including testing. It is possible to hook into the event service from the test; in this case I used an actor that simply places the events it receives on a queue. The test can then verify that the expected events appear on the queue within the expected timeframe.

Whilst it is possible to have the events sent to the TestKit testActor, I found this tended to make the test confusing, as essentially two streams of messages were being merged, which in turn made the test logic quite complex. Using a separate queue made it clearer (in my view) to follow.

Below you can see an example using Mockito as well as the event queue.
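Here is a simplified sketch of the idea, reusing the illustrative StreamConnector and Error types from the earlier sketch (the real tests are written with ScalaTest and the Akka TestKit, and the event-service registration call depends on the Akka version):

    import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
    import akka.actor.Actor
    import org.mockito.Mockito.{mock, when}

    object EventQueueTestSketch {

      // A queue the test can poll, and an actor that simply puts every
      // event it receives onto it.
      val events = new LinkedBlockingQueue[Any]()

      class EventRecorder extends Actor {
        def receive = {
          case event => events.put(event)
        }
      }
      // An EventRecorder instance would be registered as a listener with the
      // Akka event service (that registration call is version specific).

      def sketch() {
        // Stub the connector with Mockito so no real network connection is made.
        val connector = mock(classOf[StreamConnector])
        when(connector.connect()).thenReturn(Left(Error("no network in tests")))

        // ... exercise the actor under test with the stubbed connector ...

        // Then verify the expected event arrives on the queue within a timeout.
        val event = events.poll(3, TimeUnit.SECONDS)
        assert(event != null, "expected an error event to be logged")
      }
    }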



How to use it

Warning: whilst using this code for testing and experimentation does not (to my understanding) break any Twitter T&Cs, you should satisfy yourself of that should you decide to use this code or to base any future code upon it.

You will need to register an application with Twitter, for which you will need a Twitter account. Log in and then go to https://dev.twitter.com/apps

Here you can register a new application, then make note of the consumerKey and consumerSecret. You will also need to make note of your Access Token (oauth_token) and Access Token Secret (oauth_token_secret).

Then you need to clone the source code and build it using sbt (http://code.google.com/p/simple-build-tool/)

The following transcript shows how to use it.



If you want to play with the code in IntelliJ, just run 'sbt idea' on the command line. The code is tagged 'blog_post' as at the time of writing this post.

Wednesday 27 April 2011

When the pressure is on

Ever been in that situation: it’s a deadline, you are under the gun, things are going wrong.  You are putting in the long hours, but no matter how hard you work things seem out of control.  I’ve been in that situation a few times.  I’ve worked in a couple of start-ups, and one in particular had a couple of occasions where it almost all went wrong; we pulled it in, but boy it was close.

So you think, we’ve done it before and we didn’t fail, we made that release or we went live with that customer... or maybe you didn’t... or maybe you did but in the process you burnt the relationship with the client... or burnt yourself (or your team) out, or maybe you found yourself in firefighting mode for weeks post release, scrambling to keep things running behind the scenes.


The problem is it can be addictive, it can be a rush to pull that rabbit out of the hat, and it tends to propagate the Hero culture and then it feeds on itself.  Eventually this will lead to an implosion where projects and possibly the company will fail.

How do we break that cycle, given that we are in the middle of it?

[Context note: this is aimed more at small teams/startups, and I am not going to go into how to avoid the situation in this post; it is about the situation at hand.]

First, find some calm.


 

Second, step back, go for a walk, go to a cafe, take a nap - do what you need to do in order to get some effective head space.

Third, you need to plan, and I don’t mean a detailed plan, I mean just enough of a plan to keep you on the right track, a meta-plan if you like, something to guide you.

So you are now calm: you can judge risk more clearly, you can make decisions, and you will make fewer mistakes.

If you are still doing features I recommend setting up a really simple “Kanban” board.  Something like [ backlog | dev | test | deploy | accept ], put up a few of the most important/riskiest tasks (negotiate, argue, decide amongst the team and the product owner) on the left, and pull them across one at a time right through to the end before starting the next - don't be tempted to work on too many at once, more team members than cards in flight is a good rule of thumb.  When that bunch is done, put up the next few important and repeat, keep the number of cards on the board small, don’t overwhelm it or yourselves.  Don’t get hung up on the column names and the card content/structure - just do the minimum to keep a tab on what you are doing, adjust as appropriate.

It’s likely that you are close to release; perhaps you’re stuck in a nasty bug/quick-fix cycle.  “This change will fix it, ah no, damn, ah, try this, and this..... “ Time to break the cycle.  This sort of thing will destroy your client’s confidence and your credibility, and damage the likelihood of future work.

So how do you break out of this?  Don’t guess; be data driven, work from facts and not supposition - that means work from test results.  Ideally you’ll be using TDD or BDD style methods, but if you aren’t you can still leverage some of those ideas to help break the cycle.  Reproduce failures and, if possible, work out how to automate that test - even a simple bash script using curl and grep with a simple assertion (exit 1) is better than nothing, and will help prevent that problem creeping back in later.  You can take these tests and later build them into a proper BDD suite.

Take small steps: when confidence is low, move slow.  By taking small steps you can move forward confidently, as you take less risk and it’s easier to roll back to a known, or at least better known, state.

“But I don’t have an environment to test in” … so this is a huge risk, usually it is actually the case that the live environment is very different to the test one.  Time to change that, a few hours (even days) will make all the difference.  “We don’t have the hardware” … sorry that excuse is running out, AWS and other cloud providers allow you to put together a lot of hardware quickly at pretty reasonable cost that you can turn off when done - a few hundred $/£/€ may be worth it if you can solve the problems effectively.

Remember it’s not easy; if it was, someone would have already done it, so looking for silver bullets is counterproductive.  Whilst you can get away with cutting some corners (with acceptable consequences, usually in the form of technical debt), there are some you can’t - sometimes you just have to accept spending the time and money to make sure you can test what you are doing adequately.  Keep a note of those decisions, where you have incurred technical debt, so it can be addressed later.

The important thing is to maintain perspective.  Yes, things are bad, but it’s possible to bring them under control, and when things are in control you can make better decisions, including clear decisions on trade-offs.

Setting up a ‘war room’ is sometimes a good idea, as long as it promotes that calm control and doesn’t become a yelling chaos factory.  Protect your space; make it clear that you don’t want those who will disturb the group’s balance invading that space.  Agree to provide status updates on your terms - ideally your simple Kanban board will give enough of that status for you!

Protect your relationship with the client/customer - keep a clear consistent communication channel, one person to inform, update and ask questions.  I sometimes call this an Anchor, like a news Anchor they are there to keep consistency and to keep it (the channel) together under pressure.  They can also help prevent the ‘quick fix/release/break’ cycle by keeping check on the team output.  They should be professional, keep emotion aside (including not reacting to emotion from the other side) and try to build a collaborative approach.

“It’s not our fault it’s broken … it works for us” … this is a very common attitude; however, whilst at the time you may think you are right, all too often you turn out not to be.  That is embarrassing and possibly worse.  If it’s not your fault, be prepared to prove it.  If you can’t prove it, and more importantly if you can’t prove it in the environment your client is seeing it in, then accept it’s more likely your problem than theirs.  The worst thing about this sort of response is that it breaks confidence and reduces trust, because once you do it once it becomes all the more difficult to work closely with the client to solve the next problem.

No plan survives contact with the enemy (http://bit.ly/kl5OgR) be prepared to adapt as necessary.  Periodically step back, breathe and review the board.  Take a quick poll of the team, chat to the client and sanity check your priorities.

One boss early in my career gave me this advice, “In the next quarter they are not going to remember that you were two weeks late, but they are going to remember if you delivered them a steaming turd on time”.

Nowadays I view this advice in terms of the time/scope/cost triangle - they won’t remember if you were missing a couple of features, but they will if it didn’t work at all.

The goal is to move toward sustainable development and not to be in this situation to begin with.  But things don’t always work out: you and your team may not have the experience or the available resources, or you may just be moving too fast so as not to miss the opportunity.  Sometimes you find yourself in a firefight and you need some help with the hose, not a lecture on the dangers of playing with matches.

Importantly, when the dust settles a bit, make the time to look back, do a retrospective, try to work out where things went wrong in the first place and look to improve.

Wednesday 9 February 2011

Estimation? No, a distribution of confidence ...



I have always been wary of estimation... 


Well no, that's not quite true, I wasn't until I gave an estimate to a salesman, and then I found myself having to meet a hard release date... once bitten... twice bitten... well in reality I am a slow learner so it was probably a couple of dozen times.


Most estimates are really figures plucked from the air by a developer, or worse by someone who doesn't actually work on the code, or, far far worse, by someone who has never even seen the code.


So here are my current thoughts on estimation... not fully formed, and probably poorly communicated, but I still think worth sharing, if only to improve my own understanding.




Even with Agile poker planning methods, estimation is still a dubious business.


An estimation is only valuable if you are aware of the context in which the estimation was made.  Then you must remember that estimation is exactly that, an estimation - it is in fact some sort of distribution curve of the confidence of the person making the estimation.


Huh?

What if you plot the 'confidence' of the person making the estimation against the possible solution time? Typically this might look something like this.





The estimate they give should be around the peak of this curve - well, pessimistic estimators will sit a bit to the right to give themselves some margin, and optimistic ones a bit to the left.


Now the shape of this curve is what is interesting.  It will be determined by how 'known' the problem is.  So a really well known task may have a graph like so:








Conversely, a really poorly understood problem may have a distribution like:






At either end of the spectrum shown here estimation is pointless.


Either you know the problem so well that time spent estimating is actually waste or you have no idea of the problem so it becomes essentially a random guess. A random guess is more dangerous than not estimating at all.


So estimation is of most value in this middle region, where you understand the problem reasonably well, but not so well that estimation is just waste.  A Goldilocks zone, if you like.

Good poker planning sessions often are more about the conversations and the discovery around a problem.  These discussions attempt to drag the flat distribution into a more normal one where it is sufficient to make an estimate.


One of the things I have seen here is that some teams then get into an obsessive estimation-perfection loop, where every retrospective has a key outcome of 'we must improve our estimation'.  And I've seen teams haggle for what feels like hours over whether something really is a 7-hour or a 13-hour story.


Their key outcome is wrong; they are focusing on the wrong thing.  The real outcome should be: we should improve our understanding of the problems at hand.


I think that one way to avoid this obsession is to use T-Shirt sizing - step away from the numbers and use more coarse grain concepts like Small, Medium, Large.


Note that I think story points are better than real time measures but I still think the psychology of dealing with the numbers is the same.

T-Shirt sizes are coarse, deliberately so.  You can start with a rough equivalent to 1 day, 3 day, 5 day - but they should not be thought of in that way when doing the estimation - you just try to group them into general categories of S, M, L.


Over time you can measure what your outputs are, and then you can get average times for each of your sizes and these then can be used by those interested in trying to predict the future.


So estimation or planning sessions should really be about discovery, learning and then putting the problem into some general category.  Importantly, don't be afraid to turn to the group and say: hang on, I don't think we can even categorise this, it requires better understanding, more learning - let's take it out of this process and do some more focused work.


And even if you don't change a single thing about the way you work, just think a little about the way you work.  Remind yourself that estimation is just that, estimation - and, just as importantly, remind those who use your estimates.

Saturday 29 January 2011

The Failure of the Skunkworks

Something I have seen as quite a common approach in an organisation is to start a skunkworks project to get something critical off the ground. I have been involved in several in the course of my career, with varying success.

Typically these initiatives are actually done for the wrong reasons, but that’s not what I am going to talk about here.

What I have noticed is that more often than not, a skunkworks project is started; a small, skilled and motivated team is split off, isolated and allowed to focus on a specific issue. This (sometimes) produces fantastic results: the team is hugely productive and makes brilliant in-roads into the issue. Their progress reports are astounding and promise great things.

Then they reach that critical juncture and the team is brought back into the fold. Unfortunately, more often than not, things then start to unravel.  The business views the final outcome as, if not a failure, then a partial one.

Looking at what has happened we can find a lot of potential ‘reasons’ for this:
  • The skunkworks team just failed, they misreported their progress, they built false hope.
  • The process of creating a skunkworks team created too much animosity, those left behind were resentful and they then either consciously or unconsciously sabotaged the final integration of the solution.
  • The team produced something which whilst it solved the problem, couldn’t realistically be integrated into the ‘real’ system, it was either too advanced or couldn’t be understood by the mere mortals expected to continue or support the work.

Regardless of these and many other possible reasons I actually think there is something very different at work here – I think the perceptions of the above don’t take into account something  fundamental.

When you split off your skunk-works team, you removed them from your ‘System’. You gave them freedom, they probably self-organised, they were smart and motivated.  In their work they created their own little system in a bubble, one that suited their needs and as a result their productivity exploded. They probably even ignored the remnants of the system they were supposed to use (timesheets... nah can’t be bothered waste of time ...) and got away with it, because they were the skunkworks team, they were special.

Then you dragged them out of that bubble.

Pulled back into the ‘System’, you crushed them: that bubble dissolved, productivity plummeted and problems rocketed – you re-imposed the very constraints which held them back in the first place.

Chances are some of that team you pampered by allowing that microcosm walked out the door, they saw what was possible and had it cruelly ripped out from under them. The ‘System’ defeated them.

You see the huge gains in productivity were not down to isolating them, giving them their own coffee machine or supplying them with pizza late into the evenings. Sure they loved that stuff, but what they really loved was the freedom – the new system they created specifically to meet their goals.

By System, I don't mean process (Waterfall, SCRUM, etc) , I don't mean the Feng Shui of the office, I don't mean free pizza - the system encompasses all of those things and much more.  In turn the system produces a culture, and that culture plays a part in driving the behaviour of the people who work there.

It's an extremely complex system with positive and negative feedback loops and a million variables.  More people, more variables, more loops.

I don’t have a magical solution if you are considering a skunkworks project or already have one; every situation is different.  But I do suggest you step back and really think about why there needs to be one.  I am betting that stepping back and thinking about the way you work, the system and its culture, will help you find alternative paths to solve your problems.



Saturday 27 November 2010

The mysterious case of the slow Alpha

One of the questions I like to ask candidates during the interview process is, "What is the most memorable problem you have had to solve?". Often I am disappointed in the answers I get, and that's with candidates with 10+ years of experience, whom I would have expected to have come across some gnarly problem they had to solve.

Thinking about the question from my own experience, I have several answers, but one in particular, now over 10 years old, I thought was worth documenting before it fades from my memory.

In the late 90's and early 00's I worked in a startup in the telephony space, or more correctly the computer telephony integration space. It was a niche that was growing fast and one we were doing very well in. We had secured a contract with DEC (just as they merged with Compaq, which has since merged with HP) to port our software onto DECUnix (later Tru64). We had been running on Intel based systems for some time, using SCO Unix, Linux and Solaris. I had joined the team just after the port had been done; we had a couple of systems live and a couple running in labs.

There were some concerns raised on performance during integration testing with a client, and we embarked on some benchmarking. We pitted a brand new DS-20 on loan from DEC against our latest and greatest PIII-500 - in theory the DS-20 should have left the P-III in its dust. But it didn't, it didn't even come close to matching it, in fact in some tests it came in at almost 50% slower.

There was much gnashing of teeth. We were relying on the higher performance of the DS-20 to drive our SS7/ISUP solution into new markets at a scale our competitors couldn't match.

I was still pretty early in my career, wet behind the ears if you like. I set about looking at what changes had been made during the port, and after a few days could not find any significant ones; those which had been made I micro-benchmarked and couldn't find any issue with. We tried tuning the DECUnix kernel, we spoke to Compaq and adjusted everything to what was considered optimal for our situation - no real change.

Tempers were rising, the client was threatening to pull the plug, and my boss and I were spending hours on end stepping through code, testing and tuning, all to no avail. Eventually our CTO rang Compaq and gave them a barrage about how **** their Alphas were and how we were so disappointed with their performance that we were going to drop our port and withdraw from the contract. After some toing and froing, Compaq flew an engineer out to take a look at our benchmarks and to see if they could help us out.

So the engineer arrived, I walked him through our tests, ran the benchmark and showed him the outputs. He looked over the tuning we'd done and agreed we'd done all the right things. I asked if he'd like to see the code; he said not yet, instead let's profile it. Profile? I'd not really thought of doing that on the Alpha - we'd profiled on the Intel - OK, so how can we do that on DECUnix? He spent a few minutes showing me how, I did a fresh build with profiling enabled, and then we re-ran the benchmark with profiling on. As expected it was slower still, but we generated a profile.

Right lets take a look.

The engineer looks at the results and within 30 seconds he said, 'Any reason you are opening /dev/null 11 million times during that test?'....huh, WTF?!

I looked through the application code: no references to '/dev/null', so it must have come from somewhere else. Some find, strings and grep later we found the culprit, a shared library with not much in it - but it wasn't a DEC library. A search through our CVS repository and I found a small module written by my predecessor, who had left a few days after I started - just after he'd finished setting up the new server.

In that module was a single function snprintf ... and a lot of comments abusing DEC for not including this function in their standard libraries......

And in that function was the reference to /dev/null

He had implemented his own snprintf, and thought he'd done a smart job. He was using the fact that fprintf returns the length of what it wrote, so in order to determine the length of the input he would open /dev/null, fprintf to it, get the length, close /dev/null and then use that length to determine whether the input needed truncation before calling sprintf. Oh crap.

Some quick hacking, a fresh build, and bang our Alpha was now screaming along just under twice as fast as the Intel.

Red faced, I thanked the engineer, and slunk off to explain to the CTO where the problem was. A day or so later we shipped a new version to our client using the GNU library instead of our insane one.

Contracts saved, faces red and a lot learnt.