Tuesday, September 27, 2016

Adventures in Databasing

AmandaDB

Some time ago (a year, come next January), I began working on a database called amandadb.  The goal was to store daily stock values, with the intent of using machine learning techniques to extract meaning from them.  It was never supposed to be a "real" database, but I ran into some practical problems along the way that pushed me in that direction.  In this post, I'm going to share some of the creation process, what worked, and what didn't, to move this project along.  I'm kind of proud of where it is right now, but we'll get there.

Lesson 1: Don't be all things to all people...

Like I mentioned earlier, this was a decision I made early on.  I had a very simple goal in mind: take (specifically Yahoo) stock tickers daily and put them in persistent storage, then retrieve those data in a way that's fast and allows me to do machine learning on them.  At that point, MySQL or SQL Server seemed like overkill.  I was literally going to dump the data for each stock ticker into a file on the hard disk, no problem.  I got this working in less than a day, and then I realized something: it took a long time to read those data back.  Though I had managed to persist data, I had serious doubts as to whether it would hold up under the weight of analysis.

Let me clarify.  I wanted to slice and dice different ticker values over different ranges of time.  I was polling for the entire spread of S&P 500 stocks (so, about 500 of them).  It turned out that the Yahoo Finance API allowed free downloads multiple times a minute, up to a cap (I think 20k queries daily at inception).  This meant that I could collect more than I originally planned (remember, daily only), and I wanted to use all of those data too.  Do the math, and you can see that it adds up pretty quickly.  My plan was to aggregate a year's worth of data and use it for some k-means, PCA, and genetic algorithm work, and eventually perhaps to write my first neural network against it.  Finding the specific records I wanted was going to be tough.

Just to give you an idea of what I was up against: imagine doing essentially a table scan across 20,000 records per day, times 365 days per year.  That's 7,300,000 records being scanned every time I want to do some data mining.  Ridiculous.

So I hit a roadblock.  Stop?  No way, I was having way too much fun at this point.  "Don't be all things to all people" paid off here: by limiting my interest to stock tickers only at first, I got quickly to some interesting problems, ones that I thought I already knew how to solve.

Lesson 2:  When in doubt, add an index...

Or add two.  This is when I started factoring my code out into the first rev of amandadb, precisely because I wanted to add an index.  I realized at that point that I was basically starting down the path of building my own database.  It was still going to be pretty minimalist (just an index and some files), so no need to reach for one of the third-party solutions yet.  For the index, I just used a sorted dictionary and serialized it to disk.  This worked well, and drastically improved my access times from something ridiculous to around 7ms per record when selecting a range of data.  Pretty reasonably fast, and it did what I needed it to do.  I used generics so I could reuse the index on multiple fields, and I actually added two indexes: one for the date, and one for the stock ticker field.  Here's my index in almost its initial incarnation.
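(The snippet below is a reconstruction of the idea rather than the exact amandadb source; the type and member names are illustrative.)

    using System;
    using System.Collections.Generic;

    // A generic index: a sorted dictionary mapping a key (a date, a ticker
    // symbol, etc.) to the offsets of matching records in the data file.
    // The whole structure gets serialized to disk alongside the data.
    [Serializable]
    public class AmandaIndex<TKey> where TKey : IComparable<TKey>
    {
        private readonly SortedDictionary<TKey, List<long>> _entries =
            new SortedDictionary<TKey, List<long>>();

        // Record the offset of a stored record under the given key.
        public void Add(TKey key, long recordOffset)
        {
            List<long> offsets;
            if (!_entries.TryGetValue(key, out offsets))
            {
                offsets = new List<long>();
                _entries[key] = offsets;
            }
            offsets.Add(recordOffset);
        }

        // Return the offsets of all records whose keys fall in [low, high].
        public IEnumerable<long> Range(TKey low, TKey high)
        {
            foreach (var pair in _entries)
            {
                if (pair.Key.CompareTo(low) < 0) continue;
                if (pair.Key.CompareTo(high) > 0) yield break;
                foreach (var offset in pair.Value) yield return offset;
            }
        }
    }

One index like this gets built per field I want to search on, which is where the generics pay off.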

Lesson 3: Race conditions abound...

I did one other thing at this point.  Whereas before I just dropped each ticker into its own file, I decided to compile multiple tickers into a single file, so that when searching, I can load up a bunch of records at once.  This introduced a race condition that I discovered while running automated tests.  Because VSTest runs tests in parallel where possible, I kept getting IO exceptions when writing to the ticker file.  This surprised me, because I wasn't planning on supporting multi-writer scenarios at that point, but I had to fix them anyway.

That's when I introduced file locking, which might be considered the rough equivalent of write locking in the database world.  When writing to the same file, I lock the file, then write, then unlock it.  Even more, I realized that, through the practical evolution of this project, I was writing a database.
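A rough sketch of what that looks like: open the shared ticker file with an exclusive lock, and back off briefly and retry if another writer already holds it.  (The retry policy and names here are illustrative, not the actual amandadb code.)

    using System;
    using System.IO;
    using System.Threading;

    public static class TickerFileWriter
    {
        // Append a serialized record to a shared ticker file, holding an
        // exclusive lock on the file for the duration of the write.
        public static void AppendRecord(string path, byte[] recordBytes)
        {
            const int maxAttempts = 10;
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    // FileShare.None refuses other readers/writers while we hold the handle.
                    using (var stream = new FileStream(
                        path, FileMode.Append, FileAccess.Write, FileShare.None))
                    {
                        stream.Write(recordBytes, 0, recordBytes.Length);
                    }
                    return;
                }
                catch (IOException) when (attempt < maxAttempts)
                {
                    // Another writer has the file locked; wait a bit and retry.
                    Thread.Sleep(50);
                }
            }
        }
    }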

Then I discovered value investing.

Lesson 4: Requirements sometimes change...

I read the book The Education of a Value Investor, and it convinced me that applying machine learning to the stock market, though interesting, is basically speculation, and too much speculation makes it difficult for companies with real value to make that value available for sale.  That meant I no longer had the initial requirement of limiting records to stock tickers only.  It also meant that I was in danger of turning my database into vaporware.  I thought for a moment, pivoted a little bit, and considered what I had.

A database system that serialized strongly-typed C# POCOs to persistent storage, and a consistent method for retrieving them, using indexes.  If I could get the access times down from 7ms, then I could have a very useful tool on my hands, one that I could actually use in production situations. This tool would fit in that space where we need persistence, but not a full-fledged database server.  If only I could get those access times down....

Lesson 5:  Memory is faster than disk...

So remember those automated tests that I told you about?  To get those to run in a reasonable amount of time, I had essentially mocked out a file system (IAmandaFile, IAmandaDirectory) and created in-memory versions of these.  I had already mentally changed the requirements to compete with SQL Server access times, so I wanted to be able to store 8KB of data really quickly (< 7ms).  For small records, I was already there.  For large records, not so much: I was sometimes seeing access times of 150ms per record.
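The real interfaces have a bit more to them, but the shape of the abstraction is roughly this (the members below are illustrative, not the actual amandadb definitions):

    using System.Collections.Generic;
    using System.IO;

    public interface IAmandaFile
    {
        byte[] ReadAllBytes();
        void WriteAllBytes(byte[] data);
    }

    public interface IAmandaDirectory
    {
        IAmandaFile GetOrCreateFile(string name);
    }

    // Disk-backed implementation, used by the real read/write paths.
    public class DiskFile : IAmandaFile
    {
        private readonly string _path;
        public DiskFile(string path) { _path = path; }
        public byte[] ReadAllBytes() { return File.ReadAllBytes(_path); }
        public void WriteAllBytes(byte[] data) { File.WriteAllBytes(_path, data); }
    }

    // In-memory implementation: fast automated tests now, and the seed of the cache later.
    public class InMemoryFile : IAmandaFile
    {
        private byte[] _contents = new byte[0];
        public byte[] ReadAllBytes() { return _contents; }
        public void WriteAllBytes(byte[] data) { _contents = data; }
    }

    public class InMemoryDirectory : IAmandaDirectory
    {
        private readonly Dictionary<string, IAmandaFile> _files =
            new Dictionary<string, IAmandaFile>();

        public IAmandaFile GetOrCreateFile(string name)
        {
            IAmandaFile file;
            if (!_files.TryGetValue(name, out file))
            {
                file = new InMemoryFile();
                _files[name] = file;
            }
            return file;
        }
    }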

Combining all of the above, the fix was simple: formalize my in-memory access layer and use that, instead of writing to and reading from files all the time.  I was able to change the record access over pretty quickly (since I basically already had the objects necessary).  This reduced my access time to sub-millisecond, but introduced the requirement to synchronize to persistent storage later.  This seemed familiar, so I looked it up: I had basically migrated to a write-back (or write-behind) cache.  This is common in many databases, and it makes sense if you think about it, because there's no way to get really, really fast database access times straight off of a disk.  It also highlights the benefit of a persistent flash medium instead of spinning disk, if you can afford it.
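Here's a minimal sketch of the idea: writes land in memory immediately, reads are served from memory, and a background timer flushes dirty entries to persistent storage.  (The names and the flush policy are illustrative; the actual amandadb cache is still a work in progress, as I mention below.)

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // A minimal write-behind cache sketch. Put() is memory-only and fast;
    // a background timer later persists whatever has been written.
    public class WriteBehindCache<TKey, TValue> : IDisposable
    {
        private readonly ConcurrentDictionary<TKey, TValue> _cache =
            new ConcurrentDictionary<TKey, TValue>();
        private readonly ConcurrentQueue<TKey> _dirty = new ConcurrentQueue<TKey>();
        private readonly Action<TKey, TValue> _persist;   // e.g., append to a ticker file
        private readonly Timer _flushTimer;

        public WriteBehindCache(Action<TKey, TValue> persist, TimeSpan flushInterval)
        {
            _persist = persist;
            _flushTimer = new Timer(_ => Flush(), null, flushInterval, flushInterval);
        }

        // Sub-millisecond write: memory only, persistence deferred.
        public void Put(TKey key, TValue value)
        {
            _cache[key] = value;
            _dirty.Enqueue(key);
        }

        public bool TryGet(TKey key, out TValue value)
        {
            return _cache.TryGetValue(key, out value);
        }

        // Drain the dirty queue and write each entry through to storage.
        public void Flush()
        {
            TKey key;
            while (_dirty.TryDequeue(out key))
            {
                TValue value;
                if (_cache.TryGetValue(key, out value))
                    _persist(key, value);
            }
        }

        public void Dispose()
        {
            _flushTimer.Dispose();
            Flush();   // final synchronization to persistent storage
        }
    }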

Conclusion

So this is pretty much where I'm at.  I'm about halfway through the write-behind cache, and haven't yet attacked the issue of synchronization in any real way.  I've left out some things I did around record serialization to avoid seek time in files, and a significant change to my new file-based indexing scheme that allows the indexes to take advantage of the in-memory objects as well.  If this project interests you, please feel free to contribute.  At this point, I'm planning to get the write-behind cache fully operational, and then there's a choice to make about next steps.  Being one person, I can either add a service to accept multiple simultaneous writes, or add updating capability; there is literally plenty of work to do.

I think it's neat that I arrived at this point from a very practical beginning, kind of retracing the overall evolution of databases in a weird way.  There's more to learn here, too, and I'm almost caught up.  I'm planning on using a suffix-tree indexing scheme for strings, which should be really interesting, if it works out.

Friday, August 26, 2016

Grow Your Career with Higher Quality Code

My brother, John, has a degree in Archeology from Texas A&M University.  What this means is that while I was learning about bits, decision trees, and von Neumann machines, he was travelling the world: working in Guatemala with the Peace Corps and in Africa attached to AmeriCorps, taking the time to become fluent in Spanish, and deciphering hieroglyphics on old pyramids.  In other words, he's the cool one, and I'm, well, a geek.  But we have been on many trips together, and (trust me, this is relevant) one such trip was to Italy around six years ago.

I distinctly recall being hurled about in a tiny taxi roughly the size of my highly fuel-efficient Prius, hurtling through the streets toward certain doom (en route to our hotel).  John and the driver were frantically having a conversation in a language I didn't understand, their hands undulating wildly while pointing at various signs I couldn't read.  I realized they were arguing about which was the best route to travel when the automobile pivoted into a sharp right, barely missing oncoming traffic.

When we came to a stop, thankfully at our hotel and not in the grill of a semi-truck, I tipped the cab driver generously and asked my brother when he had learned to speak Italian.  To my surprise, he responded that he didn't really know Italian, but that Spanish was 'close enough' to the dialect of Italian the driver spoke that they were able to communicate nonetheless.  That's not at all the impression I got at the time.  Thinking back, I wonder why they didn't simply both speak English (I know the driver had greeted us in English at the airport) rather than attempting to communicate in a complex mixture of two different languages and manual gestures.

It has since occurred to me that coding is a lot like verbal communication.  The language used is similar to a cultural vernacular, and comments are analogous to those hand-wavy exclamations used to get certain points across which aren't clear from the language itself.  Sometimes, it's enough to get a feel for the code and interpret a few gestures, but usually, actually speaking the same language (or dialect) is preferable.  Also, there are some things you just can't say in one language or dialect versus another.  In code, this might be analogous to why someone decided to use nodejs for building a service, instead of WCF (or vice versa).

I'll come back to analogies with verbal communication throughout this talk to try to influence your perspective, as I attempt to demonstrate that code quality matters.  However, I will always defer to the number one rule: get code to work.  But once code works, what next?  I will explain why I (and other managers I know) pay attention to things like code quality when reviewing code, and what different transgressions in style or pattern mean to me, as a manager of engineers.

My talk on this subject is roughly an hour long.  Here are the high points of what I'm going to be discussing.

  1. Intro
  2. Code is Communication (with examples)
  3. Code is Design (with examples)
  4. Code is your Career
  5. Questions

After the talk, I'll be posting my slide deck here with examples and explanations for your reference :).  My goal is to help improve your coding career by giving you a perspective on what your coding choices, style, and design say about your development as an engineer.

My session will be on Saturday, 9/10, from 3:30 to 4:30.  I'll be at the camp most of the day, so if you see me, drop by and say hello!

Sunday, June 26, 2016

Chuck Sweet's Top 5 Rules for Software Engineering

Recently, I've been thinking about how to keep myself out of trouble with regard to software engineering.  I'm a huge fan of greedy algorithms (simple, straightforward) and, relatedly, of the KISS principle.  Don't get me wrong, I can write some crazy complex software, but I'm past the point in my career where I feel like the complexity of the code I write has any relationship to my standing as a software engineer, except that perhaps an inverse relationship exists.

Anyway, while thinking about how I approach software engineering, I've arrived at these rules, which I apply every time I write code.

Rule 0: Performance is king.
If I don't write performant software in the product I'm supporting, someone in a competing product will.  Features matter, but if folks can't actually use my product because it performs too slowly, then eventually my company will lose out to the ones whose tools they can use.  Say it with me: "Performance Is King".

Rule 1: Existing patterns should not be changed for light or transient causes.  
In other words, if I'm new to a project and I don't like the patterns, I don't change them.  In fact, my new code should look remarkably similar to the existing code I find.  This is to combat my tendency to rewrite more than I need to.

Rule 2: Design twice, code once.  
This is the rule which keeps good design firmly in place, regardless of development methodology.  It doesn't matter if we choose agile, extreme programming, six sigma, or whatever fancy new methodology hasn't been invented yet.  If our DNA makes us design twice and code once, then we can avoid the problems I describe here.

Rule 3: Don't trust your own code, ever.  
I could have named this rule 'test, test, test'.  It never fails that I'll get my code to a state where I feel like I'm ready to ship, having dotted my 'i's and crossed my 't's.  Then I'll write a test (if I didn't start with a test, as I sometimes do), and the code immediately explodes.  I once got a multi-threading client to this state, and writing and running tests proved that, coded the way it was, I couldn't kill the threads once I'd spawned them.  It was simple to fix, but it wasn't lost on me that, in production, I would have toppled servers, with no recovery except restarting them, and potentially huge customer implications.
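As an illustration of the kind of fix (not the actual client code): with cooperative cancellation, a test can actually prove that the worker stops when asked.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    // Illustrative only: a worker loop that honors a CancellationToken,
    // so a test can prove the background work actually stops on request.
    public class PollingWorker
    {
        public Task RunAsync(CancellationToken token)
        {
            return Task.Run(() =>
            {
                while (!token.IsCancellationRequested)
                {
                    // ... do one unit of work (e.g., poll a ticker) ...
                    Thread.Sleep(100);
                }
            });
        }
    }

    // The test that would have caught my original bug: cancel, then assert the task finishes.
    public static class PollingWorkerTest
    {
        public static void Worker_Stops_When_Cancelled()
        {
            var cts = new CancellationTokenSource();
            var task = new PollingWorker().RunAsync(cts.Token);

            cts.Cancel();
            bool stopped = task.Wait(TimeSpan.FromSeconds(5));

            if (!stopped) throw new Exception("Worker did not stop after cancellation.");
        }
    }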

Rule 4: Code simpler.  
This requires a little bit of explanation.  If you're like me and just wrote a bunch of code, there's a possibility that it's over-engineered.  My next step these days is to consciously review my code and see what I can remove, looking especially for code I can't really justify.  This dovetails nicely with the extreme programming YAGNI principle, and it's a reminder that when I think the code is simple enough, I should make it simpler still.  As a quick aside, patterns are good; I've mentioned this before.  The problem with patterns is that they're often pretty robust and offer a lot of extra stuff 'just in case'.  That's another place this rule helps.  Maybe that extra interface isn't necessary: would a regular class work?
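A contrived illustration of the kind of thing I look for during that review:

    // Over-engineered "just in case": an interface, a factory, and a single
    // implementation, none of which any current requirement actually calls for.
    public interface IGreetingProvider { string GetGreeting(string name); }

    public class GreetingProviderFactory
    {
        public static IGreetingProvider Create() { return new DefaultGreetingProvider(); }
    }

    public class DefaultGreetingProvider : IGreetingProvider
    {
        public string GetGreeting(string name) { return "Hello, " + name; }
    }

    // Simpler: one regular class does the job until a second implementation actually exists.
    public class Greeter
    {
        public string GetGreeting(string name) { return "Hello, " + name; }
    }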

Rule 5: Trust other engineers.  
This keeps me out of the NIH (not invented here) trap.  It also means that if I'm writing a class and have a property (or method), I default to public accessibility unless there's a solid and compelling reason to be less accessible.  This rule reminds me that even though I've been coding for 20+ years, it's possible that I don't know everything.  It's why I attend meetups like the Seattle Scalability Meetup and other events: to chat with other software engineers and keep an eye on the evolution of software engineering.

These rules are designed to catch personality hangups I have seen in myself, and so may seem mundane to you if you don't have the same personality snafus.  I write them down here just in case someone else has the same tendencies and will get some benefit.  Happy coding!

Tuesday, May 17, 2016

This Is Not The [Evolutionary] Design You're Looking For

Let me caveat this by saying that I'm a fan of some of the more recent development trends like TDD and Agile, and of what these changes have done for software engineering generally.  As a software engineer, I know first-hand the temptation to code for every eventuality, especially when they seem so obvious sometimes.  These types of approaches make sure that the developer stops developing at some point by providing a terminating point.  In TDD, you start with a degenerate unit test and code until the test passes.  Then you incrementally add to the tests until they match what's in the requirement (and no more).  Once that final test passes, you stop writing code.  In Agile, the team agrees on what something called the MVP (minimum viable product) is, and commits to delivery in a sprint or two.  That time-boxes everything so that there is literally no time to do much in the way of over-engineering.
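For example, the first degenerate test might be nothing more than this (illustrative MSTest code, not from any particular project):

    using System.Globalization;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    [TestClass]
    public class PriceParserTests
    {
        // The degenerate starting point: the class exists and can be constructed.
        [TestMethod]
        public void Parser_Can_Be_Constructed()
        {
            Assert.IsNotNull(new PriceParser());
        }

        // The next increment pins one piece of the actual requirement, and no more.
        [TestMethod]
        public void Parser_Reads_A_Single_Close_Price()
        {
            Assert.AreEqual(57.43m, new PriceParser().ParseClose("MSFT,57.43"));
        }
    }

    // Just enough production code to make the tests above pass.
    public class PriceParser
    {
        public decimal ParseClose(string csvLine)
        {
            return decimal.Parse(csvLine.Split(',')[1], CultureInfo.InvariantCulture);
        }
    }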

In spite of the massive cost-saving benefit of making sure that we software engineers don't stray off the beaten path, there's a downside, and it is this: when are you supposed to do system design?  I would suggest that with competing priorities, evolutionary designs simply don't work in an organization.  This shouldn't be a surprise, as Martin Fowler suggests that evolutionary design, as done in practice, is actually done in a way that leads to 'chaos' [Fowler, Is Design Dead].  I've seen this play out in such a way that any attempts at refactoring were actively discouraged by the senior engineers and developers in an organization!  When that happens, any chance of a reasonable design evolving is definitely dead.

Assuming that's a worst-case situation, what actually happens in real-world software engineering?  A product owner or some customer advocate presents a set of requirements, which the software engineering team is then responsible for delivering.

In the best case, the team can do an estimate, and then propose the estimate to the product owner.  Let's be realistic and assume that the estimate is actually an under-estimate by, say, 50% (that is to say, the actual work will take 150% of the level of effort, or LOE, suggested by the engineering team).  Simply put, the engineering team says 2 sprints; the actual LOE is 3 sprints.  At the end of those 2 sprints, there's another feature that the team is expected to start work on.  So when does this alleged design work happen?  At the end of 2 sprints, the team is realistically still finishing up the previous work, and so has no actual time to do design work.

In the worst case, the product owner has already committed a delivery timeline to the customer.  This example doesn't require a lot to show that design won't actually happen.  The engineering team is not in a position to do any real design, and has likely started out behind schedule already.

The lesson here is straightforward:  the only time we engineers have to do real design is time that we carve out for ourselves.

To conclude, there are many forces acting against thoughtful design in the newer development methodologies.  I've specifically examined Agile as it is often implemented, but you can read the same thing in Martin Fowler's discussion of XP.  What he closes with, and what I wish to reiterate, is that without the will, design will simply not happen; there are too many forces pushing in the opposite direction (probably as a reaction to the design-heavy requirements analysis of older methodologies, though I'm not going to complete that argument here).  So, as engineers, we can't assume that good design will arise from KISS, YAGNI, DRY, or any other cute phrases meant to help us simplify.  We must intentionally bake design into our DNA.


Thursday, April 28, 2016

Code Quality and the Bottom Line

I'm rather obsessive about code quality; ask anyone who knows me.  This is the culmination of my experience working with many software engineers with a wide range of development experience, from mentoring individuals fresh out of college, to prior military, to what someone once referred to as 'elite' devs.  As I've progressed through my many years of software engineering experience, it's become apparent to me that code quality has become a bit passé in many development environments.  The reasons I have heard for why certain individuals don't care about code quality are that it has minimal impact on performance, that it is difficult to prove any significant impact on the bottom line, and, slightly more abstractly, that a bad pattern followed by all developers in an organization is better than a good pattern partially implemented.  I'm starting to hate that last argument, and I'll explain why.  What I will not do here is describe what I mean by quality, since I already did that here.

Performance is king.  That is to say, as long as the code performs (to end-user expectations), it doesn't matter how good or bad the underlying code is.  I would agree with the inverse of this statement: non-performant code is bad, no matter how 'good' the underlying code is.  The difference is subtle.  The first statement suggests that there's no reason to consider code quality at all; the second suggests that it is okay to focus on code quality, as long as performance is not degraded below what the end user expects.  I would also suggest that poor code quality can itself negatively impact performance.  I have seen this, in fact [link to Emergent Patterns coming soon].

I confess that it is notoriously difficult to prove that code quality has any impact on the bottom line.  But, as we software engineers know, difficult is not the same thing as impossible [link to paper on this].  Krugle suggests, by inference, that code maintenance cost has gotten worse, not better, with frequent releases.  They state that approximately 50% of the entire software engineering effort goes to maintenance in a company with more than 100 developers and over 500k lines of code (although the studies are somewhat dated).  I wouldn't have guessed the number was that high, but Omnext thinks the number is actually upwards of 90%.  What does that mean to you?  Let's say you have 150 developers.  If half of their effort goes to maintenance, then at, say, $150,000 per developer per year fully loaded, that's about $11.25 million spent on maintenance alone.  Hmmm... I wonder what it would look like if you could turn even half of that into features?

Finally, let's talk about patterns (since patterns are necessarily an extension of code quality; in software engineering these days, patterns emerge from code more than they are dictated).  A bad pattern, then, can be thought of as a formalization of bad code quality (I don't use the term anti-pattern, because that takes the conversation in a completely different and pretty much useless direction).  So what do we then have?  I think it's obvious, but let's call it out: a highly efficient way to ensure that poor code quality is distributed throughout new code as thoroughly as it was in old code.  In other words, you magnify the problems Krugle points out in their white paper.  Let's examine this.  Do new developers spin up faster on the code base just because it's a pattern?  No; they may not even realize it's a pattern until months after they've shipped their first feature.  Is there better research and planning (because the code is easier to understand)?  Also no, because bad-quality code is still difficult to understand, even if it is repeatable.  Is there less duplication of effort?  Again no, because of the previous point.  The only dubious 'win' is consistency with coding standards and practices, which is questionable, because are bad coding standards and practices something you want to be consistent with?

I would suggest that, once performance is consistent with user expectations (read: not the best it can ever be, just consistent with user expectations of similar applications in the space), it's time to focus on code quality before committing to additional performance tweaks.  Whether it's obvious or not, the bottom line is suffering if you have poor code quality, and if you're considering punting on it because performance gains are easier (and they are, just by virtue of being easily quantified), think about what you could do with $5.625 million worth of new features.

Sunday, March 13, 2016

What is Code Quality?

There are many different definitions of code quality.  Some of the more recent definitions attempt to avoid any commitment to ideas such as loose coupling or adherence to known patterns, in favor of a roll-your-own approach (you'd be surprised, or maybe not, how many software shops adhere to not-invented-here syndrome).  These definitions only discuss code quality in terms of how it relates to fulfilling the functional requirements.  Code quality is more than that, however.  It's about S.O.L.I.D. design principles, adherence to static code analysis rules, established pattern reuse, and, as importantly, communication with other developers about precisely what a method is or does.  I would also include unit tests as part of my overall definition.  On a weighted scale, I would say it's about 30% S.O.L.I.D., 30% static code analysis rules, 30% established pattern reuse, and 10% communication with other developers.  I'll comment on each of these in turn.

Of the S.O.L.I.D. principles, I weight them differently as well.  Single-responsibility is by far the most important, followed by dependency inversion, and then interface segregation.  The open/closed principle and Liskov substitution I care less strongly about (unless I'm writing a framework, in which case I care quite a bit).  The reason single-responsibility is first on my list is that, by following this principle, unit-testable code immediately arises.  If I have a data-access-party type of middle tier, for instance [link to Emergent Patterns], then my business objects aren't just my business objects, they're also my data access objects, and who knows what else.  Testing this type of scenario means I have to write expensive integration tests (expensive in terms of developer time to find bugs).  If I have a clean division between the business and data access layers, then I can easily test my code with unit tests and mock out everything else.  This is awesome, and I basically get it for free by following single-responsibility.  My other two favorites support testability as well: dependency inversion makes it way easier to do unit testing, especially if I segregate my interfaces to support specific clients (by specific clients, I mean the classes that use implementations of those interfaces).
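A small, made-up illustration of the point (the names are hypothetical, and the mock here is hand-rolled rather than coming from a mocking framework):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Single responsibility: the business object only computes; it never touches storage.
    public interface ITickerRepository
    {
        IEnumerable<decimal> GetClosingPrices(string symbol, int days);
    }

    public class MovingAverageCalculator
    {
        private readonly ITickerRepository _repository;   // dependency inversion: injected, not newed up

        public MovingAverageCalculator(ITickerRepository repository)
        {
            _repository = repository;
        }

        public decimal Average(string symbol, int days)
        {
            return _repository.GetClosingPrices(symbol, days).Average();
        }
    }

    // In a unit test, the data access layer is mocked out entirely: no database, no files.
    public class FakeTickerRepository : ITickerRepository
    {
        public IEnumerable<decimal> GetClosingPrices(string symbol, int days)
        {
            return new[] { 10m, 20m, 30m };
        }
    }

    public static class MovingAverageCalculatorTest
    {
        public static void Average_Of_Known_Prices_Is_Correct()
        {
            var calculator = new MovingAverageCalculator(new FakeTickerRepository());
            if (calculator.Average("MSFT", 3) != 20m)
                throw new Exception("Expected the average of 10, 20, and 30 to be 20.");
        }
    }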

Why do I not favor Liskov and open/closed as much?  It's simple: they don't support testing as much.  Liskov deals strictly with object inheritance, which has more to do with swapping out code for other code which has already (hopefully) been tested, at run time.  Open/closed is also about object inheritance, and suggests that classes should be open for extension but closed for modification.  These two principles seem to contradict practices such as sealed classes (because they are closed for extension), and some simple things, like adding logging to a subclass, may be considered to violate Liskov (if you blur your eyes and tilt your head).  In a word, they're 'squishy'.

With regard to static code analysis, the list of rules I favor is too long to mention here.  Suffice it to say that I believe experienced (read: senior+) software engineers should be able to agree on a list of rules which a particular software shop adheres to.  This is not the same as suggesting that rules should rise organically out of whatever makes it into functional code.  I suggest that healthy deliberation should ensue, and from this, a group of rules should arise which are then baked into the code base by modeling those rules with static code analysis tools (like ReSharper or Visual Studio).  I acknowledge that most development shops will not go through the exercise of determining a set of rules which makes sense for their own application of software engineering.  That doesn't make it any less important for teams at least to come to some sort of coding standard which makes sense in their problem domain.

Patterns (such as exist in the Gang of Four) exist for a reason - use them.  Don't invent a custom pattern if one exists which already fulfills the need.  This is not the same thing as suggesting that every pattern you see in existing software is a good one - it must fulfill the need.  Ideally, the pattern originates from GoF or in a Security Patterns or Enterprise Patterns text, because those patterns have been distilled from an eternally growing list of patterns, and have been proven to work time and time again.  Emergent patterns originating from your own software organization are not as fortunate.  That said, if a pattern doesn't suit your needs, don't hack it up so that it does.  Some software engineering tasks don't readily lend themselves to patterns.  But...if you don't take the time to look, you won't know.  I keep a copy of GoF and Security Patterns by my desk for ease of reference.

Finally, communication with other developers is key in software engineering organizations, many of which now have hundreds if not thousands of engineers.  By communication, I mean that I should not have to drill into a method to know what it does; it should be implicit in the name, and my assumption about what the name indicates should be generally correct.  I should not have to find all references of a variable to figure out what it is for (or depend on an IDE to do it for me; it should be obvious from a concise name).  I should not have to worry about side effects which are not called out readily in method comments... the list goes on and on.  A lot of this could be mitigated by the above paragraphs, but if not, then think of writing code as a three-way conversation: one between you and the computer, and one between you and another developer.  Both of them should easily know what your code is meant to do without having to ask you or run it.  Robert C. Martin has a lot to say on this topic in his book Clean Code, which sits on my personal shelf as a reference.
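A trivial, made-up example of the difference:

    public class OrderCalculator
    {
        private decimal _lastResult;

        // Hard on the reader: the name says nothing, and the side effect is
        // invisible from the call site.
        public decimal Calc(decimal a, decimal b)
        {
            _lastResult = a * b;   // surprise: mutates state
            return _lastResult;
        }

        // Says what it does, does what it says, and nothing else.
        public decimal MultiplyPriceByQuantity(decimal price, decimal quantity)
        {
            return price * quantity;
        }
    }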

So when I talk about poor code quality, I'm talking about code which does not consider the above very well.  When I talk about good code quality, I'm talking about code which is disciplined about this kind of 'stuff'.  I believe this is general enough to extend to casual conversations about code quality (i.e., that lots of folks think along these lines when the squishy topic of 'code quality' arises), and I use it this way often in my blog posts.