Tuesday, September 27, 2016

Adventures in Databasing

AmandaDB

Some time ago (it'll be a year next January), I began working on a database called amandadb.  The goal of this database was to store daily stock values, with the intent of using machine learning techniques to extract meaning out of them.  It wasn't supposed to ever be a "real" database, but there were some practical problems I ran into along the way that pushed me in that direction.  For this post, I'm going to share with you some of the creation process, what worked, and what didn't work, to move this project along.  I'm kind of proud of where it is right now, but we'll get there.

Lesson 1: Don't be all things to all people...

As I mentioned, this was a decision I made early on.  I had a very simple goal in mind: take (specifically Yahoo) stock tickers daily, and put them in persistent storage.  Then, retrieve those data in a way that's fast and allows me to do machine learning on them.  At that point, MySQL or SQL Server seemed like overkill.  I was literally going to dump the data for each stock ticker into a file on the hard disk, no problem.  I got this working in less than a day, and then I realized something: it took a long time to read those data back.  Though I had managed to persist data, I had some serious doubts as to whether it would hold up under the weight of analysis.
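That first version was barely more than the following sketch.  The TickerRecord fields and the comma-separated file layout here are illustrative stand-ins of my own, not amandadb's actual format:

    using System;
    using System.IO;

    // Illustrative record for one daily quote; the real field set in amandadb differs.
    public class TickerRecord
    {
        public string Symbol;
        public DateTime Date;
        public decimal Open, High, Low, Close;
        public long Volume;
    }

    public static class NaiveTickerStore
    {
        // Version zero: append each quote as one line in a per-ticker file.
        public static void Append(string dataDir, TickerRecord record)
        {
            string path = Path.Combine(dataDir, record.Symbol + ".txt");
            string line = string.Join(",",
                record.Date.ToString("yyyy-MM-dd"),
                record.Open, record.High, record.Low, record.Close, record.Volume);
            File.AppendAllText(path, line + Environment.NewLine);
        }
    }

Simple, and it worked, right up until I had to read it all back.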

Let me clarify.  I wanted to slice and dice different ticker values over different ranges of time.  I was polling for the entire spread of S&P 500 stocks (so, about 500 of them).  It turned out that the Yahoo Finance API allowed free downloads multiple times a minute, up to a cap (I think 20k queries daily at inception).  This meant that I could collect more than I originally thought (remember, the plan was daily only), and I wanted to use all of those data too.  Do the math, and you can see that it adds up pretty quickly.  My plan was to aggregate a year's worth of data and use it to do some k-means, PCA, and genetic algorithms, and eventually perhaps write my first neural network against it.  Finding the specific records I wanted was going to be tough.

Just to give you an idea of what I was up against: imagine doing essentially a table scan across 20,000 records per day * 365 days per year.  That's 7,300,000 records being scanned every time I want to do some data mining.  Ridiculous.

So I hit a roadblock.  Stop?  No way, I was having way too much fun at this point.  The "don't be all things to all people" decision paid off here: limiting my interest to stock tickers at first let me get quickly to some interesting problems, ones that I thought I already knew how to solve.

Lesson 2:  When in doubt, add an index...

Or add two.  This is when I started factoring out my code into the first rev of amandadb, and it was because I wanted to add an index.  I realized at that point that I was basically starting down the path of building my own database.  It was still going to be pretty minimalist - just an index and some files - so no need to reach for one of the 3rd-party solutions yet.  For an index, I just used a sorted dictionary and serialized it to disk.  I used generics so I could re-use the index on multiple fields; I actually added two indexes, one for the date and one for the stock ticker field.  This worked well, and drastically improved my access times from something ridiculous to around 7ms per record when selecting a range of data.  So: pretty reasonably fast, and it did what I needed it to do.  Here's my index in almost its initial incarnation.
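Or at least the shape of it - the type and member names below are illustrative stand-ins rather than the exact original, but the idea is the same: a generic sorted dictionary mapping an indexed field value to the offsets of the matching records, with a method to serialize it to disk alongside the data.

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Sketch of a generic index: field value -> offsets of records containing it.
    public class FieldIndex<TKey> where TKey : IComparable<TKey>
    {
        private readonly SortedDictionary<TKey, List<long>> _entries =
            new SortedDictionary<TKey, List<long>>();

        public void Add(TKey key, long recordOffset)
        {
            List<long> offsets;
            if (!_entries.TryGetValue(key, out offsets))
            {
                offsets = new List<long>();
                _entries[key] = offsets;
            }
            offsets.Add(recordOffset);
        }

        // Range lookup: every offset whose key falls between low and high, inclusive.
        public IEnumerable<long> Find(TKey low, TKey high)
        {
            foreach (var pair in _entries)
            {
                if (pair.Key.CompareTo(low) < 0) continue;
                if (pair.Key.CompareTo(high) > 0) yield break;
                foreach (var offset in pair.Value) yield return offset;
            }
        }

        // Persist the index next to the data file (the on-disk format here is just illustrative).
        public void Save(string path, Action<BinaryWriter, TKey> writeKey)
        {
            using (var writer = new BinaryWriter(File.Create(path)))
            {
                writer.Write(_entries.Count);
                foreach (var pair in _entries)
                {
                    writeKey(writer, pair.Key);
                    writer.Write(pair.Value.Count);
                    foreach (var offset in pair.Value) writer.Write(offset);
                }
            }
        }
    }

One instance keyed on DateTime covers the date index and one keyed on string covers the ticker index, which is all I needed at this stage.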

Lesson 3: Race conditions abound...

I did one other thing at this point.  Whereas before I just dropped each ticker into its own file, I decided to compile multiple tickers into a single file, so that when searching I can load up a bunch of records at once.  This introduced a race condition that I discovered while running automated tests: because VSTest runs tests in parallel where possible, I kept getting IO exceptions when writing to the ticker file.  This surprised me, because I wasn't planning on supporting multi-writer scenarios at that point, but I had to fix them anyway.

That's when I introduced file locking.  This might be considered the rough equivalent of write locking in the database world.  When writing to the same file, I lock the file, then write, then unlock the file.  It drove home, even more, that through the practical evolution of this project I was writing a database.
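In .NET, one way to get that behavior is to open the file with exclusive sharing and retry on contention.  This is just a sketch of the pattern, not necessarily the exact mechanism amandadb uses:

    using System.IO;
    using System.Threading;

    public static class LockedWriter
    {
        // Open the shared ticker file exclusively; any other writer gets an IOException,
        // so retry briefly instead of failing the write outright.
        public static void AppendWithLock(string path, byte[] recordBytes, int maxAttempts = 10)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using (var stream = new FileStream(
                        path, FileMode.Append, FileAccess.Write, FileShare.None))
                    {
                        stream.Write(recordBytes, 0, recordBytes.Length);
                    }
                    return;
                }
                catch (IOException)
                {
                    if (attempt >= maxAttempts) throw;  // give up after repeated contention
                    Thread.Sleep(10);                   // another writer holds the file; back off
                }
            }
        }
    }

The retry loop is the important part: the second writer doesn't blow up, it just waits its turn.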

Then I discovered value investing.

Lesson 4: Requirements sometimes change...

I read the book The Education of a Value Investor, and it convinced me that applying machine learning to the stock market, though interesting, was basically speculation, and that too much speculation makes it difficult for companies with real value to make that value available for sale.  That meant I no longer had the initial requirement of limiting the records to stock tickers only.  It also meant I was in danger of turning my database into vaporware.  I thought for a moment, and pivoted a little bit.  I considered what I had.

A database system that serialized strongly-typed C# POCOs to persistent storage, and a consistent method for retrieving them, using indexes.  If I could get the access times down from 7ms, then I could have a very useful tool on my hands, one that I could actually use in production situations. This tool would fit in that space where we need persistence, but not a full-fledged database server.  If only I could get those access times down....

Lesson 5:  Memory is faster than disk...

So remember those automated tests that I told you about?  To get those to run in a reasonable amount of time, I had essentially mocked out a file system (IAmandaFile, IAmandaDirectory) and created in-memory versions of those interfaces.  I had already mentally changed the requirements to compete with SQL Server access times, so I wanted to be able to store 8KB of data really quickly (< 7ms).  For small records, I was already there.  For large records, not so much; I was sometimes seeing access times of 150ms per record.
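The abstraction looks roughly like this.  The member lists below are my own guesses at a minimal shape, not the actual interfaces in the repo, but they show the trick: once everything goes through IAmandaFile and IAmandaDirectory, swapping disk for memory is just another implementation.

    using System.Collections.Generic;

    // Illustrative stand-ins for the real IAmandaFile / IAmandaDirectory members.
    public interface IAmandaFile
    {
        string Name { get; }
        byte[] ReadAllBytes();
        void WriteAllBytes(byte[] data);
    }

    public interface IAmandaDirectory
    {
        IEnumerable<IAmandaFile> Files { get; }
        IAmandaFile GetOrCreateFile(string name);
    }

    // In-memory implementations used by the tests: no disk I/O at all.
    public class InMemoryFile : IAmandaFile
    {
        private byte[] _data = new byte[0];
        public InMemoryFile(string name) { Name = name; }
        public string Name { get; private set; }
        public byte[] ReadAllBytes() { return _data; }
        public void WriteAllBytes(byte[] data) { _data = data; }
    }

    public class InMemoryDirectory : IAmandaDirectory
    {
        private readonly Dictionary<string, IAmandaFile> _files =
            new Dictionary<string, IAmandaFile>();

        public IEnumerable<IAmandaFile> Files { get { return _files.Values; } }

        public IAmandaFile GetOrCreateFile(string name)
        {
            IAmandaFile file;
            if (!_files.TryGetValue(name, out file))
            {
                file = new InMemoryFile(name);
                _files[name] = file;
            }
            return file;
        }
    }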

Combining all of the above, the fix was simple: formalize my in-memory access layer, and use that instead of writing to and reading from files all the time.  I was able to change this for record access pretty quickly (since I already basically had the objects necessary).  This reduced my access time to sub-millisecond, but introduced the requirement to synchronize to persistent storage later.  This seemed familiar to me, so I looked it up: I had basically migrated to a write-back or write-behind cache.  This is common in many databases, and makes sense if you think about it.  There's no way to get really, really fast access times when every operation touches the disk.  It also highlights the benefit of having a persistent flash medium instead of spinning disk, if you can afford it.
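The core of the pattern is small.  Here's a rough sketch over the file abstraction from the previous section (again, illustrative names, not amandadb's actual implementation): writes land in memory and are marked dirty, reads are served from memory when possible, and a flush step synchronizes back to the persistent directory later.

    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Sketch of a write-behind cache: memory first, disk later.
    public class WriteBehindCache
    {
        private readonly IAmandaDirectory _persistent;                 // the real, on-disk directory
        private readonly Dictionary<string, byte[]> _memory =
            new Dictionary<string, byte[]>();                          // latest contents per file
        private readonly ConcurrentQueue<string> _dirty =
            new ConcurrentQueue<string>();                             // files awaiting flush
        private readonly object _gate = new object();

        public WriteBehindCache(IAmandaDirectory persistent)
        {
            _persistent = persistent;
        }

        // Fast path: write to memory only, and remember that this file needs flushing.
        public void Write(string name, byte[] data)
        {
            lock (_gate) { _memory[name] = data; }
            _dirty.Enqueue(name);
        }

        // Reads come from memory when possible, falling back to the persistent store.
        public byte[] Read(string name)
        {
            lock (_gate)
            {
                byte[] cached;
                if (_memory.TryGetValue(name, out cached)) return cached;
            }
            return _persistent.GetOrCreateFile(name).ReadAllBytes();
        }

        // Called periodically (or on shutdown) to synchronize memory back to disk.
        public void Flush()
        {
            string name;
            while (_dirty.TryDequeue(out name))
            {
                byte[] data;
                lock (_gate) { _memory.TryGetValue(name, out data); }
                if (data != null) _persistent.GetOrCreateFile(name).WriteAllBytes(data);
            }
        }
    }

The sketch punts on the hard part - deciding when to flush, and what happens if the process dies with dirty entries still queued - which is exactly the synchronization work I haven't really started yet.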

Conclusion

So this is pretty much where I'm at.  I'm about halfway through the write-behind cache, and haven't started attacking the issue of synchronization in any real way yet.  I left out some things I did around record serialization to avoid seek time in files, and a significant change to my new file-based indexing scheme that lets the indexes take advantage of the in-memory objects as well.  If this project interests you, please feel free to contribute.  At this point, I'm planning to get the write-behind cache fully operational, and then there's a choice to make about next steps.  Being one person, I can either add a service to accept multiple simultaneous writes, or I can add updating capability - there is literally plenty of work to do.

I think it's neat that I arrived at this point from a very practical beginning, kind of tracing the overall evolution of databases in a weird way.  There's more to learn here, too - and I'm almost caught up.  I'm planning on using a suffix-tree indexing scheme for strings, which should be really interesting, if it works out.
