2010 08 18
Feature Cast for 2010-08-18
(00:17) Intro
- CopyNight DC is coming up
- http://copynightdc.wordpress.com/2010/08/18/copynight-will-be-tuesday-824-at-630pm-at-teaism/
- Another reminder about my trip to Dragon*Con
- http://dragoncon.org
(01:58) Inner Chapter: Scalability
- In recent years, I spend an increasing amount of my time on scalability
- Building features is pretty straightforward
- Usually the hardest part is figuring out what, exactly, is needed
- This is consistent with my previous discussion about program speed and performance
- http://thecommandline.net/2010/01/06/speed_performance/
- The only real way to address both performance and scalability
- Is through iterative experiments to identify problems and test fixes
- Even the most complicated feature only needs to be built once
- Sometimes it takes multiple iterations to get it right, of course
- I've had good success with my current employer
- Using an agile methodology to better cope with unclear requirements
- I worked on a quota system for our software as a service
- When storage is costly, being able to keep any one tenant
- From taking more than their fair share is critical
- Unfortunately, no one had a good sense for what was fair to customers
- And sustainable for the business in terms of the costs of resources
- There was not a lot of data available to inform guesses
- The version of the product that could cost the business for disk storage & bandwidth
- Is still very new, with only a handful of early adopters using it
- No one was confident that the numbers we could crunch for these users
- Were really representative of the majority of customers who will upgrade in coming months
- Each time I delivered an iteration on the last thinking
- We got closer, with better feedback
- Even though it took three attempts to dial in the feature
- Once it was finished, it specifically didn't need any more input
- Scalability defies the idea of being able to solve it as a problem once
- New features change how a user base stresses an application
- Dealing with the issues that are unique to applications at scale
- Is often the price of building a successful offering
- It also is the kind of problem you want to try to be ahead of
- The skill and knowledge required usually take you
- Deeper into your application platform of choice
- To build features, you need a sense of capabilities of your platform
- In the sense of what is possible and what APIs to use
- For instance, how you store and retrieve data
- Or how you make a request from some other server
- That functional, working knowledge is not enough for scaling an application
- You have to start to understand what goes on under the hood
- Doing so will reveal what resources are shared and how
- And other complexities that may arise
- When you have many, many users running through your code at once
- I talked about performance, but scalability is different
- http://thecommandline.net/2010/01/06/speed_performance/
- The problems and techniques I discussed
- Are applicable but assume a single thread of execution
- Usually that means a single user
- And the problems don't vary for a single user on a system
- Or that user with some amount of activity by other users
- Scalability is all about behavior with increasingly large numbers of users
- Against constrained resources, whether those are managed as is
- Or using some design to stretch them further like resource pools
- It is an unavoidable aspect of most modern software
- As more and more of it is increasingly exposed through the web
- For traditional performance, problems are somewhat predictable
- That is why I spent some time on algorithmic complexity
- Cost of operations for algorithms is ideally a function of input size alone
- That is to say, for a specific algorithm, if you know how much data flows in
- You can figure out how long it will run, how much CPU it will utilize
- And the amount of memory and disk storage it will need
- The calculation may be non-trivial, but it is possible
- And even close approximations are useful in terms
- Of gauging the relative efficiency of comparable algorithms
- Multi-user systems certainly are informed by orders of algorithms
- There are many shared resources though that are much harder to characterize
- Large scale server applications are inherently parallel
- New problems start to arise due to the timing of more than one user
- Trying to access some limited resource
- In an ideal environment, you would scale up all of the components
- So that you could treat each user as the only one on the server
- That simply isn't practical
- First, users may not make consistent use of, say, a database connection
- They may pause to look over some data or for any other reason
- During which time a connection would be idle and wasted
- Part of scaling, then, is to figure out how to better match
- What is available to actual demand under normal usage
- What makes this more difficult is normal usage is anything but normal
- More users almost always means a different load demand
- Not just more of what you've seen
- I'll give you an example that I think illustrates this from my own experience
- I worked on a real time auction system
- Where performance during an event was acceptable for about ten active bidders
- More than that and the system unaccountably started to slow down
- I was able to run tests and gather data that ruled out the database and the front end code
- After some thought, it turned out to be some session data
- We had a heavy weight authentication certificate
- Carrying the identity of the logged in user and some related data
- It was naively configured as a remotable object
- For ten or so users, the cost of making remote calls between the web and application servers
- Wasn't even noticeable
- Beyond that and the calls got slower and the rate of slowdown accelerated very rapidly
- One solution would have been to tune the remote connection
- Perhaps increasing the file handles on either end
- Or beefing up memory
- In this case, the data never changed once a user logged in
- Sending the entire data once to the web server made much more sense
- And broke the scalability issue so events could scale up to hundreds of users
- BREAK
- Load is difficult to simulate effectively
- Doing so naively is pretty easy
- You can easily make up some activity that sounds reasonable
- For example
- Log into a web application
- View some data
- Add or change data
- Then log out
- Small differences in the timing of users doing the same thing
- Can have a profound effect on when threads working on their behalf
- Access the same resources within the system
- Worse, as those threads get bound up on something
- It will start to vary the response times back to the user
- Which then affects the elapsed time before they start the next step
- Users are rarely so obliging as to do the same thing at the same time anyway
- Their activity is as likely to be governed by random distractions
- As accomplishing some task anticipated by designers and programmers
- If your application has any kind of notifications, like an inbox
- Or offers different ways to navigate into the same screen
- Or to accomplish the same or similar tasks
- Then your users are likely to show more random seeming activity
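- As a rough illustration of that scripted session, here is a minimal Python sketch, not anything from the show itself
- The step names, think-time range and start offsets are made-up assumptions
- Randomizing them keeps simulated users from moving in lockstep, which is closer to how real users behave

```python
import random

# Hypothetical steps of one simulated user session (names are illustrative).
STEPS = ["log in", "view data", "change data", "log out"]

def simulate_session(rng, min_think=0.5, max_think=5.0):
    """Return (step, start_time) pairs for one user, in seconds from session start.

    Think time between steps is drawn at random so that simulated users
    drift apart instead of hitting shared resources at the same instant.
    """
    clock = 0.0
    timeline = []
    for step in STEPS:
        timeline.append((step, clock))
        clock += rng.uniform(min_think, max_think)
    return timeline

def simulate_users(count, seed=0):
    """Build timelines for `count` users, each starting at a random offset."""
    rng = random.Random(seed)
    sessions = []
    for _ in range(count):
        offset = rng.uniform(0.0, 10.0)
        sessions.append([(step, offset + t) for step, t in simulate_session(rng)])
    return sessions
```

- A real load generator would issue requests at these times; the sketch only produces the schedule, and a fixed seed makes a run repeatable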
- One solution I've seen pursued is to try to replay logs
- For this to be effective, the test system must match whatever generated the logs
- Data storage solutions behave differently at different scales
- In relational databases, indexes tend to degrade as they get larger
- Even before they degrade to the point of notice
- The database server may decide to use different indexes
- Depending on how effective it predicts each may be for the actual data stored
- Setting up a realistic data store is itself a challenge, then
- The best way would be to copy a production database
- But that raises privacy concerns which are tricky to sort out
- If you anonymize the data, you need to make sure doing so
- Doesn't invalidate the logs you want to play back
- I've never worked on a project that actually managed to replay production logs
- The closest I have gotten is analyzing logs to come up with a simpler simulation
- Then hand crafting playback through some software that mimicked network devices
- I got some good results with this technique
- And was able to dial it in further by comparing the logs from the test system
- To the production logs captured when a scalability problem was experienced
- The risk of working backwards like this is that you may miss the real cause
- Reproducing the symptoms with a test environment
- Doesn't guarantee that you reproduced the same cause
- I've deployed fixes worked out from simulations
- That didn't completely solve customer complaints
- Leading me back into the simulations, usually with some further user observations for guidance
- If you think about it, it makes sense that load will also vary
- Based on the kind of application you are building
- There are even some terms to roughly group activity
- This is where you will hear discussions of read vs. write
- Or more formally of decision support vs. transaction processing respectively
- Even understanding these basic categories and where your application falls
- Can help in terms of getting even a first approximation simulation
- As well as leading you to some appropriate starting points to research solutions
- I have found very few resources to help
- Understand the kinds of problems you will encounter
- And some general strategies or ways of thinking about solutions
- To gauge whether they will improve scaling or not
- Tools exist for gauging specific resource consumption and load
- They can also be built relatively easily if you are familiar with instrumenting your code
- Such specific implements are best used when you have a good framework
- To guide your reasoning about their application
- There are increasingly people writing about page load times
- But not so much attention paid to back end resources
- The best I've been able to turn up are guides
- On very specific components like server threading and databases
- These are far from a complete application, though
- Even for the same part of a stack, like a database
- Two different projects can make very different use
- Often you have to try to generalize out some technique
- Like how to generate a query execution plan
- And read what it means with regard to what a database query is trying to do
- And what data is actually being stored
- An execution plan is a common feature in relational databases
- It just tells you how the server is going to try to find the data you want
- Usually given some costs for each of the steps
- You can use this as a guide for how to adjust your queries
- In order to reduce their cost
- Or to optimize the database itself to reduce unavoidable costs
- Such a plan is very specific to relational databases
- For document, key-value or object databases
- I don't know if there is a comparable tool
- Or you have to roll your own instrumentation
- To figure out the cause of some slow down
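- To make the execution plan idea concrete, here is a small sketch using SQLite through Python's standard sqlite3 module
- The bids table and its index are hypothetical, and SQLite is only a stand-in for whatever relational database you run
- The exact wording of the plan text varies between SQLite versions, and other databases have their own EXPLAIN syntax

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable detail of each step SQLite plans for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the readable description of the step.
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bids (id INTEGER PRIMARY KEY, bidder TEXT, amount REAL)")

sql = "SELECT amount FROM bids WHERE bidder = ?"

# Without an index on bidder, the plan should be a scan of the whole table.
before = query_plan(conn, sql, ("alice",))

# After adding an index, the plan should switch to a search using that index.
conn.execute("CREATE INDEX idx_bids_bidder ON bids (bidder)")
after = query_plan(conn, sql, ("alice",))
```

- Comparing before and after is the same loop described above: read the plan, adjust the query or the database, then read the plan again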
- Since the stack used by any two applications is going to differ
- Even when they are in the same space, like enterprise Java or LAMP
- I guess it makes sense there are few more comprehensive guides
- As much as I'd like to be able to just pick up a book from one of my favorite tech publishers
- Every time I have worked on scalability for a new project
- I have to start researching almost from scratch
- If I am very lucky, I'll have some applicable knowledge, like of relational databases
- But it will have to be adapted to fit the current software's model
- And the very unique sort of load and activity it experiences
- Experimentation may be the only effective technique
- I mean rigorous testing with good observations
- Instrument the components of your own stack and application
- If you can, get data from production
- While users are experiencing scalability issues
- That may help you find more specific resources
- In the absence of general guides for making your application more scalable
- That is how I've accumulated my knowledge of scaling relational databases
- I am hopeful the technique will help with Cassandra
- You may be able to come up with your own workarounds and solutions
- Good observation lends itself well to theories
- Any theory you can test might lead to some hack
- That alleviates resource contention or minimizes delay deeper in your code
- As with other classes of troubleshooting
- I urge documenting your findings and solutions online
- It doesn't matter if this is on your own blog
- Or a forum site like Stack Overflow
- As long as what you write up is findable by someone encountering a similar issue
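- For the advice above about instrumenting your own stack, one possible shape is a timing decorator, sketched here in Python
- The timings registry, the timed decorator and load_profile are all hypothetical names; a real system would more likely ship these numbers to a log or metrics service

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry: operation name -> list of elapsed seconds.
timings = defaultdict(list)

def timed(name):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised an exception.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("load_profile")
def load_profile(user_id):
    # Stand-in for a real data store call.
    time.sleep(0.01)
    return {"id": user_id}

def summarize(name):
    """Return (call count, slowest call in seconds) for one operation."""
    samples = timings[name]
    return len(samples), max(samples)
```

- Collecting the same measurements under light and heavy load gives you observations to compare, which is the raw material for the theories mentioned above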
- BREAK
- Adding more computer power seems to be the basic idea
- You want to start by saturating the hardware you have
- That is where load testing can help and researching tuning of your stack
- Once the simplest deployment of your application is approaching full capacity
- Usually measured in terms of CPU and memory utilization at the OS level
- You'll want to either beef up the existing server hardware
- To get a linear boost in how many more users you can handle
- Or start exploring ways to add more hardware alongside the existing servers
- You can start with simple load sharing
- This is often done by having multiple full copies of your stack
- You can put a DNS round robin in front to rotate users across the instances
- More often a smart load balancer is used
- That can use data from the running instances to send users to less loaded servers
- As well as making sure that users stick to the same servers once they are passed along
- If each user is completely independent of every other user
- This is probably all that your application needs
- More servers can be added in pretty easily and cheaply
- Unfortunately it is a rare application where each user
- Doesn't share some data with other users on the system
- Also, if each server has its own data store
- Then your load balancing has to be smart enough
- To send the same user to the same hardware every time
- Otherwise it will look like their carefully crafted, saved data is lost
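- The two load sharing styles described above can be sketched in a few lines of Python
- Both classes are illustrative assumptions, not how any particular load balancer is implemented; load here is just a count of assigned users

```python
import itertools

class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring how busy each one is."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, user):
        return next(self._cycle)

class StickyLeastLoaded:
    """Send new users to the least loaded server and keep them there."""
    def __init__(self, servers):
        self.load = {server: 0 for server in servers}
        self.assigned = {}   # user -> server chosen on first contact

    def pick(self, user):
        if user not in self.assigned:
            # Ties go to whichever server was listed first.
            server = min(self.load, key=self.load.get)
            self.load[server] += 1
            self.assigned[user] = server
        return self.assigned[user]
```

- The sticky version is what keeps a user's saved data from appearing lost when each server has its own data store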
- The next change most people make is some shared data storage
- Although that is not always the case
- If the layers of your application are very loosely coupled
- Then you can scale each of those layers separately
- It used to be very common that you'd load balance a bunch of web and application servers
- With a shared database server
- The database can then be beefed up, using a much more powerful server
- Doing so ensures everyone is using the same database
- Bouncing around to different web and app servers is much easier
- If the biggest cost is executing some business logic in the middle
- This strategy makes sense but has a drawback
- You've re-introduced a single point of failure
- Also it is harder to scale up a traditional database
- Most of the solutions I've seen for this are unreliable
- Replication, automated copying of data between servers
- Often differs in configuration and capability
- From database package to database package
- Database sharding is another solution
- Bear in mind that it is a relatively new technique in practice
- As with traditional scaling techniques, not all implementations
- Are necessarily equal, so your mileage may vary
- Like that first model where each server has its own database
- You partition user data by some application aspect
- A dead simple rule would be to take some user identity number in the database modulo the number of shards
- The schemes for breaking your data into horizontal shards
- Differ as widely as actual applications do
- The point is you then need some logic in the middle
- To send user data to the right shard and read it back from the same shard
- At the expense of some complexity in the middle
- You now have a more tunable data storage system
- You can increase the number of shards as the data grows
- To keep up with the demand for more horsepower
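- The logic in the middle for the modulo rule can be sketched in Python as follows
- ShardRouter is a hypothetical name, and plain dictionaries stand in for separate database servers

```python
class ShardRouter:
    """Route reads and writes for a user to one shard, chosen by user id modulo."""
    def __init__(self, shard_count):
        # Each dict plays the part of one database server.
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, user_id):
        """Return the index of the shard that owns this user's data."""
        return user_id % len(self.shards)

    def write(self, user_id, record):
        self.shards[self.shard_for(user_id)][user_id] = record

    def read(self, user_id):
        # Reads go back to the same shard the write went to.
        return self.shards[self.shard_for(user_id)].get(user_id)
```

- One tradeoff of a plain modulo rule: changing the shard count changes where most ids map, so adding shards means moving existing data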
- Increasingly, modern storage solutions offer sharding out of the box
- Many argue that the popular NoSQL movement is defined by starting with sharding
- It is possible to shard a relational database
- But it is difficult because these are built assuming all data is present everywhere
- Newer systems starting with the assumption that sharding is necessary
- Can better prepare application developers for the necessary tradeoffs
- Or build in compensating capabilities
- I am currently working with Cassandra which transparently shards data
- On write, it automatically figures out where to send data in the cluster
- Even if you are physically connected to a different node
- Than where your data will be stored
- It does something similar on read, forwarding a request for data
- To wherever in the cluster your data resides without bouncing your connection around
- Optimizing finding data works very, very differently in a sharding environment
- I am still learning exactly how to do this
- But roughly you don't want to be trying to read related data
- From disparate nodes all over your cluster
- More thought is required in designing your storage and your indexes for finding data
- Caching is often employed to add virtual horse power
- The idea is that going all the way to disk somewhere is probably expensive
- If you have a simple scheme for looking up known data
- Then you can just hang onto that data in memory for a while
- Many systems already include caching, like Enterprise Java and Django
- Even web servers like Apache and Nginx can be configured to cache
- Data stores, both relational and non-relational, usually rely heavily on caching
- The way that is most evident is when tuning
- For storage, usually all you can tweak is the size of the cache
- Caching isn't only applied to info in RAM, either
- Often the cost of fetching data from a data store and computing with it is expensive
- If the result doesn't change often, then even caching to disk can be a big speed up
- Squid and other reverse proxies are also often used towards this end on a request basis
- Whatever kinds of caches or cache components you are dealing with
- You have to have a way to invalidate entries in a cache
- Otherwise your users will start to see data
- That is increasingly inconsistent with what they are saving to the data store
- Lower level caches often tuck these details away, managing their own staleness
- Where stale data is old data, info that they can likely drop to make room in the cache
- With minimal effect on the application and what the user sees
- In a scaled up environment, caching can be a problem across servers
- If caches are siloed, then a user bouncing around through a load balancer
- Will see different data as he works
- Or may see different data than another user sitting right next to him
- There are distributed caching systems that help solve this
- Look at memcached, one of the most popular, or the commercial outfit Terracotta
- But they add a bit of complexity and overhead
- Much like sharding a data store does
- In my experience, caching helps more with saturating your existing hardware
- It uses up memory that otherwise would be unused
- To pull some load off of your data store
- In the memory hierarchy, RAM is always faster than disk
- So it makes sense to saturate RAM as much as possible for speed
- Including figuring out how to just make it more available
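- To illustrate caching with invalidation, here is a minimal single-process sketch in Python
- TTLCache and its method names are assumptions for illustration, not the API of memcached or any other product mentioned above
- Entries go stale after a fixed time, and the application can drop an entry explicitly after it writes to the data store

```python
import time

class TTLCache:
    """Tiny in-memory cache with time-based staleness and explicit invalidation."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}   # key -> (value, expiry time)

    def get(self, key, load):
        """Return the cached value, calling `load(key)` on a miss or stale entry."""
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        value = load(key)
        self._entries[key] = (value, self.clock() + self.ttl)
        return value

    def invalidate(self, key):
        """Drop an entry, typically right after writing to the data store."""
        self._entries.pop(key, None)
```

- Without the invalidate call after a write, readers keep seeing the old value until the entry expires, which is the inconsistency described above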
- BREAK
- The urge to treat data storage as zero cost is often a hurdle
- I haven't used a storage system yet that scales without some effort
- Programmers, not surprisingly, just want to throw data into a store
- Where all they need to do is get data back out by some unique identity
- That works well enough, you could just use file storage
- The problem is that retrieval by a unique key is rarely enough
- Unavoidably some feature will require a sophisticated condition to be met
- The most common form this takes is a project framework
- That includes some so called dynamic query mechanism
- The idea is less experienced developers can just add some conditions to be met
- The framework code mechanically turns that into the query logic that is actually run
- Using an approach like this robs you of the ability to optimize queries
- To be general enough for anyone to use
- The query logic it produces is of necessity very naive
- At my current job, I did considerable work to optimize our dynamic query system
- I think it falls into the same class of work
- As building optimization logic into language compilers
- You have to have a thorough and deep understanding of both
- Your inputs in the form of what conditions can be set
- And of your output, the actual query logic that will be expressed against your data store
- For SQL, I think this is a bit easier as SQL is well documented and understood
- Of course, at scale, counter intuitions start to creep into relational data
- Some of the optimization you have to be able to apply
- Needs to be statically derived and applied independently
- Such as in the forms of indexes on tables
- For non-relational data, it is more difficult, since to optimize fetches
- You often have to be able to alter the way data is stored
- By way of example, you need to be able to determine
- That some dynamic query is going to fetch data from multiple places
- Places that need to be kept together across a sharding or partitioning scheme
- I am anticipating a great deal of headache with Cassandra
- And our dynamic query system
- Either the rest of the team is going to have to learn how to optimize fetches
- Or the tool is going to get even closer to the sophistication of an optimizing compiler
- I never took any deep computer science courses so if I had my druthers
- The rest of the team would pick up some of the burden
- It will be interesting--good, bad or otherwise--to see how this particular instance winds up
- The challenge of scaling applications may explain the appeal of cloud computing
- If the simplest solution is to throw more horsepower at an application
- Then masses of commodity, virtualized servers would seem to fit the bill
- In a nutshell, that is one large appeal of the cloud
- Elastic computing platforms like Amazon's or the new OpenStack
- Ease the effort of adding in more real hardware
- Into a platform that applications can readily use right away
- Once you cross the chasm of single server to some sort of distribution
- Whether that is shared data and shared cache
- Or a full on, savvy clustering system
- Then in a cloud, you should be able to scale horizontally
- Just by spinning up and starting the meter on more nodes
- That is what really drives my interest in stories about cloud computing
- It isn't the applications, as a developer, that fascinate me
- But the possibility of evolving the systems I build
- To take advantage of seamlessly scaling hardware
- There is still quite a burden on developers to take full advantage
- There are cloud platforms that combine the programming and the scaling
- Google App Engine requires you design your data storage in specific ways
- And that you code up your application logic as stateless
- But promises you can just pay for more horsepower
- If your application follows the strictures of using the platform
- I am sure there are other examples that work the same
- Not surprisingly, I am most interested in experiments in open and portable clouds
- Heavy weight virtualization still makes the most sense
- As you can have the same stack on development workstations as in the cloud
- I wonder if it limits scalability, though, because of the overhead
- Of setting up all of the stack per instance, from OS on up
- As opposed to more app development centric offerings, like App Engine
- I am optimistic we'll see more positive experimentation
- The idea of portable packaging for server applications isn't new
- Say what you will about Java, it does a pretty good job here
- With standardized means of describing resources and deployments
- And single file archives for containing all of the pieces of an application
- That in theory can just be dropped into a compatible server
- In practice, vendors have diluted this to drive lock-in
- But there is still some good thought to mine for more experimentation
- Especially with more free and open platforms like Django and Rails
(32:25) Outro
- Contact me
- Email to feedback@thecommandline.net
- Web site at http://thecommandline.net/
- IM to command.line@skype
- Listener comment line is 240-949-2638
- http://twitter.com/cmdln
- http://identi.ca/cmdln
- I'd like to thank libsyn.com for AAC hosting and Wouter de Bie for MP3 hosting
- These notes and the show audio and music are covered by a Creative Commons license
- http://creativecommons.org/licenses/by-sa/3.0/us/
- Attribution, share alike

