27.9.13

How to not be a supernova in the clouds - Design time

When designing an Exalogic-based solution, it is fundamental to use redundancy as a key aspect of fault tolerance (as Exalogic itself is already assembled, with dual power sources, networking equipment and storage heads). When everything has two load-balanced pathways, we should expect that if one pathway fails, the other will have to handle double the usual load. From this point of view, either an external demand increase or a component failure will generate a sudden demand increase. Failures can originate in hardware, but also in human interventions such as rolling upgrades or patches, or someone erasing all the content of a storage volume.
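
The "double load" effect can be put into numbers. The following is only a minimal sketch of the arithmetic, assuming the load is spread evenly over the pathways; the function names are mine and do not come from any Exalogic tool.

```python
def load_after_failure(utilization, paths=2, failed=1):
    """Per-path utilization once `failed` of `paths` load-balanced paths are down.

    Assumes the load is spread evenly over the surviving paths.
    """
    surviving = paths - failed
    if surviving <= 0:
        raise ValueError("no surviving path left to carry the load")
    return utilization * paths / surviving


def max_safe_utilization(paths=2, failed=1, ceiling=1.0):
    """Highest normal per-path utilization that stays under `ceiling` after a failure."""
    surviving = paths - failed
    if surviving <= 0:
        raise ValueError("no surviving path left to carry the load")
    return ceiling * surviving / paths


if __name__ == "__main__":
    # Two paths at 40% each: the survivor has to carry both shares.
    print(load_after_failure(0.40))
    # With two paths, normal utilization has to stay at or below this value.
    print(max_safe_utilization())
```

With only two pathways, each one has to stay at or below half of its capacity in normal operation if the survivor is expected to absorb a failure.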

The first thing that should be defined is the isolation and availability of applications. In terms of isolation we can consider three main patterns:

  • maximal: separate virtual datacenter accounts, servers, networks and storage volumes
  • medium: shared virtual datacenter account and networks, but separate servers and storage volumes
  • low: shared accounts, servers, networks and storage volumes
Of course, we can define more refined combinations depending on what we need; the Exalogic Elastic Cloud Software is very flexible when defining these kinds of combinations.

From the point of view of availability, we can consider some main evaluation points, which will help us understand how to design the environment inside Exalogic. All the steps defined here can also be applied to an OVM installation or even to other virtualization software, although they may need to be adapted to the technology in use.

 The evaluation points:

1-Scaling patterns:

When we start preparing an elastic environment, it is very important to consider how much time we need to create a new resource and what downtime this allocation process can cause (the environment is overloaded and the resource is not yet available). To address this situation, I defined two types of scaling, to be used together to reduce the impact of a demand increase. The two pattern types are:
  • Quick scaling
    • a small amount of allocated resources that are unnecessary in normal situations
      • those resources are expensive (we don't want to waste them)
      • we always have some margin
    • in WebLogic, it permits creating ManagedServers without creating vServers
    • very fast start procedure
      • provides the resources necessary to sustain the environment while the slow scaling resources are being allocated
      • demands minimal configuration changes and almost no human intervention
      • normally no need to change the vServers' geometry (CPU, RAM, ...)
  • Slow scaling
    • no allocated resources
      • the resources are not used
      • the resources to be used can be de-allocated from other (less important) applications
    • slow (and maybe complex) start procedure
      • may involve external resources like DNS, firewalls or load balancers
      • may need more intense human intervention
The idea is to always have two sets of scaling plans: one for immediate reaction (and for the more common situations) and the other for more durable reconfigurations. These configuration changes can be durable or transient, so it is very important to have ways to measure whether the allocated resources are enough or whether it is necessary to allocate or release resources.
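
How the two plans combine can be sketched in a few lines. This is only an illustration under my own assumptions: capacity is counted in managed server "slots", the spare slots are the pre-allocated quick scaling margin, and `plan_scaling` with its parameters is a hypothetical name, not part of any Oracle tool.

```python
def plan_scaling(extra_needed, spare_slots, margin=1):
    """Split a demand increase between quick and slow scaling.

    extra_needed: additional managed servers required right now
    spare_slots:  slots kept free inside existing vServers (quick scaling)
    margin:       free slots we want to have again after the reconfiguration
    """
    if extra_needed < 0 or spare_slots < 0 or margin < 0:
        raise ValueError("all arguments must be non-negative")
    # Quick scaling: start managed servers in the slots that already exist.
    quick = min(extra_needed, spare_slots)
    remaining_spare = spare_slots - quick
    # Slow scaling: cover what quick scaling could not, then rebuild the margin.
    slow = (extra_needed - quick) + max(0, margin - remaining_spare)
    return {"quick": quick, "slow": slow}


if __name__ == "__main__":
    # Three more servers needed, two spare slots: two start immediately,
    # one plus one slot of margin come from slow scaling.
    print(plan_scaling(extra_needed=3, spare_slots=2, margin=1))
```

The point of the sketch is that quick scaling buys time, while slow scaling both covers the rest of the demand and restores the margin for the next peak.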

2-In servers configuration

The power of Exalogic Elastic Cloud comes from understanding the philosophy of the product well. Some good practices (already defined in the EECS documentation) can make administering the environments much easier. Some good practices:
  • distribution groups
    • one per service vServer set
    • if we defined 4 vServers for the WLS ManagedServers of our application but are running on a 1/4 rack Exalogic, it is a good idea to define one distribution group for this application's ManagedServers with a maximum size of 8. When scaling this set of vServers, we can then count on up to 1 vServer per compute node.
    • when reaching the limit of compute nodes, it is a good idea to start thinking about creating bigger vServers (in CPU and RAM).
  • spare memory for more ManagedServers inside vServers
    • allows quick scaling to support slow scaling
    • consider wasting some memory for peak moments; it may help you survive while scaling.
  • no software installed on the vServer's internal disk
    • Exalogic still does not allow changing the CPU and RAM of a vServer; when those changes are needed, the vServer must be recreated. If there is software on the vServer's internal disk, recreating the vServer can become a nightmare (mainly when doing it under pressure).
  • take control of singleton components
    • singletons normally mean a single point of failure, which should be avoided when possible and automated when not.
  • optimize your templates as much as possible (I will talk about using iaas to build templates in another post)
    • avoid manual tasks when commissioning new servers; they can lead to a very long scaling process.
  • automate server creation as much as possible using the iaas CLI
    • avoids human errors in server creation
    • allocates new servers quickly
    • makes it possible to create scripts that automatically calculate server names and IP addresses.
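
As an illustration of the last point, here is a minimal sketch of a script that calculates server names and IP addresses. The naming scheme `<app>-<role>-<NN>`, the host offset and the example network are hypothetical conventions of mine; the output is meant to feed whatever creation command you use, and no iaas CLI call is shown here.

```python
import ipaddress


def plan_vservers(app, role, count, network, first_host=10, start_index=1):
    """Return (name, ip) pairs for `count` new vServers.

    The <app>-<role>-<NN> naming scheme and the first_host offset are
    hypothetical conventions, not something Exalogic imposes.
    """
    net = ipaddress.ip_network(network)
    last_offset = first_host + count - 1
    # Offset 0 is the network address, the last offset is the broadcast address.
    if first_host < 1 or last_offset >= net.num_addresses - 1:
        raise ValueError(
            f"{count} vServers do not fit in {net} from offset {first_host}"
        )
    plan = []
    for i in range(count):
        name = f"{app}-{role}-{start_index + i:02d}"
        plan.append((name, str(net[first_host + i])))
    return plan


if __name__ == "__main__":
    for name, ip in plan_vservers("shop", "wls", 4, "192.168.10.0/24"):
        print(name, ip)
```

Because the names and addresses are derived from an index, growing the set later only means calling the function with a higher `start_index` and `first_host`.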
3-In network configuration
  • Start with a default pattern for networks
    • service
      • exposure of the application, normally through OTD or OHS
      • normally a datacenter network segment, with IP addresses visible outside Exalogic.
      • normally supports HTTP
    • web
      • web traffic between OTD and OHS
      • serves dynamic content and static content
      • normally an IPoIB network, used only by vServers.
      • normally supports HTTP
    • application
      • application traffic between OTD or OHS and WLS
      • is the HTTP channel for WebLogic managed servers
      • serves dynamic content generated by the managed servers
      • normally supports HTTP
    • cluster
      • cluster broadcasting for WLS managed servers
      • session replication for WLS managed servers (maybe over SDP)
      • intercommunication network for Coherence nodes
      • for a single-Exalogic topology, used as the default channel
    • management
      • HTTP channel for AdminServers
      • vServers only accept ssh from this network
      • DNS and LDAP/NIS can be configured to send their traffic over this network
  • check where you can use SDP (currently only session replication and JDBC)
    • be careful with SDP bugs
  • design realistic network masks
    • considering the distribution group sizes and the different sets of servers that will be connected to this network, it is easy to determine the maximum number of vServers connected to it.
  • automate network creation as much as possible using the iaas CLI
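
For the network mask, the calculation from distribution group sizes can be sketched as follows. This is a plain IPv4 subnet calculation under my own assumptions (one address per vServer, plus the network and broadcast addresses); the function names are hypothetical.

```python
def smallest_prefix(group_max_sizes, reserved=0):
    """Longest IPv4 prefix whose subnet fits every vServer that may join it.

    group_max_sizes: maximum size of each distribution group on this network
    reserved:        extra addresses kept aside (virtual IPs, future groups, ...)
    """
    hosts = sum(group_max_sizes) + reserved
    needed = hosts + 2  # network address + broadcast address
    # Number of host bits so that 2**host_bits >= needed, never below a /30.
    host_bits = max(2, (needed - 1).bit_length())
    if host_bits > 30:
        raise ValueError("too many hosts for a single IPv4 subnet")
    return 32 - host_bits


def usable_hosts(prefix):
    """Usable host addresses in an IPv4 subnet with the given prefix length."""
    return 2 ** (32 - prefix) - 2


if __name__ == "__main__":
    # Two groups of up to 8 vServers and one of up to 4 on the same network.
    prefix = smallest_prefix([8, 8, 4])
    print(f"/{prefix} gives {usable_hosts(prefix)} usable addresses")
```

Sizing from the maximum group sizes, not from the current number of vServers, keeps the network from becoming the limit when slow scaling kicks in.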
4-In storage volumes 
  • Considering the isolation levels, create one project (in ZFS) per set of applications or per application. More projects mean more maintenance and greater isolation.
  • Per node volumes
    • consider one NFS volume per node when each vServer should have its own disk volume and should never be able to read or write another vServer's volume. This topology means total storage volume isolation.
    • Useful for WLS domains when we want all vServers to have exactly the same path for domains.
  • duplicated volumes
    • consider one pair of volumes divided between even and odd vServers: when something goes wrong, only 50% of the service is affected, and the split is very useful for rolling upgrades.
    • Useful for binaries (that do not change frequently)
    • Useful for configuration that can be reused by more than one instance of the software.
  • single volumes
    • consider one volume shared between all servers when something is not core or does not cause denial of service if it fails, normally with many directories inside, one per vServer.
    • Useful for logs, backups, shared staging, Oracle Cloud Control agent software
  • use NFSv4
    • the behavior is different between NFSv3 and NFSv4 (see Using LDAP for Shared Authentication, by Donald Forbes)
    • make sure NIS is available to ZFS and vServers (if it is not, everything will appear as nobody inside the storage volumes :( and the applications will behave very strangely).
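
The even/odd split for duplicated volumes can be expressed as a tiny helper. This is a sketch under my own assumptions: vServers carry a numeric index, and the two export paths are hypothetical examples, not paths that exist on any appliance.

```python
# Hypothetical export paths: index 0 is used by even vServers, index 1 by odd ones.
BINARY_VOLUMES = ("/export/app/binaries_even", "/export/app/binaries_odd")


def volume_for(vserver_index, volumes=BINARY_VOLUMES):
    """Pick the duplicated volume a vServer should mount, by index parity."""
    return volumes[vserver_index % len(volumes)]


def rolling_upgrade_batches(vserver_indices, volumes=BINARY_VOLUMES):
    """Group vServers by volume so each volume can be upgraded in its own batch."""
    batches = {}
    for index in sorted(vserver_indices):
        batches.setdefault(volume_for(index, volumes), []).append(index)
    return batches


if __name__ == "__main__":
    for volume, members in rolling_upgrade_batches([1, 2, 3, 4]).items():
        print(volume, members)
```

Upgrading one volume and restarting only its batch leaves the other half of the vServers serving traffic from the untouched copy.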
Planning the resource allocation procedure
  • When to create more machines and when to create bigger machines?
    • bigger machines when reaching the distribution group limit: it makes no sense to have two vServers on one compute node doing the same thing (two kernels, two paging systems, two JVMs with the same PermGen contents); it is a waste. When sustained growth reaches the maximum number of compute nodes available for the distribution group, start recreating the vServers with more RAM and/or CPU.
  • industrialize the allocation procedure; let the Elastic Cloud Software work for you.
    • use OVAB
    • use scripting with substitution
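
The "more machines or bigger machines" rule can be written down as a simple decision. This is a minimal sketch assuming one vServer per compute node for the set; `next_scaling_step` and its return values are hypothetical names.

```python
def next_scaling_step(current_vservers, compute_nodes, group_max_size):
    """Decide between adding a vServer and recreating vServers with more CPU/RAM.

    Assumes at most one vServer of this set per compute node, so the real
    limit is the smaller of the node count and the distribution group size.
    """
    if min(current_vservers, compute_nodes, group_max_size) < 0:
        raise ValueError("all arguments must be non-negative")
    limit = min(compute_nodes, group_max_size)
    if current_vservers < limit:
        return "scale_out"  # there is still a compute node without a vServer
    return "scale_up"       # limit reached: recreate with more RAM and/or CPU


if __name__ == "__main__":
    print(next_scaling_step(current_vservers=4, compute_nodes=8, group_max_size=8))
    print(next_scaling_step(current_vservers=8, compute_nodes=8, group_max_size=8))
```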
-------------------------------------------------
Management aspects

  • vServers lifecycle
  • Oracle Fusion Middleware lifecycle
  • Monitoring and alerting

Oracle Fusion Middleware aspects
  • OTD
    • use throttling
    • use DOS detection mechanisms 
  • OSB
    • use throttling
    • validate XML against its schemas before passing it to the back end
  • WLS
    • use the new 12.1.2 dynamic clusters when possible
    • use Exalogic optimizations
Planning cross-datacenter clustering using Oracle MAA will be another post; it does not fundamentally change how to design the environment, only the networking design patterns.

The next part will be about run-time experiences, and then I expect to start talking in more detail about some items (like scripting, deployment automation, ...) that I have successfully used in some projects.
