UNDER CONSTRUCTION - needs splitting into "GOAL -> SOLUTION -> IMPLEMENTATION STEPS/MATURITY"

1. Best Practices Audit Checklist


All areas are described in terms of a progression or timeline, from chaos and ad hoc work in a smaller organisation to mature, best-practice working methods in a larger or better-organised workplace:

goal - uptime/uninterrupted service
We want to keep services available at all times
disk redundancy -> offline (tape) backups -> online (disk) backups -> failover box/site (manual) -> failover box/site (automatic) -> load-balanced cluster -> failover clusters -> auto pickup by failover site -> business continuity planning
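
The step from manual to automatic failover in the chain above amounts to letting a health check, rather than a person, choose the active site. A minimal sketch, assuming a priority-ordered list of sites and a caller-supplied health check (the names `pick_active` and `is_healthy` are hypothetical, not taken from any particular tool):

```python
from typing import Callable, Optional, Sequence


def pick_active(sites: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


if __name__ == "__main__":
    # Hypothetical example: the primary is down, so the standby is chosen.
    down = {"primary"}
    print(pick_active(["primary", "standby"], lambda s: s not in down))
```

With manual failover an operator makes this decision; with automatic failover the same logic runs on a schedule and repoints traffic itself.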

goal - performance and optimization
Doing the best with what you have
configuration of app -> filesystem -> network -> load balancing

goal - progress and improvement
You don't have to stand still and firefight; you want to take on projects that move the infrastructure forward
dedicated helpdesk -> project/callout days split -> review of helpdesk tickets for project selection -> auditing for improvement

goal - backups
single backup -> offsite storage -> periodic backups -> backup verification -> backup monitoring and alerts -> multiple backup types for different scenarios
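
The "backup verification" step can be sketched as a checksum comparison between a source file and its backup copy. This assumes the backup is a plain file copy; archive or tape formats would need a test restore first. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source: Path, backup: Path) -> bool:
    """True when the backup copy is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(backup)
```

A False result from `backup_matches` is the sort of event the next step in the chain, backup monitoring and alerts, would report.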

goal - troubleshooting
knowledgebase of past problems -> access to user systems -> access to developers or vendors -> training

goal - authentication
passwords -> procedures for changing passwords (for leavers etc.) -> secure passwords -> single sign-on -> security audits and reviews

goal - consistency
build sheet -> scripted -> Configuration Management software
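
Moving from a build sheet to a scripted build works best when each step is idempotent, so the script can be re-run safely on a half-built machine. A minimal sketch of one such step, assuming a line-oriented config file (the helper name `ensure_line` is hypothetical):

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to the file unless it is already present.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Configuration management software generalises this pattern: describe the desired state, change only what differs, and report what changed.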

goal - time synchronisation
Manually sync time -> Use network time -> in-house network time server -> multiple servers -> multiple monitored servers with alerting
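
For the last step (multiple monitored servers with alerting), a sketch of the two pieces involved: reading the transmit timestamp from a raw 48-byte NTP reply, and flagging servers whose measured offset exceeds a tolerance. The 0.5 second tolerance and the function names are assumptions for illustration; collecting the replies over the network is left out.

```python
import struct
from typing import Dict, List

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def transmit_time(packet: bytes) -> float:
    """Extract the server transmit timestamp from an NTP reply as Unix time."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def servers_out_of_tolerance(offsets: Dict[str, float],
                             max_offset: float = 0.5) -> List[str]:
    """Return the names of servers whose clock offset (seconds) is too large."""
    return sorted(name for name, off in offsets.items() if abs(off) > max_offset)
```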

goal - system documentation
systems logs -> systems central logging -> full systems audit list -> systems diagrams and topology maps -> systems documentation library (disks used, so replacement parts are known; IP address list; server history; past problems, etc.)

goal - process documentation
common procedure howtos -> scripts for common tasks

goal - change control
policy on types of change allowed -> log of changes -> established sign-off procedures for common tasks -> scripts/method for non-standard changes -> rollback plans -> request process for non-standard changes -> proper version control -> authenticated version control with role-based systematic approval process
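
Several of these steps (sign-off for common tasks, rollback plans, approval for non-standard changes) can be expressed as a simple gate on a change record. A minimal sketch under assumed rules: every change needs a rollback plan, standard changes are pre-approved, and non-standard changes need a set number of distinct approvers. The record fields and the rules themselves are illustrative, not a prescribed policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    summary: str
    standard: bool                 # True for pre-approved routine changes
    rollback_plan: str = ""
    approvals: List[str] = field(default_factory=list)


def may_proceed(change: ChangeRequest, required_approvals: int = 1) -> bool:
    """Apply the assumed policy: rollback plan always, sign-off if non-standard."""
    if not change.rollback_plan.strip():
        return False
    if change.standard:
        return True
    # Count distinct approvers so one person cannot sign off twice.
    return len(set(change.approvals)) >= required_approvals
```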

goal - system security
Security, Updates and Patching
Directory, Lookup and Authentication
Audits?
Log, central logs, alerting? - NIDS/HIDS and IPS
firewalls, minimal access set

goal - system management
Some values live in per-machine configs, so the infrastructure can change while the servers are never updated to match. Examples are DNS nameservers or NTP servers, set at machine build time and never changed thereafter. We need a system for managing these.
Manual log of what is set where -> central synced copy of configs -> central maintenance push/pull from central configs -> config-based push/mount to machines
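
Once there is a central copy of the intended values, drift on individual machines can be detected by comparison. A minimal sketch, assuming the settings have already been collected into dictionaries (how they are gathered from each host is out of scope here, and `find_drift` is a hypothetical name):

```python
from typing import Dict, List, Tuple


def find_drift(central: Dict[str, str],
               hosts: Dict[str, Dict[str, str]]) -> List[Tuple[str, str, str, str]]:
    """Return (host, key, expected, actual) for every setting that differs.

    A setting missing on a host is reported with an empty actual value.
    """
    drift = []
    for host in sorted(hosts):
        for key in sorted(central):
            actual = hosts[host].get(key, "")
            if actual != central[key]:
                drift.append((host, key, central[key], actual))
    return drift


if __name__ == "__main__":
    # Hypothetical data: web2 still points at an old NTP server.
    central = {"nameserver": "10.0.0.53", "ntp": "10.0.0.123"}
    hosts = {
        "web1": {"nameserver": "10.0.0.53", "ntp": "10.0.0.123"},
        "web2": {"nameserver": "10.0.0.53", "ntp": "10.0.0.99"},
    }
    for row in find_drift(central, hosts):
        print(row)
```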

goal - remote management
on-network in-band access -> VPN in-band access -> on-network OOB access (ILOM) to hosts -> on-network OOB access to routers and switches -> fallback OOB access from off-network

goal - monitoring and alerting
system monitoring (i.e. host level) -> alerts when server unavailable (ping script) -> service monitoring (i.e. major application level) -> tiered service monitoring -> tiered alerting -> off-network paging -> job monitoring (i.e. task level) -> documentation-driven monitoring configuration (so the monitoring system does not have to be maintained as a separate config) -> distributed monitoring
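
Two of the steps above, service-level checks and tiered alerting, can be sketched as follows. The service check here is a plain TCP connect, which only shows that something is listening on the port. The tier names and the threshold of three consecutive failures are assumptions chosen for illustration.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def alert_tier(consecutive_failures: int) -> str:
    """Map a run of failed checks to an escalation tier (assumed thresholds)."""
    if consecutive_failures <= 0:
        return "none"
    if consecutive_failures < 3:
        return "email"
    return "page"
```

Escalating only after repeated failures keeps a single dropped check from paging someone, while a sustained outage still reaches the off-network pager.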

goal - reporting
logs -> central logging host -> reporting and monitoring of logs -> uptime and availability from monitoring system -> non-error reporting (performance/capacity mgmt)
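
Availability figures can be derived directly from the monitoring system's check history. A minimal sketch, assuming evenly spaced checks recorded as True (up) or False (down):

```python
from typing import Sequence


def availability_percent(checks: Sequence[bool]) -> float:
    """Percentage of checks that succeeded over the reporting period."""
    if not checks:
        raise ValueError("no checks recorded for this period")
    return 100.0 * sum(checks) / len(checks)


if __name__ == "__main__":
    # Three successful checks out of four.
    print(availability_percent([True, True, True, False]))  # 75.0
```

If checks are not evenly spaced, each result would need weighting by the interval it covers.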