Redis Production Safety Checklist for Teams
Everyone has a Redis horror story. The intern who ran FLUSHALL on production. The senior dev who deleted the wrong keys with a wildcard pattern. The automated script that wiped session data during peak hours.
Redis doesn't ask "are you sure?" It just does what you tell it. Fast.
Here's how to not become the next cautionary tale.
The Danger Zone
Redis commands that can ruin your day:
| Command | Damage Potential |
|---|---|
| FLUSHALL | Wipes every database. Everything. Gone. |
| FLUSHDB | Wipes the currently selected database |
| KEYS * | Blocks the server on large datasets |
| Pattern-based DEL (KEYS or SCAN output fed into DEL) | Deletes more than you intended |
| CONFIG SET | Can break replication, persistence |
| DEBUG SEGFAULT | Crashes the server (yes, this exists) |
The scary part? These commands run the moment you hit Enter. No confirmation. No recycle bin. No "undo" button. Unless you have backups or AOF persistence in place, whatever they removed is gone.
Checklist: Before You Connect
1. Know Your Environment
Before running any command, verify:
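```bash
# Which server am I connected to?
redis-cli INFO server | grep -E "redis_version|tcp_port|config_file"
# Primary or replica?
redis-cli INFO replication | grep role
# How much data is here?
redis-cli DBSIZE
# Is this production? (maxmemory is usually set there; a hint, not proof)
redis-cli CONFIG GET maxmemory
```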
If you're using a GUI, check the connection name. "prod-redis-main" should trigger different behavior than "local-dev".
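Scripts can't see a connection name or a red banner, so the same check has to be automated for them. Below is a minimal sketch of a guard function. The name confirm_target is hypothetical, and so is the rule that production hosts contain "prod" in their name. Adjust the pattern to match your team's naming convention.

```bash
#!/usr/bin/env bash

# Hypothetical guard: returns 0 immediately for hosts that don't look like
# production. For hosts matching the (assumed) *prod* naming convention,
# it asks the operator to type the host name back and fails otherwise.
confirm_target() {
  local host="$1" answer
  case "$host" in
    *[Pp][Rr][Oo][Dd]*)
      printf 'Target "%s" looks like PRODUCTION.\n' "$host" >&2
      printf 'Type the host name to continue: ' >&2
      read -r answer || return 1
      if [ "$answer" != "$host" ]; then
        echo "Aborted." >&2
        return 1
      fi
      ;;
  esac
  return 0
}

# Example: only run DBSIZE once the target has been confirmed
# confirm_target "$REDIS_HOST" && redis-cli -h "$REDIS_HOST" DBSIZE
```

The guard asks you to type the full host name instead of "y" on purpose. It forces you to actually read where you are.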
2. Use Separate Connections for Prod
Never have production and development in the same connection list without clear visual distinction.
Naming convention:
- [PROD] Main Redis
- [STAGING] App Cache
- dev-local
Some teams use color coding. Others require VPN for production access. Whatever works - just make it obvious.
3. Disable Dangerous Commands
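In redis.conf (read at startup, so changes need a restart):
```
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command DEBUG ""
rename-command CONFIG ""
```
Or rename them to something obscure:
```
rename-command FLUSHALL "FLUSHALL_CONFIRM_a8f3b2"
```
This doesn't prevent determined admins, but it stops accidents. There are two caveats:
- Disabling CONFIG also disables the CONFIG SET commands used later in this checklist.
- The Redis documentation warns that renaming commands that get written to the AOF or sent to replicas can cause problems.

On Redis 6+, ACLs (step 7) are the recommended way to keep these commands away from everyday users.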
Checklist: Querying Safely
4. Never Use KEYS in Production
KEYS * scans every key in the database. Synchronously. Blocking everything else.
On a database with millions of keys, this can hang your server for seconds. During which no other commands run. Your app times out. Users complain.
Instead, use SCAN:
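```
# Bad: walks the whole keyspace in a single blocking call
KEYS user:*
# Good: returns one batch of keys plus the cursor for the next call
SCAN 0 MATCH user:* COUNT 100
```
SCAN is cursor-based. Each call does a small, bounded amount of work and returns a batch of keys along with a new cursor. Keep calling with that cursor until the server returns 0. COUNT is a hint rather than a hard limit, and a key can come back more than once during a full iteration, so deduplicate if that matters.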
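redis-cli --scan (used in the next step) runs this loop for you. Written out by hand, a full SCAN iteration looks roughly like the sketch below. scan_all is a hypothetical helper name. It relies on redis-cli's piped (non-TTY) output format: the next cursor on the first line, then one key per line.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print every key matching a pattern, one per line,
# by following the SCAN cursor until the server returns 0.
scan_all() {
  local pattern="$1" count="${2:-100}"
  local cursor=0 reply
  while :; do
    # Each call does a bounded amount of work on the server.
    reply="$(redis-cli SCAN "$cursor" MATCH "$pattern" COUNT "$count")" || return 1
    # Piped output: line 1 is the next cursor, the remaining lines are keys.
    cursor="$(printf '%s\n' "$reply" | head -n 1)"
    case "$cursor" in
      ''|*[!0-9]*) echo "Unexpected SCAN reply: $reply" >&2; return 1 ;;
    esac
    printf '%s\n' "$reply" | tail -n +2
    [ "$cursor" = "0" ] && break
  done
}

# Usage: scan_all 'user:*' | sort -u   # sort -u drops keys SCAN returned twice
```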
5. Test Patterns Before Bulk Operations
Want to delete all cache keys?
Step 1: Count first
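```bash
# How many keys match?
redis-cli --scan --pattern "cache:*" | wc -l
```
Step 2: Review a sample
```bash
redis-cli --scan --pattern "cache:*" | head -20
```
Step 3: Only then delete
```bash
# -n 100: pass 100 keys per DEL call
# -r: skip DEL if nothing matched (GNU xargs otherwise runs it once with no keys)
redis-cli --scan --pattern "cache:*" | xargs -r -n 100 redis-cli DEL
```
xargs splits on whitespace and treats quotes specially, so this pipeline mangles key names that contain spaces or quotes. For large values, UNLINK (Redis 4.0+) is a drop-in replacement for DEL that frees the memory in a background thread.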
Or better yet, use a tool with preview and undo.
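You can also fold the three steps into one function, so the count and the sample are always shown before anything is deleted. A minimal sketch, assuming bash and a server on redis-cli's default host and port. safe_bulk_delete is a hypothetical name, not an existing tool.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper around the count -> sample -> delete steps.
# It holds all matching key names in memory, so it suits thousands of keys,
# not millions. Key names containing whitespace or quotes are not handled.
safe_bulk_delete() {
  local pattern="$1" keys count answer
  keys="$(redis-cli --scan --pattern "$pattern")" || return 1
  if [ -z "$keys" ]; then
    echo "No keys match '$pattern'. Nothing to delete."
    return 0
  fi
  count="$(printf '%s\n' "$keys" | wc -l | tr -d ' ')"
  echo "$count keys match '$pattern'. First 20:"
  printf '%s\n' "$keys" | head -n 20
  # Typing the count back (not just "y") forces you to read the number.
  printf 'Type %s to delete them: ' "$count"
  read -r answer || return 1
  if [ "$answer" != "$count" ]; then
    echo "Aborted. Nothing deleted."
    return 1
  fi
  # DEL in batches of 100 keys per call.
  printf '%s\n' "$keys" | xargs -n 100 redis-cli DEL
}
```

The function deletes the keys it listed, not whatever matches at the moment you confirm. Keys created in between survive.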
6. Set Appropriate TTLs
Every cached value should have an expiration. No exceptions.
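```
# Expire after 1 hour (3600 seconds)
SET cache:user:1001 "{...}" EX 3600
```
Keys without a TTL accumulate, and memory fills up. What happens next depends on maxmemory-policy:
- With the default, noeviction, writes start failing with OOM errors.
- With an allkeys-* policy, Redis evicts keys whether they matter or not.
- With a volatile-* policy, only keys that have a TTL can be evicted. You lose live cache entries while the keys without an expiration, often the stale garbage, stick around.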
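A SCAN-based audit is enough to find keys that slipped through without an expiration. The sketch below uses the hypothetical name keys_without_ttl. It makes one TTL round trip per key, so on large keyspaces run it off-peak.

```bash
#!/usr/bin/env bash

# Hypothetical audit: list keys matching a pattern that have no expiration.
# TTL returns -1 for a key without an expire and -2 if the key is gone.
keys_without_ttl() {
  local pattern="$1" key ttl
  redis-cli --scan --pattern "$pattern" | while IFS= read -r key; do
    # Defensive: keep the inner redis-cli call off the loop's stdin
    ttl="$(redis-cli TTL "$key" </dev/null)"
    if [ "$ttl" = "-1" ]; then
      printf '%s\n' "$key"
    fi
  done
}

# Usage: keys_without_ttl 'cache:*' | wc -l
```

Setting the TTL in the same command as the write, as SET ... EX does above, also closes the gap where a crash between SET and EXPIRE leaves a key with no expiration.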
Checklist: Team Practices
7. Read-Only Access for Most People
Not everyone needs write access to production Redis.
Access levels:
- Read-only: Developers debugging issues
- Limited write: Deploy scripts, specific maintenance
- Full access: Senior ops, emergencies only
Redis 6+ ACLs support this:
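```
ACL SETUSER readonly on >password ~* +@read -@write -keys
ACL SETUSER deployer on >password ~app:* +SET +DEL +EXPIRE
```
Replace password with a real secret. KEYS belongs to the @read category, so the readonly user drops it explicitly. ACL SETUSER only changes the running server. Persist the users with ACL SAVE (if redis.conf points to an aclfile) or CONFIG REWRITE.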
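ACLs can also stand in for the rename-command approach from step 3. Keep one admin user with full access, and strip the destructive commands from the default user. This sketch makes three assumptions: Redis 7.0+ (ACL DRYRUN was added in 7.0), the admin name and CHANGE_ME password are placeholders, and nothing else (monitoring agents, Sentinel) relies on the default user to run these commands.

```bash
# Placeholder admin user with full access; create it before locking down default
redis-cli ACL SETUSER admin on '>CHANGE_ME' '~*' '&*' +@all
# Take the destructive commands away from the default user
redis-cli ACL SETUSER default -flushall -flushdb -debug -config
# Verify: the first call should report a permission problem, the second should print OK
redis-cli ACL DRYRUN default FLUSHALL
redis-cli ACL DRYRUN default GET foo
# Administrative work now runs as admin, for example:
redis-cli --user admin --pass CHANGE_ME CONFIG GET maxmemory
```

Unlike rename-command, this takes effect immediately and needs no restart. It also doesn't change what gets written to the AOF or sent to replicas.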
8. Log Dangerous Operations
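Redis has no dedicated audit log. The closest built-in option is the slow log with its threshold set to zero:
```
CONFIG SET slowlog-log-slower-than 0
CONFIG SET slowlog-max-len 10000
```
This records every command, with the client's address and name, but only in memory. The slow log is a ring buffer that keeps the latest 10,000 entries and is lost on restart, so ship entries somewhere durable if you need a real audit trail. If you disabled CONFIG in step 3, set these two options in redis.conf instead.

For production, you'll usually want a threshold instead (e.g., 10000 microseconds, the default) so the slow log catches slow queries. On Redis 6+, ACL LOG also records commands and logins that ACL rules rejected.

You can also feed MONITOR into a logging service, since it streams every command. Be careful, though: MONITOR itself is expensive on busy servers, and it shows full command arguments, values included.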
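Whatever you log, someone has to read it. The sketch below pulls recent entries back out with standard commands. ACL LOG needs Redis 6.0+, and the entry counts of 10 are arbitrary.

```bash
# Most recent 10 slow log entries: id, Unix timestamp, duration in microseconds,
# command with arguments, client address, and client name
redis-cli SLOWLOG GET 10
# How many entries are currently buffered
redis-cli SLOWLOG LEN
# Commands and logins rejected by ACL rules
redis-cli ACL LOG 10
```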
9. Backup Strategy
Redis persistence options:
| Method | Pros | Cons |
|---|---|---|
| RDB snapshots | Fast recovery, compact | Data loss since last snapshot |
| AOF | Minimal data loss (about 1 s of writes with everysec) | Larger files, slower restart |
| Both | Best of both worlds | More disk usage |
For production: Enable both. Test recovery regularly.
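```
# In redis.conf
# RDB: snapshot after 900 s if >= 1 key changed, 300 s if >= 10, 60 s if >= 10000
save 900 1
save 300 10
save 60 10000
# AOF: log every write and fsync once per second
appendonly yes
appendfsync everysec
```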
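Enabling both is only half the job. A full disk or a permissions problem shows up as an err status in INFO persistence long before anyone tries a restore. A minimal sketch of a check you could run from cron or monitoring; check_persistence is a hypothetical name.

```bash
#!/usr/bin/env bash

# Hypothetical health check: print the main persistence fields from
# INFO persistence and fail if the last RDB save or AOF write errored.
check_persistence() {
  local info
  # INFO lines end in \r\n; strip the \r so the patterns below match cleanly
  info="$(redis-cli INFO persistence | tr -d '\r')"
  [ -n "$info" ] || return 1
  printf '%s\n' "$info" |
    grep -E '^(rdb_last_save_time|rdb_last_bgsave_status|aof_enabled|aof_last_write_status):'
  ! printf '%s\n' "$info" | grep -qE '^(rdb_last_bgsave_status|aof_last_write_status):err'
}
```

This confirms that persistence is working, not that the files restore. redis-check-rdb and redis-check-aof can validate backup files offline, but nothing replaces an actual restore into a scratch instance.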
10. Have a Rollback Plan
Before any bulk operation:
- Snapshot: BGSAVE (then copy the RDB file off the server) or your regular RDB backup
- Document: What keys you're modifying
- Test: Run on staging first
- Execute: With monitoring active
- Verify: Check application behavior
- Cleanup: Remove temporary backups
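BGSAVE returns immediately while a forked child writes the snapshot, so "I ran BGSAVE" doesn't mean the snapshot exists yet. The Redis LASTSAVE documentation suggests the check: read LASTSAVE, issue BGSAVE, then poll until LASTSAVE changes. A minimal sketch of that; bgsave_and_wait is a hypothetical name, and the 300-second default timeout is arbitrary.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run BGSAVE and wait until LASTSAVE changes, which
# happens only when a save completes successfully. Gives up after a timeout.
bgsave_and_wait() {
  local timeout="${1:-300}" before waited=0
  before="$(redis-cli LASTSAVE)" || return 1
  # BGSAVE returns at once; the snapshot is written by a forked child process
  redis-cli BGSAVE >/dev/null || return 1
  while [ "$(redis-cli LASTSAVE)" = "$before" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "BGSAVE did not finish within ${timeout}s; check INFO persistence" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Snapshot written at $(redis-cli LASTSAVE) (Unix time)"
}
```

Once it returns successfully, copy the file named by CONFIG GET dir and CONFIG GET dbfilename to somewhere off the server before starting the bulk operation.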
Tool-Assisted Safety
Manual discipline works until it doesn't. 3 AM, tired, production is down - that's when mistakes happen.
What helps:
- Visual environment indicators: Red banner for production connections
- Confirmation dialogs: For destructive operations
- Undo support: Delete something? Get it back within a time window
- Pattern preview: See what matches before you act
- Safe Mode toggle: Disable writes entirely for browsing
Redimo builds these in because we've seen the disasters. Production connections show warning indicators. Bulk deletes require confirmation. Every deletion is undoable.
When Disaster Strikes
Already deleted something important? Act fast:
Immediate Steps
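- Stop the bleeding: Prevent further writes if possible. If AOF is enabled, also turn off automatic AOF rewrites so the log keeps the lost data
- Check replicas: A delete reaches replicas almost immediately, so this only helps if replication was lagging or broken at the time
- Check AOF: If append-only is enabled, the destructive command is in the log. As long as the AOF hasn't been rewritten since, you can stop the server, remove that command from the log, and restart
- Check RDB: Restore from the latest snapshot. FLUSHALL rewrites dump.rdb on the spot when save points are configured, so restore from a copy stored elsewhere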
Recovery Commands
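```bash
# Stop writes: the primary rejects writes unless 99 replicas are
# connected and in sync, so this works even with no replicas
redis-cli CONFIG SET min-replicas-to-write 99
# If AOF is on, stop automatic rewrites that would compact away the lost data
redis-cli CONFIG SET auto-aof-rewrite-percentage 0
# When did the last successful RDB save happen? (Unix timestamp)
redis-cli LASTSAVE
# Restore from an RDB backup: stop without saving, so the current
# (damaged) dataset doesn't overwrite dump.rdb on the way down
redis-cli SHUTDOWN NOSAVE
# then replace dump.rdb with your backup and restart.
# With appendonly yes, Redis loads the AOF at startup, not dump.rdb.
```
Avoid DEBUG RELOAD here. By default it saves the current dataset to disk before reloading, which overwrites the snapshot you were hoping to restore.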
Post-Mortem
Don't just fix it and move on. Document:
- What happened
- Why it happened
- How to prevent it next time
Then implement the prevention.
Summary
| Category | Action |
|---|---|
| Environment | Clear naming, visual distinction, VPN for prod |
| Commands | Disable/rename dangerous ones, use SCAN not KEYS |
| Patterns | Count and preview before bulk operations |
| Access | Read-only default, limited write access |
| Backup | RDB + AOF, test recovery regularly |
| Tooling | Safe Mode, confirmations, undo support |
Redis gives you speed. It's your job to add the guardrails.
Using Redimo? Enable Safe Mode in Settings for production connections. It's one toggle that could save your week.