This blog post is a precursor to a presentation I’ll be giving at SQL Saturday in NYC on May 30th, 2015. For those of you who will be attending the conference and this session, be warned that this blog post contains spoilers!
I began working for kCura in 2009 as an Application Support Specialist. I was hired in a senior role, and over the next few months I worked with 6 other application specialists, back when training was a baptism by fire and our clients’ SQL databases were just starting to grow at a very rapid rate.
Our key challenges then were using the right tools, understanding how to read them, and working together with our clients. There were some very long conversations, and some very, very deep dives into performance tuning across all aspects of the platform. When problems occurred – when SQL experienced an “event” of some sort, such as a shutdown, a stall, blocking, whatever – simply “getting through it” was never enough. We wanted to know what caused it. We had a burning desire to find the root cause.
Over the years, a few recurring themes surfaced. Of course, we had a couple of oddballs – things you may never see or may even believe can’t possibly have happened. You are entitled to your own opinion, I am not here to argue the points of fact or the historical record. I am here to tell a story, a story of 10 SQL server flameouts.
Firstly, you may ask, “What’s a flameout?” A flameout is a non-technical term for the loss of “flame” in a jet engine, and it is analogous to a SQL Server “crashing” or losing its ability to perform its primary function – which is to run and complete queries against a database. A flameout can be caused by any number of things – a failed fuel pump, a fire, a bird strike, etc. This is not a blog post about top ten jet engine flameouts, though. It’s about SQL. The bottom line is that if a fighter jet with just one engine has a flameout, it has lost all propulsion. The same could be said for a SQL Server.
Fatal Error 211 (flameout 1 of 10)
This is about corruption, and the sudden discovery of it. This blog post was a direct result of an assertion that the corruption was caused by a log drive becoming full.
This problem did not happen in the logs. Data on disk can become corrupt just sitting there – drives can corrupt data even when they are not under power.
Log files becoming corrupt isn’t an issue, either. There is this trick where you can rebuild a missing log, and that is the topic of a later flameout.
There were other questions about what caused this: Could a corrupt document or load file cause this? In the application that loads data into SQL, data is stringently typed, and there is no character or combination of characters that could have caused this. Documents themselves also don’t get loaded into SQL, just the metadata from them, which goes into SQL via typed variables. As for why the corruption didn’t cause trouble until recently: there was a restart.
The most probable theory of why corruption appears after a restart is that the (clean) system table was cached in RAM and was different from the (corrupted or missing) one on disk. When the server restarted, the instant SQL tried to look at that bad table, it choked. SQL always looks to RAM for the most recent data, and once a table has been cached, it will never go back to disk for it unless it is changed. Data on disk NEVER changes before it changes in RAM, so SQL would never have any reason to go back and check the table on disk. In other words, SQL looks for changes in RAM and writes them to disk – not vice versa. It knows from the SQL log where the most recent data is and whether or not it has been written to disk. This relationship is at the foundation of something called ACID – atomicity, consistency, isolation, and durability of transactions.
This also explains why the backup was bad. A backup always gets the most recent data, and it gets it from disk, and since SQL had no reason to believe the system allocation pages had changed….
The corruption may also have been noticed (suddenly) if something changed the system allocation pages in memory during the week since the last CHECKDB. As soon as SQL went to commit that change to the corrupt spot on disk that DBCC was checking, the roof caved in.
At this time, there are no known unpatched defects in SQL Server 2008 that cause corruption. Certainly, a defect in SQL could be at the root cause, but the more likely instigator is the physics of electromagnetism and entropy.
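Since this kind of corruption only surfaces when the bad page is finally read, the practical defense is a regular integrity check plus a quick look at the suspect-page history. A minimal sketch – the database name is just a placeholder:

-- Run a full integrity check on a schedule; don’t wait for a restart to find corruption
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Pages SQL Server has already flagged as suspect (823/824 errors, checksum failures, etc.)
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;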
RAM – maxed out (2 of 10)
We’ve all done it – we’ve all been guilty of thinking that some setting or another on our SQL Server is a “set-it-and-forget-it” (SIFI) setting. In fact, we have a 60+ page guide dedicated to optimizing SQL for running Relativity. This guide serves as a good starting point. Key words are “Starting” and “point.” Most settings in SQL will need adjustment based on differing workloads – even in the same application. If you don’t know why you are locking pages in memory, and you don’t know why you are raising your minimum memory value, then you shouldn’t do it. You should get help. Professional help.
A couple of key things to know:
- There are three RAM counters presented on the Performance tab in Task Manager (Windows Server 2008):
- (A) Cached RAM = RAM that was in use but isn’t in use now and is available for use.
- (B) Available RAM, roughly = Cached + Free
- (C) Free RAM
- Total RAM = A + B + C
- In Windows Server 2012, the counters are:
- In Use RAM
- Available RAM
- Committed RAM
- Cached RAM
You shouldn’t necessarily be overly concerned about Free RAM hovering at or around zero, unless of course this is a new thing that is not normal for your system. When Available RAM continuously hovers between 0 and 100 MB, you have a problem that you need to fix. Unless, of course, this is normal for your system and you are so awesome you can walk right up to that line and flirt with it as though she were Superman’s wife and you have a bucket of kryptonite in your left hand and a crowbar in your right.
Without exception, when I have seen Available RAM at 0 and had a sluggish, non-responsive system on my hands, this was the problem. Most of the time, my first action is to lower max memory in SQL and ask questions later. Why? Because I need a system that is responding to my actions in order to figure out what happened. SQL Server is usually very responsive to this action, and the system becomes almost immediately usable. Then, with the client and other support representatives from kCura, we would go on our scavenger hunt and figure out what had happened. It can be anything from a bloated file cache – say, a user dragged and dropped a large file onto the desktop of the SQL server, which over time erodes the RAM available to SQL Server – to web browsers left open and the SQL server generally being treated as though it were a desktop machine.
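Lowering max memory is quick to do from a query window. A minimal sketch – the 49152 MB value is only an illustration; pick a number that leaves real headroom for the OS and whatever else is eating RAM:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 49152;  -- illustrative value only
RECONFIGURE;

The change takes effect immediately, no restart required, which is exactly why it makes a good first move on an unresponsive box.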
By the way, it’s a good security AND performance practice to prevent Internet access from your SQL server. Furthermore, you should not even have SSMS installed on it. If you can run Core, so much the better – and congratulations to you, because it is a great practice.
In conclusion – keeping an eye on your max memory setting, and knowing who and what are stealing precious RAM from SQL, is a top ten skill in the DBA toolkit. Knowing whether you should even set min memory, and whether you should use lock pages in memory – that is an advanced topic, and I have my own opinions about it, some of which are unpopular. If you can find me after my SQL Saturday session, I will happily discuss it with you.
Stuck in Recovery (3 of 10)
Hands-down, this is probably one of my all-time favorites. All Relativity databases (with a couple of exceptions) ship, by default, in the FULL recovery model. Sometimes it becomes necessary to restart SQL (I know, right?), and if we don’t have the healthiest of log files, restarting can become a most painful process. You can read about log file maintenance here (link to Relativity tech support document about Relativity log file maintenance), a document we created that has been peer reviewed by Brent Ozar and Mike Walsh.
If you are not doing anything special to maintain your log files, you are missing out on some of the sweetest performance benefits you can have. Aside from decreasing restart times, and not having your database get stuck in recovery for 2 days (yes, this can happen), you can boost performance by removing LOG WRITE WAITS – completely. Yes. You can COMPLETELY remove LOG WRITE WAITS. This is not a wait type that should ever be in even your top 100 waits, unless the waits were created by some maintenance that you did. (You of course can’t COMPLETELY remove them.)
This link takes you to a SQL script that is useful in counting how many VLFs you have.
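If you don’t have that script handy, a quick (if undocumented) way to count VLFs is DBCC LOGINFO, which returns one row per VLF for the current database. A minimal sketch – the column list below matches SQL Server 2012 and later; on 2008 R2 and earlier, drop the RecoveryUnitId column:

-- Capture DBCC LOGINFO output and count the rows (one row = one VLF)
CREATE TABLE #vlf (RecoveryUnitId int, FileId int, FileSize bigint, StartOffset bigint,
                   FSeqNo bigint, [Status] int, Parity int, CreateLSN numeric(38,0));
INSERT INTO #vlf EXEC ('DBCC LOGINFO');
SELECT COUNT(*) AS VLFCount FROM #vlf;
DROP TABLE #vlf;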
And here is a post by Jonathan Kehayias that explains log management more thoroughly. If you plan to grow your log file organically – that is, just set it at some size and then forget about it – and you expect the log file to grow, then you should consider setting the growth increment to something large, such as 512 MB or 1 GB. If this becomes a problem and you begin to see LOG WRITE WAITS, then you should attempt to predict how big your largest read/write table with a non-sequential primary clustered index will become, and force-grow your log file to somewhat larger than that.
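Pre-growing the log to a deliberate size, with a sensible growth increment, is a one-liner. A minimal sketch – the database name, log file name, and the 64 GB figure are purely illustrative:

-- Force-grow the log once, to a size you chose on purpose, with a large fixed growth increment
ALTER DATABASE YourDatabase
MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 64GB, FILEGROWTH = 1GB);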
“Why read/write, non-sequential?” you ask. While it is true that there are a few outlier situations where a linear-key clustered index on a large table may need to be rebuilt, the operations that will consume the most log space are mainly as follows (in no particular order):
- non-sequential, clustered index rebuild – it’s so fragmented that only a complete rebuild of the index will do any good.
- non-clustered index reorgs and rebuilds
There are probably other things that will chew up a lot of log space, but as a good DBA you should know what they are and be prepared for when they will happen. You must be able to anticipate these things. The main reason for concern in Relativity about the type of table that may be rebuilt lies in our auditing structure. Every single user action, every data load, every mass edit, is logged in our audit record table. While you may (in Relativity) have a 300 GB Document table, you may also have a 3 terabyte auditRecord_PrimaryPartition table. The Document table, which is read/write, may have some very large non-clustered indexes that need to be rebuilt, and it has on occasion needed to be rebuilt itself due to the addition of computed column indexes. The auditing table, conversely, has relatively small non-clustered indexes, and its primary key is monotonically increasing and is of little concern. That table will never need to be rebuilt, and if it does, then you will know about it; it is not something that will happen in the middle of the day or over a weekend in a maintenance window. If a non-clustered index on that table has actually exceeded the size of the Document table, that would be an interesting thing – we would like to see it, and of course you would then adjust your behavior accordingly.
So, these are just the lessons we have learned, and they may or may not be applicable to all databases everywhere. SIFI is not in the cards for any of us in this room; we must learn from experience and make wise decisions based on past performance. This is the best we can do, and while I can share with you my experience, and the experiences of kCura, I cannot give you SIFI guidelines. Even asking for such a thing is the sign of an immature DBA.
“hey!…that’s my turbo button!?” (4 of 10)
If you call your software support representative and you have turned on Priority Boost on your SQL server, you may expect to be asked, quite firmly, to remove that setting. If you refuse, you may find that your tech support company refuses to offer you any further assistance on your issue until such time as this box has been unchecked and SQL has been restarted. This feature marks SQL Server threads with the highest priority; no other processes, including those of Windows, will get a higher priority. Combine this with someone setting MAXDOP to 40 on a 40-core server, and you have just ghosted your SQL Server. Flame. Out.
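Checking and clearing the setting is straightforward. A minimal sketch – note that, unlike max memory, this one does require a SQL Server restart to take effect:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- 1 = priority boost is on; set it back to 0 and restart SQL Server
EXEC sp_configure 'priority boost', 0;
RECONFIGURE;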
Paging (5 of 10)
This relates somewhat to flameout 2 of this series. Sometimes, if SQL comes under memory pressure, it will choose to page memory out to disk. Once this memory is paged out, we have seen that it does not return to physical RAM, even after whatever circumstances caused the paging have been alleviated. If it is an option, I recommend you restart SQL. Of course, you should know your VLF situation before you do that, because if you restart a SQL server and get stuck in recovery for 2 days, you will have a very upset stakeholder on your hands. This is one of two circumstances where I will recommend a SQL server restart. The other is if a SAN gets unplugged in the night – SQL will show the database as being there, but it won’t be.
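A quick way to see whether the SQL Server process itself has been squeezed like this is sys.dm_os_process_memory (available since SQL Server 2008). A minimal sketch:

-- How much physical memory the SQL Server process is actually holding, and whether the OS has signaled pressure
SELECT physical_memory_in_use_kb / 1024 AS physical_memory_in_use_mb,
       memory_utilization_percentage,
       page_fault_count,
       process_physical_memory_low,
       process_virtual_memory_low
FROM sys.dm_os_process_memory;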
Another way to get the cache to clear out might be to run
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
but I have found that these two commands do not reliably dump ALL buffers. You may still have something left in the paged buffer pool that will create significant impedance. There are probably some additional commands I could enter here.
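For what it’s worth, one documented command that goes a bit further than the pair above is DBCC FREESYSTEMCACHE, which releases unused entries from all caches. It still won’t guarantee the paged-out memory comes back, which is why the restart remains my recommendation:

-- Releases unused cache entries from all caches; use with care on a busy production box
DBCC FREESYSTEMCACHE ('ALL');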
Cache out (unsolved mysteries) (6 of 10)
On three separate occasions (so yes, this is a more esoteric top ten item), we have come across a situation where the oldest item in the cache was only about 10 minutes old, rolling forward. This means that as we sat and watched, even after 30 minutes of observation, we saw no significant aging. All the settings were set to best practices – plus, there is no one setting that could cause ALL items in the cache to never age.
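If you want to watch for the same symptom, one way (my assumption about how you might reproduce the observation, not the exact query we used) is to track the oldest plan in the plan cache over time:

-- If this timestamp keeps rolling forward and never gets more than a few minutes old, your cache is not aging normally
SELECT MIN(creation_time) AS oldest_cached_plan, COUNT(*) AS cached_plans
FROM sys.dm_exec_query_stats;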
Here is where a DBA who faithfully follows a strict protocol of change control must be lauded. Were it not for that DBA, who was able to tell us the ONLY change that had been made in recent memory, we would still be scratching our heads. I had been on site with this client just three weeks prior to this follow-up visit, and the problem of the vanishing cache did not exist then.
He had set min memory = max memory. We all doubted that this could be the root cause (we still do), but when he set min memory back to 0, the problem stopped. To this day, when we talk about it, we refuse to believe that this setting CAUSED this, but since then I have seen it on two other servers. There may be some other environmental factor at play here – something to do with clustering, or maybe with another configuration setting – we don’t know. For now, if you see this, change the min memory, which you probably should not have set at all on a stand-alone instance of SQL where it is the only application. Even were this not the case, there is no reason to set the two exactly the same; there is no detrimental effect to leaving them a few MB or GB apart.
Wait, did you just say NTFS? (7 of 10)
One time I received a ticket that involved optimization. Our client’s complaint was slow queries. She was right: the query was slow, and it was a query that I already knew about, and I knew what index to apply to get it to run very fast, sub-100ms. After optimization, though, the query would not go faster than 1800ms. “This is still too slow.” I sniffed around the system, ran some checks, looked at WAITS, and it became pretty apparent to me that we had a problem with storage. I let her know, and a week later we were on a call with one of kCura’s storage and virtualization experts, the client, and me, the tuner. I began asking questions, and as the client’s storage expert answered my usual questions, a text message popped up. Our storage expert had texted: “He just said NFS.”
“What??” I replied via text. “I thought he said NTFS. Who puts their SQL on NFS?”
Next, I asked, “Did you just say ‘NTFS’?” and the client replied, “No, I said ‘NFS’”
“Are you running SMB 3.0?”
“No. Is this a problem?” he asked.
“Yes, it’s safe to say that we do not need to have any further discussions. This is our root cause.”
Several weeks later, the client moved to a SQL Server with dedicated storage.
It worked fine yesterday (8 of 10)
Inevitably, you get the call: “This query worked fine yesterday; today it is not working at all and won’t finish.” Here are ten possible reasons. Can you think of any more?
- The root query fundamentally changed
- An index was dropped
- The size of the data being queried increased 50x overnight
- Storage was swapped out
- A setting was changed (Max memory set wrong, MAXDOP set to 40/40 cores, etc.)
- The user is confused – their query is brand new and has not been run before today (plan cache)
- The server was restarted, the cache is not warm
- There was a failover (we’re not home alone anymore)
- Maintenance plans have been failing. FOR WEEKS. (A quick check for this one is sketched just after this list.)
- We are fresh out of magic pixie dust
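A couple of these are quick to rule out from a query window. For example, maintenance plans that have been quietly failing often show up as stale statistics, which you can check with STATS_DATE – the table name below is just an illustration:

-- When were statistics on a given table last updated? Weeks-old dates on a busy table are a red flag.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Document')  -- illustrative table name
ORDER BY last_updated;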
Blue Screen (9 of 10)
I have very vivid memories of this issue. I started at kCura in 2009 – up until that point, I had been a paid consultant, where the affected users I served were either employees of the billed party or partners in a law firm that would be paying my bill. Hands down, across the board, they would rather buy a new server than have me spend 20-30 hours (or more) digging into a root cause. kCura has a different approach – we don’t bill our clients by the hour to help them solve their technical challenges. We just help, and we keep helping until we either have an answer or everyone agrees that an answer is not possible.
We received a call one afternoon about a server that had become unresponsive. During the course of the call, the SQL server blue screened. What followed was a very demanding client, insisting that we explain how and why this happened – they wanted to know what caused it. Review of the error log showed that SQL Server had in fact shut down, and we had a Windows dump file that we were able to send to Microsoft for analysis. In the error log, we had a portion of a query.
Relativity allows users to create and run custom search queries against sometimes large data sets. You are probably all aware of Google’s limitation of some 40 or so words in a search. At the time, Relativity had no such limit – and the search query that SQL had dropped into the error log was very large. So large, in fact, that it had been truncated when it was inserted into the error log.
This query became central to our investigation. Two days later, I had completed the design and creation of a query that would search through all workspaces and pull back any saved search larger than 10,000 characters. It returned a large number of searches. I raised the threshold to 20k. Then 100k. Then 400k. Two searches came back – and one of them was an exact match for our suspect. With this information, we were then able to figure out who had run the query, and when it had been run. We were able to train the user, and we also modified our application to allow a configurable upper limit on search submissions.
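I won’t reproduce the actual Relativity schema here, but the shape of such a query is roughly as follows – the EDDS% database naming pattern and the SavedSearch table and column names are assumptions for illustration only, and sp_MSforeachdb is undocumented but widely used:

-- Purely illustrative sketch: walk every workspace database and find oversized saved searches
EXEC sp_MSforeachdb '
IF ''?'' LIKE ''EDDS%''
BEGIN
    SELECT ''?'' AS workspace_db, SearchID, DATALENGTH(SearchText) AS search_bytes
    FROM [?].dbo.SavedSearch
    WHERE DATALENGTH(SearchText) > 400000;
END';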
By listening to our clients, and by digging deep, we were able to make our application better and improve the customer experience. The query I built to analyze searches across all workspaces was the beginning of a framework of views and recurring saved-search analyses, whose development would span 5 years and result in the complete automation of search analysis. This one thing, which would normally have been swept under the rug in most corporate cultures, became instrumental to development that would take place 5 years later.
Warning: Disk IO took longer than 15 seconds (10 of 10)
One thing is certain – where there is smoke, there is fire. This error message in your log reinforces the lesson “check the error log first.” It is your first go-to when troubleshooting a SQL problem. Much of what can go wrong in SQL will be reported here, and if SQL Server can’t talk to its files for more than 15 seconds, it logs it in the error log. We had the chance to dive deep into this error on a large, VCE converged system that had been experiencing extensive issues. Ultimately, we felt the fault lay in the fast cache – that it was stalling and not responding. During troubleshooting, the client suggested that the disk IO errors were not related, and that we should seek the root cause elsewhere. This prompted us, with the client’s assistance, to develop a script that could detect IO taking longer than 1 second. This involved a little custom scripting and analysis of SQL’s ring buffers.
After the script was completed and we ran it on a 2-second wait interval, we learned that not only was there the occasional I/O wait that took longer than 15 seconds, there were one-second-long burps happening almost constantly.
The script that ran this check was not difficult to write, but it fell to the doom of too many open tabs in Management Studio and was never saved. According to Mike Walsh (Linchpin DBA), this information can also be teased out of extended events.
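Since the original is gone, here is a rough substitute – not the ring-buffer script we wrote, but a sketch that gets at the same symptom by polling sys.dm_io_pending_io_requests on a 2-second interval and reporting anything pending for more than 1 second:

-- Rough reconstruction, not the original: poll for I/O requests pending longer than 1 second
WHILE 1 = 1
BEGIN
    SELECT SYSDATETIME() AS sampled_at,
           DB_NAME(vfs.database_id) AS database_name,
           vfs.file_id,
           pio.io_type,
           pio.io_pending_ms_ticks
    FROM sys.dm_io_pending_io_requests AS pio
    JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        ON pio.io_handle = vfs.file_handle
    WHERE pio.io_pending_ms_ticks > 1000;

    WAITFOR DELAY '00:00:02';
END
-- In practice you would INSERT the rows into a logging table instead of returning result sets.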