For example, if a SQL statement runs for 10 seconds in production and 20 seconds in QAT, but the production system is twice as fast as QAT, is that a problem? It's challenging to compare SQL run times when the same SQL resides in different environments.
In this posting, I present a way to remove the CPU speed differences so an appropriate "apples to apples" SQL elapsed time comparison can be made, thereby improving our ability to more correctly detect risky SQL that may be placed into the upgraded production system.
And, there is a cool, free, downloadable tool involved!
There are a number of reasons why a SQL's run time is different in different systems. An obvious reason is a different execution plan. A less obvious and much more complex reason is a workload intensity or type difference. In this posting, I will focus on CPU speed differences. Actually, what I'll show you is how to remove the CPU speed differences so you can appropriately compare two SQL statements. It's pretty cool.
If a SQL statement's elapsed time in production is 10 seconds and 20 seconds in QAT, that's NOT an issue IF the production system is twice as fast.
If this makes sense to you, then what you did was mentally adjust one of the systems so it could be appropriately compared. This is how I did it:
10 seconds in production * 2 (production is twice as fast as QAT) = 20 seconds
And in QAT, the SQL ran in 20 seconds… so really, they ran "the same" in both environments. If I consider placing the SQL from the test environment into the production environment, this scenario does not raise any risk flags. The "trick" is determining that "production is 2 times as fast as QAT" and creatively using that information.
Determining the "Speed Value"
Fortunately, there are many ways to determine a system's "speed value." Basing the speed value on Oracle's ability to process buffers in memory has many advantages: a real load is not required or even desired, real Oracle code is being run at a particular version, real operating systems are being run, and the processing of an Oracle buffer highly correlates with CPU consumption.
Keep in mind that this type of CPU speed test is not an indicator of scalability (benefit of adding additional CPUs) in any way, shape, or form. It is simply a measure of brute force Oracle buffer cache logical IO processing speed based on a number of factors. If you are architecting a system, other tests will be required.
As you might expect, I have a free tool you can download to determine the "true speed" rating. The tool also shows the execution plan of its speed test SQL; if that execution plan differs between systems, then obviously we can't expect the "true speeds" to be comparable.
You can download the tool HERE.
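To give a feel for what such a measurement involves (this is only a rough sketch of the concept, not the downloadable tool), you can drive a burst of buffer cache logical IO and divide the session's logical reads by the elapsed time. The table name some_cached_table and the loop count below are placeholders, and querying v$mystat from PL/SQL requires a direct select grant on the underlying v$ views.

set serveroutput on
declare
  l_lio_start   number;
  l_lio_end     number;
  l_t_start     number;
  l_elapsed_sec number;
  l_dummy       number;
begin
  -- session logical reads before the test
  select ms.value into l_lio_start
    from v$mystat ms, v$statname sn
   where sn.statistic# = ms.statistic#
     and sn.name = 'session logical reads';

  l_t_start := dbms_utility.get_time;  -- hundredths of a second

  -- drive buffer cache logical IO; some_cached_table is a placeholder
  for i in 1 .. 100 loop
    select /*+ full(t) */ count(*) into l_dummy from some_cached_table t;
  end loop;

  -- elapsed seconds (floor of 0.01 to avoid divide-by-zero on a tiny table)
  l_elapsed_sec := greatest((dbms_utility.get_time - l_t_start) / 100, 0.01);

  -- session logical reads after the test
  select ms.value into l_lio_end
    from v$mystat ms, v$statname sn
   where sn.statistic# = ms.statistic#
     and sn.name = 'session logical reads';

  dbms_output.put_line('rough speed value (LIO/sec): ' ||
                       round((l_lio_end - l_lio_start) / l_elapsed_sec));
end;
/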
Before we can analyze the risk, we need the "speed value" for both systems. A faster system has a larger speed value. If the production system's speed value is 600 and the QAT system's speed value is 300, then production is deemed "twice as fast."
Now let's put this all together and quickly go through three examples.
This is the core math:
standardized elapsed time = SQL elapsed time * system speed value
So if the SQL elapsed time is 25 seconds, and the system speed value is 200, then the standardized "apples-to-apples" elapsed time is 5000, which is 25*200. The "standardized elapsed time" is simply a way to compare SQL elapsed times, not what users will feel and not the true SQL elapsed time.
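To see why multiplying works, remember that a faster system has a larger speed value but a smaller elapsed time for the same work, so the product cancels out raw CPU speed: 20 seconds at a speed value of 300 and 10 seconds at a speed value of 600 both standardize to 6000. Here is a minimal SQL sketch of the calculation; the environment names, elapsed times, and speed values are simply the example numbers from this posting.

with sql_timing as (
  select 'QAT' as env, 20 as elapsed_sec, 300 as speed_value from dual
  union all
  select 'PRD', 10, 600 from dual
)
select env,
       elapsed_sec,
       speed_value,
       elapsed_sec * speed_value as standardized_elapsed_time
  from sql_timing;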
To make this a little more interesting, I'll quickly go through three scenarios focusing on identifying risk.
1. The SQL truly runs the same in both systems.
Here is the math:
QAT standardized elapsed time = 20 seconds X 300 = 6000 seconds
PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds
In this scenario, the true speed situation is QAT = PRD. This means the SQL effectively runs just as fast in QAT as in production. If someone says the SQL is running slower in QAT and therefore presents a risk to the upgrade, you can confidently say it's because the PRD system is twice as fast! In this scenario, the QAT SQL will not be flagged as presenting a significant risk when upgrading from QAT to PRD.
2. The SQL runs faster in production.
Now suppose the SQL runs for 30 seconds in QAT and 10 seconds in PRD. Someone might say, "Well, of course it runs slower in QAT because QAT is slower than the PRD system." Really? Everything is OK? Again, to make a fair comparison, we must compare the systems using a standardizing metric, which I have called the "standardized elapsed time."
Here are the scenario numbers:
QAT standardized elapsed time = 30 seconds X 300 = 9000 seconds
PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds
In this scenario, the QAT standardized elapsed time is greater than the PRD standardized elapsed time. This means the SQL is truly running slower in QAT compared to PRD. Specifically, the slower SQL in QAT cannot be fully explained by the slower QAT system. Said another way, while we expect the SQL in QAT to run slower than in the PRD system, we didn't expect it to be quite this slow. There must be another reason for the slowness, one we are not accounting for. In this scenario, the QAT SQL should be flagged as presenting a significant risk when upgrading from QAT to PRD.
3. The SQL runs faster in QAT.
In this final scenario, the SQL runs for 15 seconds in QAT and 10 seconds in PRD. Suppose someone were to say, "Well, of course the SQL runs slower in QAT. So everything is OK." Really? Everything is OK? To better understand the true situation, we need to look at the standardized elapsed times.
QAT standardized elapsed time = 15 seconds X 300 = 4500 seconds
PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds
In this scenario, the QAT standardized elapsed time is less than the PRD standardized elapsed time. This means the QAT SQL is actually running faster in QAT, even though the QAT wall time is 15 seconds and the PRD wall time is only 10 seconds. So while most people would flag this QAT SQL as "high risk," we know better! We know the SQL is effectively running faster in QAT than in production! In this scenario, the QAT SQL will not be flagged as presenting a significant risk when upgrading from QAT to PRD.
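Putting the three scenarios together, the risk flag reduces to a single comparison: flag the SQL only when its QAT standardized elapsed time exceeds its PRD standardized elapsed time. Here is a quick sketch using the scenario numbers and speed values from above; everything else is illustrative.

with scenarios as (
  select 1 as scenario, 20 as qat_sec, 10 as prd_sec from dual
  union all
  select 2, 30, 10 from dual
  union all
  select 3, 15, 10 from dual
)
select scenario,
       qat_sec * 300 as qat_standardized,
       prd_sec * 600 as prd_standardized,
       case
         when qat_sec * 300 > prd_sec * 600 then 'RISK: truly slower in QAT'
         else 'no significant risk'
       end as upgrade_risk
  from scenarios;

Only scenario 2 is flagged, matching the analysis above.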
In Summary...
Identifying risk is extremely important while planning for an upgrade. It is unlikely the QAT and production systems will be identical in every way, and this mismatch makes identifying risk more difficult. One of the most common differences between systems is their CPU processing speed. What I demonstrated was a way to remove the CPU speed differences so an appropriate "apples to apples" SQL elapsed time comparison can be made, thereby improving our ability to more correctly detect risky SQL that may be placed into the upgraded production system.
What's Next?
Looking at the "standardized elapsed time" based on Oracle LIO processing is important, but it's just one reason why a SQL may have a different elapsed time in a different environment. One of the big "gotchas" in load testing is comparing production performance to a QAT environment with a different workload. Creating an equivalent workload on different systems is extremely difficult to do. But with some very cool math and a clear understanding of performance analysis, we can create a more "apples-to-apples" comparison, just like we did with CPU speeds. But I'll save that for another posting.
All the best in your Oracle performance work!
Craig.