This post is about how an intense and diverse SQL statement workload can cause Oracle parsing problems. Big problems.
There are many reasons for Oracle parsing-related performance issues and many ways to identify them. So, I am narrowing the scope by focusing on one possible cause, that is, an intense and diverse SQL statement workload. I will use Oracle's wait interface to identify the root cause.
Sound dry and boring? To make this a little more interesting and a whole lot more fun, I'm going to approach this from a non-traditional, almost backward perspective. After explaining some of the Oracle internals, I'm going to create a parsing-intensive workload and then watch what happens... in real-time. The educational impact is so powerful, I recorded a video (scroll to the bottom).
Enjoy the read... and the video!
When a client|user process submits SQL or PLSQL to the database, before it can be executed, it must be parsed. So, parsing had better happen fast and consume minimal resources.
Parsing requires a number of actions take place and in the proper order. For example, there are various checks related to syntax, semantics, and permissions. There are memory management tasks such as looking for SQL in the library cache and also when creating cursors and child cursors.
Did you notice that all of this requires CPU? This is why "too much" parsing activity can consume so much CPU that you can end up with a CPU bottleneck, which makes a bad situation even worse.
So, if there is "too much" parsing, there will likely be a lack of available CPU resources and Oracle latching and mutex contention issues. If the situation becomes completely out of control, there could be out of shared pool memory errors or even an instance freeze. Great stuff if you enjoy solving Oracle performance problems, but bad stuff from an organization/business perspective.
There are three calls that a user|client process can make to a foreground process. They are parse, execute, and fetch. So parsing can be simplified to say that it's everything up until the execution phase begins. A gross simplification for sure.
By the way, the instance statistic "user calls" is incremented anytime one of these three calls is received by a foreground process. If you want to dig into user calls a little deeper, I blogged about them HERE and HERE.
There are many ways to force Oracle library cache problems. My focus today is on pounding Oracle with tons of unique SQL statements, each having a unique SQL_ID.
Another perspective is pounding Oracle with a massive diversity of SQL statements that result in unique SQL_IDs.
In this post, I will not go into strategies related to reducing the number of unique SQL_IDs.
Take a close look at the below SQL. My objective is to dynamically create three similar, yet unique SQL statements.
SQL> set echo on verify on tab off
SQL> set linesize 150
SQL> col sql_text format a95
SQL>
SQL> alter system flush shared_pool;
System altered.
SQL> drop table toomuch;
Table dropped.
SQL> create table toomuch ( c1 number );
Table created.
SQL> insert into toomuch select level just_a_column from dual connect by level <= 50000;
50000 rows created.
SQL> commit;
Commit complete.
SQL> select count(*) from toomuch;
COUNT(*)
----------
50000
SQL> select 'select '||dbms_random.value(0,9999999)||' from toomuch where c1 = 1' from dual;
'SELECT'||DBMS_RANDOM.VALUE(0,9999999)||'FROMTOOMUCHWHEREC1=1'
-------------------------------------------------------------------------
select 3368727.13606693486484336772687471926489 from toomuch where c1 = 1
SQL> /
'SELECT'||DBMS_RANDOM.VALUE(0,9999999)||'FROMTOOMUCHWHEREC1=1'
-------------------------------------------------------------------------
select 4443872.66702328733095306687574442351562 from toomuch where c1 = 1
SQL> /
'SELECT'||DBMS_RANDOM.VALUE(0,9999999)||'FROMTOOMUCHWHEREC1=1'
-------------------------------------------------------------------------
select 7250473.28864361529432819773982295315706 from toomuch where c1 = 1
Next, I am going to run each of the three generated "similar" SQL statements. No tricks, just a straightforward copy and paste into SQLPlus.
SQL> select 3368727.13606693486484336772687471926489 from toomuch where c1 = 1;
3368727.13606693486484336772687471926489
----------------------------------------
3368727.14
SQL> select 4443872.66702328733095306687574442351562 from toomuch where c1 = 1;
4443872.66702328733095306687574442351562
----------------------------------------
4443872.67
SQL> select 7250473.28864361529432819773982295315706 from toomuch where c1 = 1;
7250473.28864361529432819773982295315706
----------------------------------------
7250473.29
Next, I am going to run each of the three generated "similar" SQL statements. No tricks, just a straightforward copy and paste into SQLPlus.
SQL> select 3368727.13606693486484336772687471926489 from toomuch where c1 = 1;
3368727.13606693486484336772687471926489
----------------------------------------
3368727.14
SQL> select 4443872.66702328733095306687574442351562 from toomuch where c1 = 1;
4443872.66702328733095306687574442351562
----------------------------------------
4443872.67
SQL> select 7250473.28864361529432819773982295315706 from toomuch where c1 = 1;
7250473.28864361529432819773982295315706
----------------------------------------
7250473.29
Now comes the interesting part. Look at the output from the below SQL statement pulling from v$sql. Focus in on our three similar SQL statements. Notice anything about their SQL_IDs?
SQL> select sql_id, sql_text
2 from v$sql
3 where lower(sql_text) like '%from toomuch%'
4 and sql_text not like '%v$sql%';
SQL_ID SQL_TEXT
------------- -----------------------------------------------------------------------------------------------
60yv8s17u4992 select 7250473.28864361529432819773982295315706 from toomuch where c1 = 1
cdmcvdn7uwm66 select count(*) from toomuch
2zu2t949nx3j1 select 'select '||dbms_random.value(0,9999999)||' from toomuch where c1 = 1' from dual
8kng37shkpwwq select 3368727.13606693486484336772687471926489 from toomuch where c1 = 1
b1h92czr43b2a select 4443872.66702328733095306687574442351562 from toomuch where c1 = 1
SQL>
Our three similar SQL statements each have a different SQL_ID. What this means is, each of the SQL_IDs will have its own cursor within the shared pool's library cache.
What I write below is super important to understand.
Because these three SQL statements have never been run before, they did not reside in the library cache. And, they will likely never be run again! Therefore, our Oracle foreground process must do additional parsing-related work, called "hard parsing," to get our SQL properly in the library cache. If these SQL statements had already resided in the library cache, the associated CPU and memory management would be significantly reduced, resulting in a "soft parse." (There is a softer and no-parse, but that's a conversation for a later time.)
Not convinced that having the SQL already in the library cache would make a significant performance difference? Check this out.
Now it's time to think about making the library cache angry.
Now consider not just three statements but hundreds or thousands. You may say it will never happen. To be blunt... you're wrong. It does happen and it happens in production Oracle systems.
Look at the below script.
$ cat angry.sh
sqlplus system/manager <<EOF
declare
sql_var varchar2(500);
begin
while 1 = 1
loop
sql_var := 'select c1 from toomuch where c1 = '||dbms_random.value(0,9999999) ;
execute immediate sql_var ;
dbms_lock.sleep(0.01);
end loop;
end;
/
EOF
The above script simply creates and executes an endless supply of similar yet unique SQL statements. (Yes, I realize there could be duplicates.)
Running the script will likely be undetectable, consuming very little resources. But that's not my plan. My plan is to launch this script in the background multiple times! This is how you make your library cache angry!
When life is good and the database is humming along there is plenty of available CPU resources and Oracle sessions do not have to wait for anything... just milking the CPU subsystem for resources.
Below is the result from my real-time session sampler (rss.sql) that is contained in my free OSM Toolkit.
The columns, left to right, are a sequence-like number, sample time, session ID, session serial number, Oracle user, session state (CPU, WAITING), SQL_ID, wait event (if waiting), and other stuff.
In this example, I am sampling all active Oracle sessions once every 0.25 seconds. The workload is very light because only a couple angry.sh scripts are running. Notice all the active sessions are consuming CPU, and therefore no sessions are "waiting."
190 11:42:47:764 401 28120 SYSTEM CPU 52zu30zuscaaw - - [,,]
190 11:42:48:015 15 12091 SYSTEM CPU 3u565sybjb9zh - - [,,]
190 11:42:48:266 254 27097 SYSTEM CPU 3qxun4q1zp54n - - [,,]
190 11:42:48:517 322 29824 SYSTEM CPU d8sx36rhqm31m - - [,,]
190 11:42:48:768 412 45369 SYSTEM CPU 6zcs88udqq41b - - [,,]
190 11:42:49:019 412 45369 SYSTEM CPU fst6zwt7mfcqw - - [,,]
190 11:42:49:019 333 49784 SYSTEM CPU f6pj32wcuy1d0 - - [,,]
190 11:42:49:019 324 31501 SYSTEM CPU 0k4xj6zha67pd - - [,,]
190 11:42:49:019 86 30715 SYSTEM CPU g625nbz82r4rd - - [,,]
190 11:42:49:270 333 49784 SYSTEM CPU dmun9dstqq8zj - - [,,]
191 11:42:49:521 335 2765 SYSTEM CPU 9ra4g5b1s26fb - - [,,]
191 11:42:49:521 180 50922 SYSTEM CPU fca7a1v14mf63 - - [,,]
191 11:42:49:772 411 52258 SYSTEM CPU 72hj0fr4b5wfx - - [,,]
191 11:42:49:772 405 51108 SYSTEM CPU g625nbz82r4rd - - [,,]
191 11:42:49:772 400 27805 SYSTEM CPU dv6ypf4paa8wq - - [,,]
Below is what the Linux OS was experiencing based on a vmstat report (vmstat 3 9999). There are 6 CPU cores in the bare metal server. The far left "r" columns show the average CPU run queue is much less than 6, indicating there is plenty of CPU resources. And, the average CPU utilization is around 7% ("us" + "sy" columns). So, the CPU subsystem looks great!
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 5789780 284132 17808136 0 0 0 11 544 253 7 0 93 0 0
0 0 0 5789780 284132 17808136 0 0 0 16 542 248 7 0 93 0 0
1 0 0 5790276 284132 17808136 0 0 0 17 702 308 9 0 91 0 0
0 0 0 5790276 284132 17808136 0 0 0 16 535 272 7 0 93 0 0
0 0 0 5790276 284140 17808136 0 0 0 16 527 239 7 0 93 0 0
1 0 0 5790276 284140 17808136 0 0 0 13 524 228 7 0 93 0 0
0 0 0 5790276 284140 17808136 0 0 0 16 520 245 7 0 93 0 0
1 0 0 5790276 284140 17808136 0 0 0 11 529 235 7 0 93 0 0
1 0 0 5790276 284144 17808136 0 0 0 16 530 237 7 0 93 0 0
The above looks great... until the workload changes. I personally altered the workload by firing off 20 simultaneous "angry" sessions which generate over a thousand unique SQL statements every few seconds. Below is the result of me doing a simple count related to my "angry" SQL statements every second or so.
SQL> select count(*) from v$sql where lower(sql_text) like '%from toomuch%' and sql_text not like '%v$sql%';
COUNT(*)
----------
6921
SQL> /
COUNT(*)
----------
10392
SQL> /
COUNT(*)
----------
12398
SQL> /
COUNT(*)
----------
14847
SQL> /
COUNT(*)
----------
17998
Each Oracle session is generating hundreds of unique SQL statements every few seconds. To complete this work, each Oracle session must create new cursors and child cursors which require CPU and memory resources along with Oracle latch/mutex acquisitions. That is fine until there are lots of sessions doing this same thing... over and over.
At some point, the combination of workload, Oracle memory serialization requirements, and OS resources is just too much. The result is lots of Oracle sessions waiting before they can consume CPU to get on with executing their SQL statement.
Below is the result from my real-time session sampler during the mayhem. As you can see, the activity profile is very different... lots of waiting for library cache mutexes and shared pool latches.
203 11:43:19:647 401 28120 SYSTEM WAIT 82kb2uku8wp47 other library cache: mutex X [74979849,64424509440,485997024444417]
203 11:43:19:647 336 63081 SYSTEM WAIT 1zkzf3j26ugg3 other library cache: mutex X [74979849,64424509440,485997024444417]
203 11:43:19:647 333 49784 SYSTEM WAIT d24u3kfd7hrhp other library cache: mutex X [74979849,64424509440,485997024444417]
203 11:43:19:647 324 31501 SYSTEM WAIT 5vw6uhbwsm19h other library cache: mutex X [74979849,64424509440,485997024444522]
203 11:43:19:647 83 46586 SYSTEM WAIT cxs8tjq89qxgv other library cache: mutex X [74979849,64424509440,485997024444417]
203 11:43:19:647 9 31992 SYSTEM WAIT 4hbtgmr7st7j3 other library cache: mutex X [74979849,64424509440,485997024444522]
203 11:43:19:898 166 28604 SYSTEM CPU 5z4yaucgshgpx - - [,,]
203 11:43:20:149 412 45369 SYSTEM CPU 8cfkahv227q4w - - [,,]
203 11:43:20:400 412 45369 SYSTEM WAIT 70ug99n8xh9vk other latch: shared pool [1612000384,453,0]
203 11:43:20:651 333 49784 SYSTEM CPU 8pkntfm7v8w6r - - [,,]
203 11:43:20:902 335 2765 SYSTEM CPU 9ra4g5b1s26fb - - [,,]
203 11:43:21:153 258 53179 SYSTEM CPU 1xfwcurvsa1dz - - [,,]
203 11:43:21:153 15 12091 SYSTEM WAIT 79dqx1b39ttbw other latch: shared pool [1612000544,453,0]
203 11:43:21:153 9 31992 SYSTEM CPU c08uwsrzduf65 - - [,,]
203 11:43:21:404 400 27805 SYSTEM WAIT g625nbz82r4rd other latch: shared pool [1612000384,453,0]
Take a look at the vmstat output below. Wow! The CPU run queue is over 10, and remember the bare metal box only has 6 physical CPU cores. So, it makes sense the average CPU utilization is 99%. Ouch!
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 0 0 5562620 284624 17808272 0 0 0 11 5987 1485 99 0 1 0 0
7 0 0 5562620 284632 17808272 0 0 0 20 6005 1509 99 0 1 0 0
10 0 0 5562588 284632 17808272 0 0 0 11 6024 1493 99 0 1 0 0
15 0 0 5562588 284640 17808272 0 0 0 20 6006 1502 99 0 1 0 0
11 0 0 5562588 284640 17808272 0 0 0 11 5999 1481 99 0 1 0 0
13 0 0 5562596 284648 17808272 0 0 0 28 6019 1508 99 0 1 0 0
In this video, I'm going to do exactly what I wrote about above. That is, cause shared pool latching and library cache mutex contention by having multiple sessions quickly flooding Oracle with thousands and thousands of unique SQL statements.
What are the lessons learned here?
Of course, understanding and correctly diagnosing the problem does not solve the problem. However, most DBAs struggle with performance tuning because they do not correctly diagnose the problem.
While I purposely did not get into solving the problem, the initial solution is to directly attack the cause of the problem. That is, reduce the number of concurrent Oracle sessions, reduce the number of unique SQL statements and increase the CPU capacity. There are a number of other options as well.
I hope you enjoy the video and thanks for reading!
Craig.
Start my FREE 18-lesson Machine Learning For Oracle Professionals E-Course here.