How Not to Find Data Guard Gaps

Written by Sean Scott | Jul 29, 2024 5:19:20 PM

I've recently had several encounters with a query for checking for archive log gaps. It appears on multiple blogs, and I suspect it may originate from MOS note 1542907.1, "Standby MRP is looking for very old archive log sequence even after restore new incremental backup at standby site." The query in question:

  SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
    FROM (SELECT THREAD# ,SEQUENCE#
    FROM V$ARCHIVED_LOG
   WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME)
    FROM V$ARCHIVED_LOG
GROUP BY THREAD#)) ARCH,
 (SELECT THREAD# ,SEQUENCE#
    FROM V$LOG_HISTORY
   WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME)
    FROM V$LOG_HISTORY
GROUP BY THREAD#)) APPL
   WHERE ARCH.THREAD# = APPL.THREAD#
ORDER BY 1;

This may sometimes show correct results, but isn’t a reliable way of getting gaps from a standby. To understand why its output is misleading and sometimes wrong, let's break down what it does.

The first problem is it will report different results on the primary and standby, depending on the number of configured destinations.

Primary with three log_archive_dest set:

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   2158                  2158          0
         1                   2158                  2158          0
         1                   2158                  2158          0

Standby with two log_archive_dest:

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   2158                  2158          0
         1                   2158                  2158          0

Standby with one log_archive_dest:

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   3100                  3100          0

Having multiple destinations isn't unusual, so seeing multiple rows in the output suggests the authors didn’t anticipate more complex environments, didn't understand the data, or didn't test this across different environments. Each line in the output represents a different log destination. Do two rows mean that there are two standby databases, and both are in sync? No!

I have another issue with this. Working on critical issues should be as foolproof as possible. A query that produces different results on different hosts means people who may already be dealing with cognitive overload (phone calls, chat and text messages, emails, alerts from monitoring systems, listening to and answering questions on a bridge, people camped out in their cubicle) must remember this query has to run on a particular system. The different results further complicate things. In these situations, I prefer something that works reliably and provides identical results no matter where it's executed.

Pet Peeve: Messy Code

The query is based on two similar subqueries against v$archived_log and v$log_history:

 (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH
 (SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL

The output of the ARCH subquery (from v$archived_log) represents the "Last Sequence Received".
The output of the APPL subquery (from v$log_history) represents the "Last Sequence Applied".

Both look at the maximum value of FIRST_TIME to limit the result to a single sequence per thread. But… FIRST_TIME is the timestamp of the first redo change (SCN) and corresponds to the SCN in FIRST_CHANGE#. The problem? This column has nothing to do with redo transport or apply! A single thread/sequence can have multiple rows in v$archived_log—one for each log destination—but the FIRST_TIME is always the same for all, regardless of whether it's been shipped or applied on a standby!

Since FIRST_TIME doesn't differentiate records, the outermost select against the subquery is redundant. They can be simplified to:

 (SELECT THREAD# ,max(SEQUENCE#) as sequence# FROM V$ARCHIVED_LOG GROUP BY THREAD#) ARCH
 (SELECT THREAD# ,max(SEQUENCE#) as sequence# FROM V$LOG_HISTORY GROUP BY THREAD#) APPL

I'm a code minimalist. While the results are identical, code should be as simple as possible, especially when used during critical events—like disaster recovery. The extra code adds noise and makes it harder to read (and understand).

OK, so duplicate rows and some extra SQL aren't the end of the world. Let's look at the far more significant problem: wrong results!

There Was a Gap, but There Was No Gap

I mentioned that this query appears in MOS and multiple blog posts. I haven't looked at all of them, but I haven’t seen any guidance regarding where to run this. Logically, it should run on the standby because it uses v$log_history, which tells us nothing about logs on the standby (more on why later.) However, running it on a standby guarantees situations where it will not report a gap, even when there is a gap.!

In this case, the query showed no gap. DBAs used output from v$managed_standby to further “confirm” that log shipping working:

SQL> select process,thread#,sequence#,status from v$managed_standby;

PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               1       2093 CLOSING
RFS                0          0 IDLE
MRP0               1       2156 APPLYING_LOG
RFS                0          0 IDLE

If you aren’t familiar with the output of v$managed_standby, this might suggest the system is healthy. Yet things were very, very Not Good. Redo wasn't shipping to the standby, thanks to a bad connection string in the Data Guard Broker configuration.

Why did the query show a gap of 0? It has no information to the contrary, and will thus never report a gap in this scenario, thanks to the nature of data in these views.

The "Last Received Sequence" Check

Following the event, I selected additional columns from v$archived_log for three sequences:

2093, shown in the v$managed_standby output above, with a status of CLOSING;
2158, the sequence reported as current, with no gap;
2272, the (then) latest log sequence.

alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS';
break on sequence# on thread#
  select sequence#, thread#, recid, dest_id, creator, registrar, standby_dest, archived, applied, deleted, status, completion_time, first_time, fal
    from v$archived_log
   where sequence# in (2093, 2158, 2272)
order by sequence#, recid;

Results on the primary:

 SEQUENCE#    THREAD#      RECID    DEST_ID CREATOR REGISTR STA ARC APPLIED   DEL S COMPLETION_TIME     FIRST_TIME          FAL
---------- ---------- ---------- ---------- ------- ------- --- --- --------- --- - ------------------- ------------------- ---
      2093          1       1059          1 ARCH    ARCH    NO  YES NO        YES D 2024-06-25 02:08:17 2024-06-25 02:05:28 NO
                            1060          3 ARCH    ARCH    NO  YES NO        YES D 2024-06-25 02:08:17 2024-06-25 02:05:28 NO
                            1192          2 ARCH    ARCH    YES YES YES       NO  A 2024-06-25 19:37:56 2024-06-25 02:05:28 YES
      2158          1       1189          1 ARCH    ARCH    NO  YES NO        YES D 2024-06-25 19:31:53 2024-06-25 19:23:19 NO
                            1190          3 ARCH    ARCH    NO  YES NO        YES D 2024-06-25 19:31:53 2024-06-25 19:23:19 NO
                            1191          2 ARCH    ARCH    YES YES YES       NO  A 2024-06-25 19:31:54 2024-06-25 19:23:19 YES
      2272          1       1596          2 LGWR    LGWR    YES YES YES       NO  A 2024-06-27 12:06:41 2024-06-27 12:04:08 NO
                            1597          1 ARCH    ARCH    NO  YES NO        NO  A 2024-06-27 12:06:53 2024-06-27 12:04:08 NO
                            1598          3 ARCH    ARCH    NO  YES NO        NO  A 2024-06-27 12:06:53 2024-06-27 12:04:08 NO

And, on the standby:

 SEQUENCE#    THREAD#      RECID    DEST_ID CREATOR REGISTR STA ARC APPLIED   DEL S COMPLETION_TIME     FIRST_TIME          FAL
---------- ---------- ---------- ---------- ------- ------- --- --- --------- --- - ------------------- ------------------- ---
      2093          1       4303          1 ARCH    RFS     NO  YES YES       YES D 2024-06-25 19:37:41 2024-06-25 02:05:28 NO
                            4304          3 ARCH    RFS     NO  YES YES       NO  X 2024-06-25 19:37:41 2024-06-25 02:05:28 NO
      2158          1       4301          1 ARCH    RFS     NO  YES YES       YES D 2024-06-25 19:31:39 2024-06-25 19:23:19 NO
                            4302          3 ARCH    RFS     NO  YES YES       NO  X 2024-06-25 19:31:39 2024-06-25 19:23:19 NO
      2272          1       4527          1 ARCH    RFS     NO  YES YES       NO  A 2024-06-27 12:06:32 2024-06-27 12:04:08 NO
                            4528          3 ARCH    RFS     NO  YES YES       NO  A 2024-06-27 12:06:32 2024-06-27 12:04:08 NO

Notice the following:

These results confirm that FIRST_TIME is identical for every thread/sequence on both systems and doesn't vary based on whether logs were shipped or applied.
On the primary, each sequence has three records—one for each log destination. Destinations 1 and 3 are local. Destination 2 is the standby.
On the standby, there are two records per sequence. Destination 1 is FRA. Destination 3 is a local directory.
The COMPLETION_TIMEs at the primary for sequence 2093 are identical at the local destinations, but delayed by ~17.5 hours at the standby.
The COMPLETION_TIMEs at the standby do not match the primary.
The FAL column on the primary—indicating whether the standby issued a FAL (Fetch Archive Log) request to the primary—is YES for the standby destinations of sequences 2093 and 2158.
- FAL requests originate at the standby database and indicate that the log wasn't available locally (eg, hadn't been shipped as expected). A FAL request asks the primary (or other upstream participant in a Data Guard configuration) to re-send the missing file.
- Seeing FAL requests for sequences 2093 and 2158 means those logs were missing on the standby.
- Once the configuration was corrected and the primary and standby resumed communication, the standby identified the missing sequences and requested them from the primary.
Look closely at the RECID values. These reflect the order that log records were created.
- They're sequential on the standby, but on the primary, the RECID for the standby destinations are significantly delayed for sequence 2093.
- The record for shipping sequence 2093 was also created after the record for sequence 2158.
- The standby destination COMPLETION_TIME for sequence 2158 is also earlier than the completion time for sequence 2093—consistent with the RECID values.
The completion time and RECID of the standby destination row for sequence 2272—created after the configuration was corrected—are earlier than the times for the local destinations and earlier still on the standby than the primary.

Based on the contents of the view on the primary, determining whether a log was received via v$archived_log would require—at minimum—confining results to rows with STANDBY=YES and APPLIED=NO.

Determining whether a sequence was received but not applied on a standby is a bit more difficult. We could filter on APPLIED=NO, but that doesn't consider the possibility of multiple destinations. We might look to v$archive_dest, but in this case, the only thing different about destinations 1 and 3 are the values of DESTINATION. One is set to USE_DB_RECOVERY_FILE_DEST, while the other is a directory. It's dangerous to assume standby apply always runs on DEST_ID=1 or DESTINATION=USE_DB_RECOVERY_FILE_DEST, and this is precisely the sort of thing that leads to flawed conclusions during a crisis!

The "Last Applied Sequence" Check

Recall that the parent query gets its "Last Sequence Applied" value from v$log_history:

  SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
    FROM
 (SELECT THREAD# ,max(SEQUENCE#) as sequence# FROM V$ARCHIVED_LOG GROUP BY THREAD#) ARCH,
 (SELECT THREAD# ,max(SEQUENCE#) as sequence# FROM V$LOG_HISTORY GROUP BY THREAD#) APPL
   WHERE ARCH.THREAD# = APPL.THREAD#
ORDER BY 1;

Here is the output for the three sequences:

  select thread#, sequence#
    from v$log_history
   where sequence# in (2093, 2158, 2093)
order by sequence#;

    THREAD# SEQUENCE#
----------- ---------
          1      2093 
          1      2158
          1      2272

This view has a single entry for each thread/sequence because it reflects only that a sequence exists—the history. It has no detail and nothing identifying whether a log was applied! The value of the "Last Applied Sequence" (on the standby) will always be the latest received sequence.

That’s why this query is probably intended to run on the standby, and also why it should never be run on the standby. The sequence in v$log_history at the primary is definitive. v$log_history on the standby only knows what it’s received.

But if log shipping isn't working, the sequence in the standby’s v$log_history is the last sequence it registered. It can't know about anything more recent, and if it isn't communicating with the primary, the last sequences in v$archived_log and v$log_history will always match, and the gap will always be 0!

(Note that the v$log_history view also can’t tell us if a sequence is applied, only that it was received and registered with the database. Using this view to get the “Last Applied Sequence” is flawed. That could only, perhaps, come from v$archived_log.)

That's why gap checks shouldn’t run from standby databases. If contact is broken, these views are still populated, but not current. Processes will probably still be running, and there may be no outward indication of a problem.

Do This Instead

v$archived_log is not an authoritative source for identifying gaps in Data Guard. If the primary can't reach one or more standby databases, or if there are errors in transport or apply, the presence of a sequence doesn’t mean data is moving.

We might be able to refine the original query, adding checks against the APPLIED and STANDBY columns, but consider other factors that complicate attempts to find gaps in Data Guard:

The protection mode
The sync mode
Multiple, alternate, and deferred destinations
Destination priorities
The log_archive_min_succeed_dest setting
The archive_lag_target setting
FarSync
Cascading standbys

It’s a lot to consider. Writing a query to report accurate gaps from Data Guard—under every situation—is difficult. Fortunately, we don't need to write anything. Oracle built a check into the Data Guard Broker, and there’s a dedicated view that reports gaps, as well as shipping and apply progress:

DGMGRL> show configuration verbose;

Configuration - orcl_config

  Protection Mode: MaxPerformance
  Members:
  orcl_a       - Primary database
    orcl_b       - Physical standby database

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

DGMGRL> show database orcl_b;

Database - orcl_b

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          1 second (computed 0 seconds ago)
  Average Apply Rate: 2.57 MByte/s
  Real Time Query:    OFF
  Instance(s):
    orcl

Database Status:
SUCCESS

The Broker provides identical results on any member of a Data Guard configuration. If something is wrong, it doesn’t work or returns errors.

The v$archived_dest_status view has a column that reports the gap status, archived and apply sequences, and other information:

col dest_id for 99
col dest_name for a20
col db_unique_name for a12
  select dest_id
,        dest_name
,        gap_status
,        archived_thread#
,        archived_seq#
,        applied_thread#
,        applied_seq#
,        type
,        database_mode
,        recovery_mode
,        status
--,        error
    from v$archive_dest_status ads
   where status              != 'INACTIVE'
     and recovery_mode       in ('MANAGED', 'MANAGED REAL TIME APPLY')
order by dest_id
,        archived_thread#;

DEST_ID DEST_NAME            GAP_STATUS               ARCHIVED_THREAD# ARCHIVED_SEQ# APPLIED_THREAD# APPLIED_SEQ# TYPE             DATABASE_MODE   RECOVERY_MODE           STATUS
------- -------------------- ------------------------ ---------------- ------------- --------------- ------------ ---------------- --------------- ----------------------- ---------
      2 LOG_ARCHIVE_DEST_2   NO GAP                                  1          2298               1         2297 PHYSICAL         MOUNTED-STANDBY MANAGED REAL TIME APPLY VALID

This view can be joined with v$archive_dest, v$archived_log, v$thread, v$database, v$instance, and others (or any of their gv$ counterparts) to paint a more complete picture of Data Guard activity. However, this only works from the primary. When run on the standby, the GAP_STATUS is NULL.

What About `v$managed_standby`?

If you've made it this far, congratulations! I want to revisit one last piece of this puzzle that added confusion for the DBAs working on this issue.

Remember that query against v$managed_standby? It seemed to confirm that Data Guard was working because the SEQUENCE# of the MRP process was advancing:

SQL> select process,thread#,sequence#,status from v$managed_standby;
 
PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               1       2093 CLOSING
RFS                0          0 IDLE
MRP0               1       2111 APPLYING_LOG
RFS                0          0 IDLE
 
SQL> select process,thread#,sequence#,status from v$managed_standby;
 
PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               1       2093 CLOSING
RFS                0          0 IDLE
MRP0               1       2156 APPLYING_LOG
RFS                0          0 IDLE

Note the following:

All but two processes are attached to (the nonexistent) thread 0.
The MRP0 (Managed Standby Recovery Process) process reads redo and applies it on the standby, and the values are increasing.
However the ARCH process didn't advance—both results showed the same sequence, 2093, in a CLOSING status.
There is only one ARCH process for thread 1. We expect to see one per log group.
The RFS (Remote File Server) processes on the standby receive incoming redo and write to the standby redo logs. Both are idle and assigned to thread 0.

This reveals the danger of ruling out a problem by using something that doesn’t clearly state a condition or error or that you aren’t familiar enough with to (instantly) separate normal from abnormal.

After correcting the configuration, the output of the healthy system has multiple ARCH and RFS processes:

PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               1       2297 CLOSING
ARCH               1       2294 CLOSING
ARCH               1       2298 CLOSING
ARCH               1       2295 CLOSING
ARCH               1       2299 CLOSING
ARCH               1       2296 CLOSING
ARCH               0          0 CONNECTED
ARCH               1       2293 CLOSING
MRP0               1       2300 APPLYING_LOG
RFS                0          0 IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                1       2300 IDLE
RFS                0          0 IDLE

After a log switch, the ARCH process starts writing the new, incoming log. RFS begins real-time apply for the new log. MRP starts applying the new log sequence:

PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               1       2297 CLOSING
ARCH               1       2294 CLOSING
ARCH               1       2298 CLOSING
ARCH               1       2295 CLOSING
ARCH               1       2299 CLOSING
ARCH               1       2296 CLOSING
ARCH               0          0 CONNECTED
ARCH               1       2300 WRITING
MRP0               1       2301 APPLYING_LOG
RFS                0          0 IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                1       2301 IDLE
RFS                0          0 IDLE

v$archived_log shows apply running IN-MEMORY when the database is in Real-Time Apply mode (these results are for slightly different sequences—sorry):

 SEQUENCE#    THREAD#      RECID    DEST_ID CREATOR REGISTR STA ARC APPLIED   DEL S COMPLETION_TIME     FIRST_TIME          FAL
---------- ---------- ---------- ---------- ------- ------- --- --- --------- --- - ------------------- ------------------- ---
      2282          1       4547          1 ARCH    RFS     NO  YES YES       NO  A 2024-06-27 14:12:47 2024-06-27 14:08:18 NO
                            4548          3 ARCH    RFS     NO  YES YES       NO  A 2024-06-27 14:12:47 2024-06-27 14:08:18 NO
      2283          1       4549          1 ARCH    RFS     NO  YES IN-MEMORY NO  A 2024-06-27 14:13:56 2024-06-27 14:12:55 NO
                            4550          3 ARCH    RFS     NO  YES IN-MEMORY NO  A 2024-06-27 14:13:56 2024-06-27 14:12:55 NO

The newly archived log switches from WRITING to CLOSING, while RFS changes to RECEIVING:

PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               1       2297 CLOSING
ARCH               1       2294 CLOSING
ARCH               1       2298 CLOSING
ARCH               1       2295 CLOSING
ARCH               1       2299 CLOSING
ARCH               1       2296 CLOSING
ARCH               0          0 CONNECTED
ARCH               1       2300 CLOSING
MRP0               1       2301 APPLYING_LOG
RFS                0          0 IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                1       2301 RECEIVING
RFS                0          0 IDLE

RFS returns to an IDLE state as it waits for the next log, and MRP continues the IN-MEMORY log apply:

PROCESS      THREAD#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               1       2297 CLOSING
ARCH               1       2294 CLOSING
ARCH               1       2298 CLOSING
ARCH               1       2295 CLOSING
ARCH               1       2299 CLOSING
ARCH               1       2296 CLOSING
ARCH               0          0 CONNECTED
ARCH               1       2300 CLOSING
MRP0               1       2301 APPLYING_LOG
RFS                0          0 IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                             IDLE
RFS                1       2301 IDLE
RFS                0          0 IDLE

I don't consider v$managed_standby a good source for confirming Data Guard is shipping and applying logs. There are no clear errors or exceptions; during a critical event, the output takes more time to review and process. I prefer error messages, binary conditions, or non-zero values that unequivocally indicate problems. (It is excellent for investigating issues with Data Guard processes, though.) However, if you aren't seeing sequences advancing for the ARCH processes, or if there's a gap between the sequences connected to the MRP and ARCH processes, there's a good chance that log shipping or redo apply isn't healthy!

View full post