Data Guard “CORRUPTION DETECTED: In redo blocks starting at block…” issues

I’ve been pulling my hair out over this one, so hopefully this post will prove useful to someone else experiencing similar problems with Data Guard traffic.

One of our cloud-hosted environments (IaaS) has an Oracle 11.2.0.4 Data Guard (physical standby) setup on Windows. Recently, the standby database started logging the following errors in its alert log:

Fri June 06 08:51:16 2016
RFS[1085]: Assigned to RFS process 8996
RFS[1085]: Opened log for thread 1 sequence 72899 dbid -2002036753 branch 876434118
CORRUPTION DETECTED: In redo blocks starting at block 135169count 2048 for thread 1 sequence 72899
Deleted Oracle managed file H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CMC1VNVP_.ARC
RFS[1085]: Possible network disconnect with primary database

The logs were being shipped across from the primary site, but the RFS process on the standby was detecting corrupt redo blocks as it received them and deleting the received archive log file, and so recovery stalled.
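To get a feel for how often this was happening, the alert log can be scanned for these messages. Here is a minimal Python sketch, assuming the message format shown above (note that Oracle prints no space between the block number and "count"). The function name and the sample text are my own, purely for illustration:

```python
import re

# Matches the alert log message shown above. Oracle prints no space between
# the block number and "count", so the whitespace there is treated as optional.
CORRUPTION_RE = re.compile(
    r"CORRUPTION DETECTED: In redo blocks starting at block\s*(\d+)\s*count\s*(\d+)"
    r"\s+for thread\s+(\d+)\s+sequence\s+(\d+)"
)


def find_corrupt_sequences(alert_log_text):
    """Return sorted, de-duplicated (thread, sequence) pairs flagged as corrupt."""
    hits = set()
    for line in alert_log_text.splitlines():
        match = CORRUPTION_RE.search(line)
        if match:
            hits.add((int(match.group(3)), int(match.group(4))))
    return sorted(hits)


# Sample lines copied from the alert log excerpt above.
sample = """RFS[1085]: Opened log for thread 1 sequence 72899 dbid -2002036753 branch 876434118
CORRUPTION DETECTED: In redo blocks starting at block 135169count 2048 for thread 1 sequence 72899
RFS[1085]: Possible network disconnect with primary database"""

print(find_corrupt_sequences(sample))
```

The resulting list of thread/sequence pairs tells you which archive logs need checking at the primary.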

Validating the archive logs at the primary site showed us that the files were indeed valid at the source (primary):

rman target /
validate archivelog sequence 72899;
...
List of Archived Logs
=====================
Thrd Seq     Status Blocks Failing Blocks Examined Name
---- ------- ------ -------------- --------------- ---------------
1    72899   OK     0              350165          H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC
Finished validate at 03-JUN-16

Dumping the log file contents is another way to check whether or not the log file is valid:

ALTER SYSTEM DUMP LOGFILE 'H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC';

So we knew the logs were clean and intact at the primary site, which suggested that something in the log transport path was corrupting them. Further, manually copying the files across and re-registering them on the standby resolved the problem, but only until the next error occurred (not a sustainable workaround):

ALTER DATABASE REGISTER LOGFILE 'H:\FAST_RECOVERY_AREA\SNAPF\ARCHIVELOG\2016_06_03\O1_MF_1_72899_CM3533SG_.ARC';

Oracle were quite helpful in suggesting we check the firewall(s) to ensure the following features were disabled:

  • SQLNet fixup protocol
  • Deep Packet Inspection (DPI)
  • SQLNet packet inspection
  • SQL Fixup
  • SQL ALG (Juniper firewall)
  • Oracle DB-control component DOS

After further investigation, it turned out that the Cisco switches between our primary and standby sites had “SQL*Net inspection” (a form of deep packet inspection) enabled by default. As a result, because we were using the default listener port 1521, packets were being inspected in transit and reaching the standby site in a malformed/corrupted state.

Disabling this feature wasn’t so straightforward, unfortunately, so as a workaround (and to avoid other port 1521 scanning protocols interfering), I opted instead to change the Data Guard listener port from 1521 to 1528 by adding another listener service:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = CLRExtProc)
      (ORACLE_HOME = E:\app\oracle\product\11.2.0.4)
      (PROGRAM = extproc)
      (ENVS = "EXTPROC_DLLS=ONLY:E:\app\oracle\product\11.2.0.4\bin\oraclr11.dll")
    )
  )

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = win02-stby.vbox)(PORT = 1521))
    )
  )

ADR_BASE_LISTENER = E:\app\oracle

# DG listener created to use port 1528, following SQL*Net packet inspection issues
SID_LIST_LISTENER_DG =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = DATAMARTF_DGMGRL) # Data Guard Manager
      (ORACLE_HOME = E:\app\oracle\product\11.2.0.4)
      (SID_NAME = SNAPF)
    )
    (SID_DESC =
      (GLOBAL_DBNAME = SNAPF) # Data Guard Broker Process
      (ORACLE_HOME = E:\app\oracle\product\11.2.0.4)
      (SID_NAME = SNAPF)
    )
  )

LISTENER_DG =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = win02-stby.vbox)(PORT = 1528))
    )
  )

ADR_BASE_LISTENER_DG = E:\app\oracle

Sure enough, after starting up the new LISTENER_DG service, the corruption issues disappeared!

NOTE: Don’t forget to also change the port number at your primary site for your Data Guard TNS entries.
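Before repointing redo transport, it is worth confirming from the primary server that the new port is actually reachable through the firewall(s). Below is a minimal Python sketch of such a check. The function name is my own, and it only proves that a TCP connection can be opened; it says nothing about whether the listener is serving the right database:

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example, using the host and port from the LISTENER_DG definition above:
#   port_is_open("win02-stby.vbox", 1528)
```

If this returns False from the primary, the new port is probably blocked somewhere along the way, and the Data Guard TNS entries will not connect either.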

References:
MAA Best Practices – Oracle Database
Data Guard Redo Transport & Network Best Practices Oracle Database 10g Release 2
SQL*Net (A.K.A Oracle TNS) And Firewalls
Cisco ASA – Configuring Inspection of Database and Directory Protocols
