21.8.11

SAP Note 26837 - MaxDB: Data corruption, error -9026/-9028

Symptom:

Database error message -9026 BAD DATA PAGE
If the database detects such an error, the affected table can only be read from that moment on. Write accesses to the table are then answered with the error message BAD FILE (-9028).

In the SAP system, the error is logged as error -602 (for example, in the syslog or in a short dump).
The exact error message is logged in file knldiag in the database working directory.

Example (Version 7.4 or below):
...
11.11-17:25:53 19669 -512 page 0000270F08000001...0079005500290001
11.11-17:25:53 19669 -9026 BAD DATA PAGE 1355
11.11-17:25:53 19669 -9026 on DEVNO 1 DEV_OFFSET 1755
11.11-17:25:53 19669 -9026 /u4/filedevs/N61Adat1
11.11-17:25:55 19669 -512 page 0000270F08000001...0079005500290001
11.11-17:25:55 19669 -9026 BAD DATA PAGE 1355
11.11-17:25:55 19669 -9026 on DEVNO 1 DEV_OFFSET 1755
11.11-17:25:55 19669 -9026 /u4/filedevs/N61Adat1
11.11-17:25:57 19669 -512 page 0000270F08000001...0079005500290001
11.11-17:25:57 19669 -9026 BAD DATA PAGE 1355
11.11-17:25:57 19669 -9026 on DEVNO 1 DEV_OFFSET 1755
11.11-17:25:57 19669 -9026 /u4/filedevs/N61Adat1
11.11-17:25:57 19669 -11066 vfopen 'd1355.bad'
11.11-17:25:57 19669 -514 BAD FILE: 1359 (ROOT)
...

Other examples of error messages in file knldiag from Version 7.4:

    1. Incorrect checksum:

ERR 4 Data Checksum mismatch; calculated: 38068804 found: 380688
ERR 13 IOMan Bad page on Data volume 1 blockno 262

The working directory contains file Data-1-262.bad.

    2. Incorrect WriteCount:

ERR 5 Data Write count mismatch; header: 1, trailer: 51143
ERR 13 IOMan Bad page on Data volume 1 blockno 228

The working directory contains file Data-1-228.bad.

    3. Incorrect page type:

ERR 6 Data Bad data page type 1
ERR 13 IOMan Bad page on Data volume 1 blockno 262

The working directory contains file Data-1-262.bad.

    4. Header/trailer mismatch:

ERR 4 Kernel Header/Trailer mismatch
ERR 5 Kernel Header : ID: 0 Type: 1 Type2: 2 CheckType: 5 Mode: 0.
ERR 6 Kernel Trailer: ID: 0 Type: 1 Type2: 2 CheckType: 5 Mode: 0.
ERR 13 IOMan Bad page on Data volume 1 blockno 1

The working directory contains file Data-1-1.bad.

    5. Incorrect page number:

ERR 18 IOMan Wrong page 14888 on Data volume 1 blockno 262
ERR 18 Data Bad data page 14888 of filetype 13 identified by root 14888
ERR 53021 B*TREE BAD FILE: 14888 (ROOT)

The working directory contains file Data14888.bad.
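
The messages in examples 1 to 4 all name the data volume and the block number, and the dump file in the working directory appears to be named after both. The following Python sketch extracts these values from such lines. The regular expression and the function names are hypothetical, and the file name pattern is inferred only from the examples above; example 5 uses a different name (Data14888.bad) and is not covered.

```python
import re

# Pattern for lines such as "ERR 13 IOMan Bad page on Data volume 1 blockno 262".
# Inferred from the sample messages only; real knldiag lines may carry extra
# leading fields, so search() is used instead of match().
BAD_PAGE_RE = re.compile(r"Bad page on Data volume (\d+) blockno (\d+)")


def find_bad_pages(lines):
    """Return a (volume, blockno) tuple for every 'Bad page' message."""
    hits = []
    for line in lines:
        m = BAD_PAGE_RE.search(line)
        if m:
            hits.append((int(m.group(1)), int(m.group(2))))
    return hits


def dump_file_name(volume, blockno):
    """File name as seen in examples 1 to 4 (assumed pattern)."""
    return f"Data-{volume}-{blockno}.bad"


# Sample lines copied from example 2 above.
sample = [
    "ERR 5 Data Write count mismatch; header: 1, trailer: 51143",
    "ERR 13 IOMan Bad page on Data volume 1 blockno 228",
]

for volume, blockno in find_bad_pages(sample):
    print(volume, blockno, dump_file_name(volume, blockno))
```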

Other terms

BAD FILE, block corruption, -9028, -602, -9018

Reason and Prerequisites

Error message '-9026 BAD DATA PAGE ...' occurs when, during an I/O request, the page delivered by the operating system does not correspond to the requested database page. Several checks detect this: for example, the page number is checked and a header/trailer check is carried out.

Such errors are caused by the disk periphery: disks, disk controllers, RAID controllers, RAID software, disk caches and so on.
For this reason, the affected hardware components and their accessories must be thoroughly tested after the error occurs, and defective parts must be replaced.

Solution
The database provides some information for the error analysis in knldiag or knldiag.err.

In this example, the database page with the logical number 1355 was to be read. Its position on disk is determined from the control structures: device (volume) 1, offset 1755. The read operation delivers a block whose first and last 8 bytes are also logged. In the example, the first 4 bytes of the page read contain the value x'270F' = 9999, which does not match the required 1355.
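
The comparison described above can be reproduced from the logged lines. This is a minimal Python sketch, assuming that the hex digits are read in the order in which they are printed and that the first 8 digits (4 bytes) hold the page number, as the note states. The function names and regular expressions are hypothetical.

```python
import re

# Sample lines copied from the knldiag excerpt in the Symptom section.
page_line = "11.11-17:25:53 19669 -512 page 0000270F08000001...0079005500290001"
bad_line = "11.11-17:25:53 19669 -9026 BAD DATA PAGE 1355"


def page_number_from_dump(line):
    """Interpret the first 4 logged bytes (8 hex digits) as the page number."""
    m = re.search(r"page ([0-9A-Fa-f]{16})\.\.\.", line)
    if m is None:
        raise ValueError("no page dump found in line")
    return int(m.group(1)[:8], 16)


def requested_page(line):
    """Extract the logical page number the database asked for."""
    m = re.search(r"BAD DATA PAGE (\d+)", line)
    if m is None:
        raise ValueError("no BAD DATA PAGE message found in line")
    return int(m.group(1))


found = page_number_from_dump(page_line)
wanted = requested_page(bad_line)
print(f"found {found}, wanted {wanted}, match: {found == wanted}")
```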

Since the disk periphery is often operated with RAID (1 or 5), the RAID may be implemented in such a way that corrections are not executed synchronously, but only when a read request takes place. For this reason, the read request is repeated twice at intervals of 2 seconds after the error occurs, and an error is only reported if these repeated reads also fail.
An additional file is created in the working directory (*.bad) which contains the complete 8K block as it was delivered by the operating system. This image can be displayed using the MaxDB diagnostics tool x_diagnose.
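
The repeated read described above can be sketched as follows. This is an illustration in Python, not the database kernel's implementation: `read_page` is a hypothetical callable that returns the page number found in the delivered block, and the fake device only simulates a RAID that corrects the block after a number of failed reads.

```python
import time


def read_with_retries(read_page, page_no, retries=2, delay_seconds=2.0,
                      sleep=time.sleep):
    """Read a page, repeating the read up to `retries` times on a mismatch.

    Returns (ok, attempts). The defaults mirror the note: two repetitions
    at intervals of 2 seconds.
    """
    attempts = 0
    for attempt in range(retries + 1):
        attempts += 1
        if read_page(page_no) == page_no:
            return True, attempts
        if attempt < retries:
            sleep(delay_seconds)
    return False, attempts


def make_fake_device(bad_reads, wrong_page=9999):
    """Hypothetical device: the first `bad_reads` reads deliver a wrong page."""
    state = {"calls": 0}

    def read_page(page_no):
        state["calls"] += 1
        return wrong_page if state["calls"] <= bad_reads else page_no

    return read_page


def no_sleep(seconds):
    """Stand-in for time.sleep so the demonstration runs instantly."""


# Pseudo corruption: the second read succeeds, so no error is reported.
recovered = read_with_retries(make_fake_device(1), 1355, sleep=no_sleep)
# Real corruption: all three reads fail, so the error is reported.
failed = read_with_retries(make_fake_device(3), 1355, sleep=no_sleep)
print(recovered, failed)
```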

The root of the affected table, or of the B* tree of the affected table, is output as further important information. This number (1359 in this example) lets you determine the name of the affected table using the Database Manager (CLI) tool:

dbmcli -d <database_name> -u <dbm_user>,<password> -uSQL <sql_user>,<password> sql_execute select * from roots where root = <root_number>

For example:
dbmcli -d N61 -u control,control -uSQL sapr3,sap sql_execute select * from roots where root = 1359
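
If the call is scripted, the argument list can be assembled as in the following Python sketch. It uses only the options shown in the command above; the function name is hypothetical, and the credentials are the sample values from this note. Passing the arguments as a list, for example to `subprocess.run`, avoids the shell expanding the `*` in the SELECT statement.

```python
import shlex


def roots_query_command(database, dbm_user, dbm_password,
                        sql_user, sql_password, root):
    """Build the dbmcli argument list for the ROOTS lookup shown above."""
    root = int(root)  # reject anything that is not a plain number
    sql_words = "select * from roots where root =".split()
    return [
        "dbmcli", "-d", database,
        "-u", f"{dbm_user},{dbm_password}",
        "-uSQL", f"{sql_user},{sql_password}",
        "sql_execute", *sql_words, str(root),
    ]


cmd = roots_query_command("N61", "control", "control", "sapr3", "sap", 1359)
# shlex.join quotes the '*' so that the line is safe to paste into a shell.
print(shlex.join(cmd))
```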

This statement returns the name and an entry indicating whether the object is an index, a primary table or a table for LONG buffers with short LONG columns.

If the table name is known, execute the following SQL statement in the Database Manager (CLI):

dbmcli -d <database_name> -u <dbm_user>,<password> -uSQL <sql_user>,<password> sql_execute check table <table_name>

This statement checks the tree consistency of the operative data of this table. If the table was subject to an asynchronous RAID correction (pseudo corruption) and the B* structures are now faultless again, the statement also resets the file status to OK.

In general, the complete operative dataset should be checked using a verify (check data).
You can use the 'Check -> Database' menu option in the Database Manager (GUI) tool.

If the check data or check table continues to return errors, a recovery of the data is required.

Since a data backup does not run correctly if the data to be backed up contains a BAD DATA PAGE, you can always fall back on the most recent successful complete backup. After restoring it, you must still import the incremental backups or log backups that correspond to your backup scheme, or simply restart if the log information in the log volume (log devspace) has not been overwritten yet.

If the defective objects are exclusively indexes, they can be deleted and recreated using the SAP System Data Dictionary. After the indexes have been recreated, a verify in database status COLD (ADMIN) is required. This only works using the Database Manager (GUI and CLI) tool; the xpu cannot be used in COLD (ADMIN) status.
Due to the defective pages, the B* tree of an index may not have been deleted completely. No error is reported during the deletion process, but it is impossible to create a backup in this state. The verify in COLD status (check data with update) deletes all remaining pages of the deleted B* tree.

If there is no backup for a recovery, your consistency check and backup concept is insufficient.
You cannot correct the corrupt data. Development Support can carry out an analysis as part of the SAP remote consultation (subject to charges, Note 423778) and can try to save the data that is readable. However, SAP cannot provide a guarantee for success.

During remote consulting, individual tables may be restored from other tables. Application Support will decide this.

The following tables can probably be restored by Application Support:
BSAD, BSAK, BSAS, BSID, BSIK, BSIM, BSIS, BSIW, BSIX

Tables that can be created as empty tables and are automatically filled when the system is stopped (even if only one table is affected, all tables listed in the same row below must be emptied for consistency reasons):
D010L, D010Q, D010Y, D010LINF
D020L, D020LINF, D021L, D021LINF
D346T, D342L
DDFTX
D344L
REPOLOAD
DYNPTXTLD, DYNPLOAD
DDLOG
