Wednesday 7 February 2018

Analyzing log volume full situations in HANA

Tags

Analyzing log volume full situations


Symptom

Alerts with ID 2 ("Disk usage") for "storage type: LOG" are generated.
The log volume is full, so the database cannot be started or no longer accepts new requests. Note that HANA Studio or OS commands like "df -h" might show a different picture (indicating no problem at all) when a cluster file system like HP IBRIX or IBM GPFS is used. In this case, use the file-system-specific commands, e.g. "mmdf"/"mmrepquota" for IBM GPFS.
The database trace files of the affected service contain "rc=24: no space left on device" errors for the basepath_logvolumes or basepath_logbackup.

Reason and Prerequisites

The disk usage of the log volumes only grows if there are no more log segment files available for overwriting/re-use (i.e. in state FREE, see DB view M_LOG_SEGMENTS). Log segments become available for re-use once they have been successfully backed up (in log_mode = normal) and are no longer required for a database restart. The aim of this SAP Note is to help you identify the root cause why log segments are not freed and remain in state TRUNCATED (indicating that the log segment has not yet been backed up successfully). In a log volume full situation, no free log segments are available for re-use and no new log segments can be allocated due to the limited disk quota/size, so the database cannot be started or stops accepting requests. Apart from bringing the system up again as soon as possible, the root cause needs to be investigated and resolved; otherwise you may soon run into a log volume full situation again.
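
If the database still accepts SQL requests, a quick overview of how much log space is tied up per state can be obtained with the sketch below. It assumes that M_LOG_SEGMENTS exposes a TOTAL_SIZE column in bytes (verify on your revision):

-- Sketch: count and total size of log segments per service port and state.
SELECT HOST, PORT, STATE, COUNT(*) AS SEGMENTS,
       ROUND(SUM(TOTAL_SIZE) / 1024 / 1024) AS TOTAL_MB   -- TOTAL_SIZE assumed to be in bytes
FROM "PUBLIC"."M_LOG_SEGMENTS"
GROUP BY HOST, PORT, STATE
ORDER BY HOST, PORT, STATE;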


Solution

Do not remove data files or log files using operating system tools, as this will corrupt the database! Follow SAP Note 1679938 to temporarily free up space in the log volume; this way you should be able to start the database for root cause analysis and problem resolution (after the situation is resolved, undo the workaround). The questionnaire below may help you narrow down the root cause:

1) Which database service is affected and what is the current state of the allocated log segments?

In case the database accepts SQL requests:

Execute the below SQL statement:

select b.host, b.service_name, a.state, count(*) from "PUBLIC"."M_LOG_SEGMENTS" a join "PUBLIC"."M_SERVICES" b on (a.host = b.host AND a.port = b.port) group by b.host, b.service_name, a.state;

1.a) If most of the log segments are in state FREE, this is an indicator of a similar log volume full situation in the past. Although the log volume appears to be full, the free, already allocated log segments (e.g. of the indexserver) will be re-used. You can release the space occupied by the allocated log segments in your file system by executing the SQL statement:

ALTER SYSTEM RECLAIM LOG;

1.b) If most of the log segments are in state TRUNCATED, there appears to be a problem with the log backup, which needs to be identified using the next steps. If you see a lot of log segments in state BACKEDUP, this indicates a problem with a (hanging) savepoint. See the table at the bottom of this section for details on the different states.
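
If BACKEDUP segments dominate and a hanging savepoint is suspected, a rough first check while the database still accepts SQL is to look at the savepoint monitoring view. This is a sketch only; availability and columns of the view can differ between revisions:

-- Sketch: list recent savepoints; a very old START_TIME without a more recent
-- entry can hint at a hanging savepoint (column names assumed).
SELECT * FROM "PUBLIC"."M_SAVEPOINTS" ORDER BY START_TIME DESC;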

In case the database is offline:

Execute the below commands in the log directory, e.g. /hana/log/<SID>/mnt00001 for single-node systems. This way you can identify the affected service; furthermore, it may help to decide which volume to choose for the temporary workaround presented in SAP Note 1679938.

File count per subdirectory:

for i in $(find . -type d ) ; do 
    echo $i ; 
    ( find $i -type f | wc -l ) ; 
done

./hdb00001
5
./hdb00002
8
./hdb00003
95

Disk usage per subdirectory:

du -h

130M    ./hdb00001
50M     ./hdb00002
93G     ./hdb00003

This gives you the file count and disk usage for each of the subdirectories hdb00001, hdb00002, hdb00003, ... (each representing a volume). You can identify the relevant service by searching for "volume=" in the topology.txt contained in the full system info dump, or by searching the trace file of each service individually for "assign to a volume started".

Using the command "hdblogdiag seglist" on the log segment directory identified, you can find out the state of the log segments:

hdblogdiag seglist /hanalog/SOH/mnt00001/hdb00003
LogSegment[0/90:0x70cb3180-0x7108fa40(0xf723000B)/<...>,TS=2014-10-24 12:31:59.480279/Free/0x0]@0x00007fd7696bfb00
LogSegment[0/92:0x7108fa40-0x7108fb40(0x4000B)/<...>,TS=2014-11-04 02:19:58.774171/Free/0x0]@0x00007fd7696c0000
LogSegment[0/91:0x7108fb40-0x71091300(0x5f000B)/<...>,TS=2014-11-07 00:46:58.377588/Free/0x0]@0x00007fd7696bfd80
LogSegment[0/3:0x71091300-0x710f6780(0x1952000B)/<...>,TS=2014-11-07 02:18:50.000233/Free/0x0]@0x00007fd76a710400
LogSegment[0/6:0x710f6780-0x710f6900(0x6000B)/<...>,TS=2014-11-07 02:50:16.406526/Free/0x0]@0x00007fd76a710b80
LogSegment[0/0:0x710f6900-0x71144f40(0x1399000B)/<...>,TS=2014-11-07 03:02:03.221834/Free/0x0]@0x00007fd76a70fc80  
...

To get the count of log segments in state Free/Truncated/Writing, execute for example the below commands:

sohdb:/hanalog/SOH/mnt00001/hdb00003> hdblogdiag seglist /hanalog/SOH/mnt00001/hdb00003 | grep -i Free | wc -l

91

sohdb:/hanalog/SOH/mnt00001/hdb00003> hdblogdiag seglist /hanalog/SOH/mnt00001/hdb00003 | grep -i Truncated | wc -l

0

sohdb:/hanalog/SOH/mnt00001/hdb00003> hdblogdiag seglist /hanalog/SOH/mnt00001/hdb00003 | grep -i Writing | wc -l

1

In general, a log-segment can have one of the following states:

State        Description
Writing      Log entries are currently being written to this segment.
Closed       The segment has been closed by the writer.
Truncated    The segment has been truncated but not yet backed up; the log backup will remove it.
BackedUp     The segment has already been backed up, but a savepoint has not yet been written, so it must be kept for instance recovery.
Free         The segment is free for re-use.


2) Do you have automated log backup enabled?

You can check the corresponding database configuration parameter "enable_auto_log_backup" in global.ini either in the Configuration tab in HANA Studio when the database is online, or on file system level in the location where the custom configuration files are located (log on as <sid>adm on the HANA appliance):

cdglo (usually resolves to /usr/sap/<SID>/SYS/global)

cd hdb/custom/config

If it has been deactivated (value = no), revert the parameter to its default value "yes" to activate automated log backup again.
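
If the database accepts SQL requests, the effective value can also be checked and, if required, re-enabled with SQL. A sketch, assuming the parameter resides in the [persistence] section of global.ini:

-- Check the configured value on all layers.
SELECT FILE_NAME, LAYER_NAME, SECTION, KEY, VALUE
FROM "PUBLIC"."M_INIFILE_CONTENTS"
WHERE KEY = 'enable_auto_log_backup';

-- Re-enable automated log backup on system level.
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('persistence', 'enable_auto_log_backup') = 'yes' WITH RECONFIGURE;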



3) Where is the storage location for the log backups?

Does the database parameter "basepath_logbackup" (global.ini > [persistence]) point to a mounted/external storage location or to the local file system?

3.a) If it resolves to external storage, does the trace file of the affected service show I/O-related errors when writing to this location? If so, check the external storage for availability and I/O errors.

3.b) If it resolves to the local file system, are "basepath_logbackup" and "basepath_logvolumes" located on the same disk? If so, the log backups have very likely filled up the entire disk, so neither can new log segments be created nor can old ones be backed up. To resolve the issue - depending on your backup strategy - either move the log backups to a different location to free up space, or perform a full data backup and delete older (log) backups from the catalog and file system using the HANA Studio backup editor (for details see also the SAP HANA Administration Guide).
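
When SQL access is possible, both paths can be read directly from the configuration, for example:

-- Show where the log volumes and the log backups are written to.
SELECT FILE_NAME, LAYER_NAME, SECTION, KEY, VALUE
FROM "PUBLIC"."M_INIFILE_CONTENTS"
WHERE FILE_NAME = 'global.ini'
  AND SECTION = 'persistence'
  AND KEY IN ('basepath_logvolumes', 'basepath_logbackup');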



4) What is the throughput when writing to the basepath_logbackup location? To check this, execute either:

4.a) The SQL Statement below:

select v.host, v.port, v.service_name, s.type, round(s.total_read_size / 1024 / 1024, 3) as "Reads in MB", round(s.total_read_size / case s.total_read_time when 0 then -1 else s.total_read_time end, 3) as "Read Throughput in MB", round(s.total_read_time / 1000 / 1000, 3) as "Read Time in Sec", trigger_read_ratio as "Read Ratio", round(s.total_write_size / 1024 / 1024, 3) as "Writes in MB", round(s.total_write_size / case s.total_write_time when 0 then -1 else s.total_write_time end, 3) as "Write Throughput in MB", round(s.total_write_time / 1000 / 1000, 3) as "Write Time in Sec", trigger_write_ratio as "Write Ratio" from "PUBLIC"."M_VOLUME_IO_TOTAL_STATISTICS_RESET" s, PUBLIC.M_VOLUMES v where s.volume_id = v.volume_id and type ='LOG_BACKUP' order by type, service_name, s.volume_id;

4.b) Or create a dummy file in the basepath_logbackup using:

Writing to the log area:
dd if=/dev/zero of=/hanalog/SOH/mnt00001/testfile bs=1M count=1024

Reading from the log area:
dd if=/hanalog/SOH/mnt00001/testfile of=/tmp/testfile bs=1M count=1024
Note that the testfile first needs to be written before it can be read back again. Please be very careful with the input (if) / output (of) locations used as you can easily overwrite existing files!

4.c) Set the below trace level to print additional information:

Enable the trace:

ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') SET ('trace', 'stream') = 'interface' with reconfigure;

Disable the trace:

ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') UNSET ('trace', 'stream') with reconfigure;


Note that in case the database does not accept SQL requests, you need to maintain this parameter by directly adding:

[trace]
stream = interface

in global.ini in the path holding the custom parameter files (see point 2). In addition, the database needs to be reconfigured (by calling "hdbnsutil -reconfig", which might not work depending on the state of the database) or restarted to make the change active. This increased trace level works for both data and log backups:

[28055]{-1}[-1/-1] 2014-11-13 20:09:28.755532 a Stream ChannelUtils.cpp(00365) : SynchronousCopyHandler::doCopy finished, source="/logsegment_000_00000002.dat" (mode= RW, access= rwrwr-, flags= DIRECT|MUST_EXIST|LAZY_OPEN), factory= (root= "/usr/sap/UPD/SYS/global/hdb/log/mnt00001/hdb00002/" (access= rwrwr-, flags= AUTOCREATE_DIRECTORY, usage= LOG, fs= ext3, config= (AsyncWriteSubmitActive=auto,AsyncWriteSubmitBlocks=new,AsynReadSubmit=off,#SubmitQueues=1,#CompletionQueues=1))), destination="/.log_backup_2_0_538414336_538623872.1415905765532" (mode= W, access= rwrwr-, flags= DIRECT|UNALIGNED_SIZE|TRUNCATE|MULTI_WRITERS), factory= (root= "/usr/sap/UPD/HDB00/backup/log/" (access= rwrwr-, flags= AUTOCREATE_PATH|DISKFULL_ERROR, usage= LOG_BACKUP, fs= ext3, config= (AsyncWriteSubmitActive=auto,AsyncWriteSubmitBlocks=new,AsynReadSubmit=off,#SubmitQueues=1,#CompletionQueues=1))) copySize=13410304, copyTimeUs=2738384 totalReceiveTimeUs=2647457, minReceiveThroughput=4.83MB/s, maxReceiveThroughput=4.83MB/s, avgReceiveThroughput=4.83MB/s, receiveCount=1 totalSendTimeUs=88337, minSendThroughput=144.78MB/s, maxSendThroughput=144.78MB/s, avgSendThroughput=144.78MB/s, sendCount=1



5) What is the size of the backup catalog, the time needed for its backup and the log backup frequency?

What is the size of the backup catalog?

In case the database accepts SQL requests:

The below SQL statement gives you the backup catalog size in MB of the latest 10 backup catalog backups:

select top 10 round(backup_size/1024/1024,2) from "PUBLIC"."M_BACKUP_CATALOG_FILES" where source_type_name = 'catalog' order by backup_id desc;


In case the database is offline:

Go to the basepath_logbackup and execute in the log subdirectory:

ls -lth log_backup_0_0_0_0* | head -n 10

linux-upd:/usr/sap/UPD/HDB00/backup/log> ls -lth log_backup_0_0_0_0* | head -n 10
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:32 log_backup_0_0_0_0.1415313121836
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:28 log_backup_0_0_0_0.1415312899446
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:28 log_backup_0_0_0_0.1415312898184
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:28 log_backup_0_0_0_0.1415312896945
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:28 log_backup_0_0_0_0.1415312895417
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:27 log_backup_0_0_0_0.1415312823518
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:17 log_backup_0_0_0_0.1415312221722
-rw-r--r-- 1 updadm sapsys   11M Nov  6 23:02 log_backup_0_0_0_0.1415311321688
-rw-r--r-- 1 updadm sapsys   11M Nov  6 22:47 log_backup_0_0_0_0.1415310421760
-rw-r--r-- 1 updadm sapsys   11M Nov  6 22:47 log_backup_0_0_0_0.1415310419517



What is the value of the database configuration parameter "log_backup_timeout_s" (in global.ini)? Is this parameter set to a value smaller than the default of 900 seconds? If this amount of time passes before a segment is full, the segment is closed prematurely and put into the log backup queue.

Thus, if the log backup frequency is too high and the "backup catalog" backup itself (due to its size) takes longer than writing a log backup, this can cause a queuing situation in which log segments cannot be released because their "backup catalog" backups have not yet been performed. You can identify the "backup catalog" backups by the below lines in the backup.log:

2014-11-06T20:46:10+01:00 P026822 14986a489b2 INFO LOGBCKUP state of service: nameserver, linux-upd:30001, volume: 0, BackupExecuteCatalogBackupInProgress
2014-11-06T20:46:10+01:00 P026822 14986a489b2 INFO LOGBCKUP state of service: nameserver, linux-upd:30001, volume: 0, BackupExecuteCatalogBackupFinished

In this situation, either reduce the size of the backup catalog (delete older backups) or improve the I/O performance to the log backup location. If a lot of log segments are generated due to a high amount of data changes, you may consider increasing the log segment size of the particular service accordingly (see the "log_segment_size_mb" parameter in the ini file of the particular service). A query sketch to gauge the log backup frequency follows below.
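
To gauge the log backup frequency without parsing backup.log, a hedged query against the backup catalog can be used (ENTRY_TYPE_NAME value and column names assumed; verify M_BACKUP_CATALOG on your revision):

-- Sketch: number of log backups per hour over the retained catalog history.
SELECT TO_VARCHAR(SYS_START_TIME, 'YYYY-MM-DD HH24') AS BACKUP_HOUR,
       COUNT(*) AS LOG_BACKUPS
FROM "PUBLIC"."M_BACKUP_CATALOG"
WHERE ENTRY_TYPE_NAME = 'log backup'
GROUP BY TO_VARCHAR(SYS_START_TIME, 'YYYY-MM-DD HH24')
ORDER BY BACKUP_HOUR DESC;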



6) What is the time difference between log backup process states "100%" and "BackupLogSuccess"?

In general, this is the time consumed by updating the backup catalog itself and writing it to the (external) file system location. Only after these steps have completed is a log backup considered successful. If you still see a lot of log segments in state TRUNCATED, check the time difference between the log backup process states "100%" and "BackupLogSuccess" for a particular log backup in backup.log.

backup.log

2014-11-17T04:44:00+00:00  P012889      149bc0de8ff INFO    LOGBCKUP progress of service: indexserver, linux-upd:30003, volume: 4, 88% 939524096/1073229824
2014-11-17T04:44:02+00:00  P012889      149bc0de8ff INFO    LOGBCKUP progress of service: indexserver, linux-upd:30003, volume: 4, 100% 1073229824/1073229824
...
2014-11-17T04:54:43+00:00  P012889      149bc0de8ff INFO    LOGBCKUP state of service: indexserver, linux-upd:30003, volume: 4, BackupLogSuccess

In this example, the time difference is about 10 minutes, which is very long and could be responsible for the queuing situation. If the I/O performance is OK (see step 4), first try to reduce the size of the backup catalog (which should result in a shorter log backup post-processing time). If reducing the backup catalog did not help or is not feasible, you can also increase the values of the two parameters log_segment_size_mb and log_backup_timeout_s in global.ini, so that fewer log backups are performed and log backup queuing in this particular scenario is avoided.

[persistence]

log_segment_size_mb = 4096
log_backup_timeout_s = 1200

Note that changing log_backup_timeout_s has an impact on the Recovery Point Objective (this interval is the maximum time span of data that may be lost if the log area cannot be used for recovery). The value 1200 (seconds) is a starting point and can be increased up to 3600 (default = 900). See the SAP HANA Administration Guide (SPS 8), chapter 4.2.4.2.3 "Log Backup Options", for details.
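
If the database accepts SQL requests, the two parameters can also be adjusted online instead of editing global.ini on file system level; a sketch using the example values from above:

-- Increase the log segment size and the log backup timeout on system level.
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') SET ('persistence', 'log_segment_size_mb') = '4096' WITH RECONFIGURE;
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') SET ('persistence', 'log_backup_timeout_s') = '1200' WITH RECONFIGURE;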



7) Do you use third-party backup software (using Backint) to perform the log backups?

Is the database configuration parameter "log_backup_using_backint" in global.ini set to true? If not, should it be enabled? If yes, check the below steps.

7.a) Are there errors reported in the backint.log?

7.b) When checking the backup.log, did a previous log backup attempt of a particular service terminate prematurely because of an OOM situation or a particular signal? If yes, is the corresponding pipe and backup agent process - identified by the pid (process ID) from the backint.log - still running on operating system level (use "ps -ef | grep <pid>")? If so, kill the dangling backup agent process manually.

backint.log

2014-08-27 01:56:20.000 backint started:
  command: /usr/sap/UPD/SYS/global/hdb/opt/hdbbackint -f backup -p /hana/shared/UPD/global/hdb/opt/initEH0.utl -i /var/tmp/hdbbackint_UPD.3wmm3a -  /var/tmp/hdbbackint_UPD.Qkgptl -u UPD -s 1409097380583 -c 1 -l LOG
  pid: 13506
  input:
  #SOFTWAREID "backint 1.04" "HANA HDB server 1.00.74.03.392810"
  #PIPE "/usr/sap/UPD/SYS/global/hdb/backint/log_backup_3_0_11908452928_11909495616"



8) If none of the steps presented so far helped to identify the issue:

Open an incident with SAP HANA Product Support on component HAN-DB or HAN-DB-BAC (for backup related issues) and provide the below information:

Increase the database trace level for component "backup" to level debug and re-create the issue.
Enable the trace:  ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') SET ('backup', 'trace') = 'debug' with reconfigure;
Disable the trace: ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') UNSET ('backup', 'trace') with reconfigure;
Create a full-system-info-dump according to SAP Note 1732157.
In case your hardware vendor is IBM, please also provide the output of the check tool described in SAP Note 1661146 (IBM Check Tool for SAP HANA appliances).
Open an SSH remote connection to your system (see SAP Note 1275351) and add valid OS logon credentials to the Secure Area.

For more details, refer to KBA 2083715.


HANA Log Volume full

Tags

HANA DiskFullEvent on Log Volume

Symptom
You notice that the log volume is full and the database does not accept any new requests.

Additionally, the trace of the service with the full log volume contains DiskFullEvent messages:

LoggerImpl.cpp(00095) : Logger notified of new DiskFull: DiskFullEvent[id= XXX, path= /hana/log/SID/mntXXXXX/hdbXXXXX/, state= NEW]

Reason and Prerequisites

The log volume is full and the database does not respond or start anymore.

Solution

Stop the database:

HDB stop

Change directory into folder mnt00001 of the log volume (default: /usr/sap/<SID>/global/hdb/log):

cd /usr/sap/<SID>/global/hdb/log/mnt00001

You have to temporarily move one of the log volume directories to another location with enough free space.
You should free at least 2 GB to ensure that the database has enough space to start. To find out the space consumption of each volume, execute:

du -sh *

Move a volume which consumes at least 2 GB of space (e.g. hdb00003) to a location with sufficient free space, e.g. to the data volume (default: /usr/sap/<SID>/global/hdb/data):

mv hdb00003 /usr/sap/<SID>/global/hdb/data

Create a symbolic link to the new folder in the old location:

ln -s /usr/sap/<SID>/global/hdb/data/hdb00003   /usr/sap/<SID>/global/hdb/log/mnt00001/hdb00003

Start the database:

HDB start

Wait until log backups are performed.
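
To verify that the log backups have caught up before reclaiming, you can check the segment states again (same monitoring view as in the note above); most segments should now be in state FREE:

SELECT STATE, COUNT(*) FROM "PUBLIC"."M_LOG_SEGMENTS" GROUP BY STATE;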

Use the following SQL statement to clean up the log volume (execute the query in the tenant DB):

ALTER SYSTEM RECLAIM LOG;

Stop the database again and remove the symbolic link:

rm -f /usr/sap/<SID>/global/hdb/log/mnt00001/hdb00003

Move the log volume back to its original location:

mv /usr/sap/<SID>/global/hdb/data/hdb00003 /usr/sap/<SID>/global/hdb/log/mnt00001

Start the database:

HDB start

For more details, refer to KBA 1679938.