Question:
I am looking at the SQL Server error log and the cluster log (cluster.log) for each node in the cluster.
I can see a couple of errors scattered around, such as:
- failed a periodic health check on file share
- failing over group
- because of 'Shutting down'
- resource x is causing group y to failover
…
Which specific error record should I look for that guarantees the WSFC initiated either a restart or a failover?
Answer:
You can use the Get-ClusterLog command to generate the cluster log for each node.
Example command:
Get-ClusterLog -Node Node01, Node02, Node03 -Destination 'C:\Temp' -UseLocalTime
The above command writes one log file per node (for example Node03_cluster.log) to C:\Temp.
Once you have the output, you can filter those files as follows:
- To show only the errors from the log file:
Select-String -Path C:\Temp\Node03_cluster.log -Pattern ' ERR '
- To show only the failoverCount messages:
Select-String -Path C:\Temp\Node03_cluster.log -Pattern 'failoverCount'
You can mix and match the string filters to get the information you are looking for.
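As a rough sketch of what "mix and match" could look like, the helper below keeps only the lines that contain every one of several literal strings. The function name Select-ClusterLogLine and its parameters are made up for illustration; only Select-String and Where-Object are real cmdlets, and the path in the usage comment is an example.

```powershell
# Hypothetical helper: return the log lines that contain ALL of the given literal strings.
function Select-ClusterLogLine {
    param(
        [Parameter(Mandatory)] [string]   $Path,
        [Parameter(Mandatory)] [string[]] $AllOf
    )
    # The first string does the initial scan of the file (literal, case-insensitive).
    $hits = @(Select-String -Path $Path -SimpleMatch -Pattern $AllOf[0])
    # Every remaining string narrows the result further (also case-insensitive).
    foreach ($text in ($AllOf | Select-Object -Skip 1)) {
        $hits = @($hits | Where-Object {
            $_.Line.IndexOf($text, [StringComparison]::OrdinalIgnoreCase) -ge 0
        })
    }
    $hits
}

# Example usage (adjust the path to your own log file):
# Select-ClusterLogLine -Path 'C:\Temp\Node03_cluster.log' -AllOf '[RCM]', 'MoveType::Failover'
```

-SimpleMatch is used so that characters such as [ and ] in '[RCM]' are treated as plain text rather than as a regular expression.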
Sample from one of our instances with the second filter (failoverCount) applied:
Line 40: ObjectId,ObjectName,resources,_acceptOwnershipCounts,_pOwnerNode,_state,_stateCounter,_markedBusyBy,_failoverInProgress,_failoverCount,_lastFailoverTime,_beingForcefullyDeleted,_operationFlags,_bounceBackFlags,_waitStart,_placementAttempts,_groupType,_numPreemptions,_numBlindPreemptions,_waitingForFirstPlacement,_priority,_defaultOwner,_flags,_persistentState,_failoverThreshold,_failoverPeriod,_autoFailbackType,_failbackWindowStart,_failbackWindowEnd,_description,_groupStartDelay,_lastOnlineOffline,antiAffinityClassNames,PreferredSite,_preferredOwners,_isCore,_lastOnlineNode,_groupStatusInformation,_moveTarget,_moveTargetBirthdate,_failoverTarget,_moveType,_isTargetedMove,_previousOwner,_queuedTarget,_hasIssuedMoveWithThisQueuedTarget,_targetedQueue,_savedLastOperationStatusCodeDuringQueue,_onlineTime,lastStateChangeTime,lastSeenMoveTime_GetSystemTime,lastSeenMoveTime_NodeId,_coldStartSetting,_placementOptions,providers
Line 10958: [Verbose] 000012e0.000036bc::2022/05/02-07:34:14.124 INFO [RCM] move of group AG01 from Node02(2) to Node03(3) of type MoveType::Failover is about to succeed, failoverCount=1, lastFailoverTime=2022/05/02-07:31:16.711 targeted=false
Line 15455: [Verbose] 000012e0.00002b08::2022/05/02-07:34:23.178 DBG [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (1, 2022/05/02-07:31:16.711)
Line 15457: [Verbose] 000012e0.00002b08::2022/05/02-07:34:23.178 WARN [RCM] Failing over group AG01, failoverCount 2, last time 2022/05/02-07:31:16.711.
Line 16165: [Verbose] 000012e0.0000139c::2022/05/02-07:34:24.140 INFO [RCM] move of group AG01 from Node03(3) to Node02(2) of type MoveType::Failover is about to succeed, failoverCount=2, lastFailoverTime=2022/05/02-07:34:23.170 targeted=false
Line 16994: [Verbose] 000012e0.000036bc::2022/05/02-07:37:58.017 INFO [RCM] move of group AG01 from Node02(2) to Node03(3) of type MoveType::Failover is about to succeed, failoverCount=3, lastFailoverTime=2022/05/02-07:35:00.680 targeted=false
Line 19109: [Verbose] 000012e0.00002a1c::2022/05/02-07:38:01.398 DBG [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (3, 2022/05/02-07:35:00.680)
Line 19111: [Verbose] 000012e0.00002a1c::2022/05/02-07:38:01.398 WARN [RCM] Failing over group AG01, failoverCount 4, last time 2022/05/02-07:35:00.680.
Line 19608: [Verbose] 000012e0.00001c6c::2022/05/02-07:38:02.370 INFO [RCM] move of group AG01 from Node03(3) to Node02(2) of type MoveType::Failover is about to succeed, failoverCount=4, lastFailoverTime=2022/05/02-07:38:01.385 targeted=false
Line 29278: [Verbose] 000012e0.00001d68::2022/05/02-08:27:51.277 INFO [RCM] move of group Available Storage from Node03(3) to w8mssql110agp0(1) of type MoveType::Drain is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=false
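In the sample above, the "move of group" lines carry the group name, the source and target node, the move type and the failoverCount, and the Drain entry shows that not every move is a failover. A minimal sketch for turning such lines into objects and keeping only MoveType::Failover follows. The regular expression is my own guess based on the line shape in this sample, not a documented log format, and the two input lines are copied from the sample with the "Line N: [Verbose]" prefix removed.

```powershell
# Two lines taken from the sample above. In practice you could feed in
# (Select-String -Path <log> -Pattern 'move of group').Line instead.
$lines = @(
    '000012e0.000036bc::2022/05/02-07:34:14.124 INFO [RCM] move of group AG01 from Node02(2) to Node03(3) of type MoveType::Failover is about to succeed, failoverCount=1, lastFailoverTime=2022/05/02-07:31:16.711 targeted=false',
    '000012e0.00001d68::2022/05/02-08:27:51.277 INFO [RCM] move of group Available Storage from Node03(3) to w8mssql110agp0(1) of type MoveType::Drain is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=false'
)

# Assumed line shape; adjust if your log lines look different.
$pattern = '::(?<Time>\S+)\s+\w+\s+\[RCM\] move of group (?<Group>.+?) from (?<From>\S+)\(\d+\) to (?<To>\S+)\(\d+\) of type MoveType::(?<Type>\w+) .*failoverCount=(?<Fc>\d+)'

$moves = foreach ($line in $lines) {
    if ($line -match $pattern) {
        # $Matches is a hashtable, so index it by group name.
        [pscustomobject]@{
            Time          = $Matches['Time']
            Group         = $Matches['Group']
            From          = $Matches['From']
            To            = $Matches['To']
            MoveType      = $Matches['Type']
            FailoverCount = [int]$Matches['Fc']
        }
    }
}

# Keep only the moves the log itself labels as failovers.
$moves | Where-Object { $_.MoveType -eq 'Failover' }
```

Sorting or grouping the resulting objects by Group and Time may make it easier to line the moves up against the SQL Server error log.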
Hope this information helps.