Channel: High Availability (Clustering) forum
Viewing all 6672 articles

W2008 R2 SP1 - Add node validate cluster - losing disk through event 1568


We had a problem adding a third node to our existing cluster: the operation failed with a communication timeout. We therefore chose to update the servers and retry with the latest fix levels.

When validating the cluster in order to add a third node, we saw in the validation log:

List disks visible to two or more nodes that will be validated for cluster compatibility. Online clustered disks will be excluded.

Disk with identifier 6390744f has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

Disk with identifier ca0db766 has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

And:

Cluster disk 8 from node SVR01.domain.local has 8 usable path(s) to storage target
Cluster disk 8 is not managed by Microsoft MPIO from node SVR02.domain.local

Cluster disk 8 is not managed by Microsoft MPIO from node svr03.domain.local

There are 11 disks, so 2 were excluded from validation and 1 disk failed the MPIO check. That is strange, because the disk is definitely managed by MPIO on SVR02, the existing cluster node.
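When a validation report covers many disks, it can help to tabulate the MPIO lines per disk and node. The following is only a minimal sketch, assuming the log lines have exactly the two shapes quoted above. The function names `mpio_summary` and `inconsistent_disks` are hypothetical helpers, not part of any Microsoft tool.

```python
import re
from collections import defaultdict

# Patterns for the two MPIO line shapes quoted in the post (assumed format).
MANAGED = re.compile(r"Cluster disk (\d+) from node (\S+) has (\d+) usable path")
UNMANAGED = re.compile(r"Cluster disk (\d+) is not managed by Microsoft MPIO from node (\S+)")

def mpio_summary(log_text):
    """Return {disk_number: {lowercased_node_name: path_count, or None if unmanaged}}."""
    summary = defaultdict(dict)
    for line in log_text.splitlines():
        m = MANAGED.search(line)
        if m:
            summary[int(m.group(1))][m.group(2).lower()] = int(m.group(3))
            continue
        m = UNMANAGED.search(line)
        if m:
            summary[int(m.group(1))][m.group(2).lower()] = None
    return dict(summary)

def inconsistent_disks(summary):
    """Disks reported as MPIO-managed on some nodes but not on others."""
    return sorted(
        disk for disk, nodes in summary.items()
        if any(v is None for v in nodes.values())
        and any(v is not None for v in nodes.values())
    )

sample = """\
Cluster disk 8 from node SVR01.domain.local has 8 usable path(s) to storage target
Cluster disk 8 is not managed by Microsoft MPIO from node SVR02.domain.local
Cluster disk 8 is not managed by Microsoft MPIO from node svr03.domain.local
"""
print(inconsistent_disks(mpio_summary(sample)))
```

A disk that appears in this list is seen differently by different nodes, which is worth checking on each node before trusting the rest of the validation run.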

And on every SVR node:

Getting SCSI page 83h VPD descriptors for cluster disk 8 from node SVR01.domain.local
SCSI page 83h VPD descriptors for cluster disk 8 and 9 match

SCSI page 83h VPD descriptors for cluster disk 8 and 10 match

At the end of this test:

An error occurred while executing the test.
Specified argument was out of the range of valid values.
Parameter name: percentage

So it failed the validation test. We checked the cluster event log and saw no errors, some warnings and everything was online. We logged in to the VMs to check the event logs and on one server we were welcomed by a screen saying that a disk needed its MBR record to be set.

When checking the disk in disk management on the node we saw it was unallocated with status reserved. When looking under the Storage resource of the Cluster we can see the disk is online but the volume path is not there.

When looking at the cluster event log we can see:

Event 1568 - Cluster disk resource 'SQLProd_Log' found the disk identifier to be stale. This may be expected if a restore operation was just performed or if this cluster uses replicated storage. The DiskSignature or DiskUniqueIds property for the disk resource has been corrected.

This is a pass-through disk, and it is the disk on which the VM wanted to set the MBR record.

We removed the storage resource, the disk, MPIO and the SAN volume. We then exposed a new SAN volume, set up MPIO and the disk, added the new storage resource and restored the data.

What can cause validating a cluster to create such a potentially disastrous problem?

TIA,

Fred

 




disk failover works only in one direction winserver 2012R2


I have been banging my head against the wall trying to figure this out...

I have a cluster with WS2012R2

The cluster is on 2 HP DL360 G7 servers with all the latest ROMs, drivers, etc.

Connection to the disks is via Emulex AH403A dual-port fiber cards (the servers are identical).

3 disks are presented in the failover cluster manager.

When I try to move the disk from server1 (owner) to server2, its status goes to Failed with server2 as owner.

The error is :

Cluster resource 'Cluster Disk 1' of type 'Physical Disk' in clustered role '103f5606-e10d-46bd-83b7-2e4e770b5112' failed. The error code was '0x80070490' ('Element not found.').

I try to bring it online and it stays failed. I then move it back to server1 and it goes online.

To move it I need to take it offline, move it, and then bring it online; this works.

If I then move the disk, whose owner is now server2, back to server1, it works without any problems (like it should).

I can't figure out why this move only works correctly in one direction. I have all drivers and ROMs up to date.

Any help would be appreciated....
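For readers unfamiliar with codes such as 0x80070490, an HRESULT packs a severity bit, a facility and a status code into 32 bits. The sketch below only illustrates that standard layout. The helper name `decode_hresult` is made up for this example, and decoding the value does not by itself explain why the disk fails to come online.

```python
FACILITY_WIN32 = 7  # facility value used for wrapped Win32 error codes

def decode_hresult(hresult):
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    value = hresult & 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),        # bit 31: severity (1 = failure)
        "facility": (value >> 16) & 0x7FF,   # bits 16-26: facility
        "code": value & 0xFFFF,              # bits 0-15: status code
    }

info = decode_hresult(0x80070490)
print(info)
print(info["facility"] == FACILITY_WIN32)
```

Here the facility is 7 (Win32) and the code is 1168, which is the Win32 error ERROR_NOT_FOUND, matching the 'Element not found.' text in the event.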

SERVER 2012 FAILOVER CLUSTER – HYPER-V – Errors after Security Updates


I set up a Windows Server 2012 Failover Cluster with two hosts and a backend SAS SAN (Dell MD3200 with Dell Dual Port 6Gbps HBAs).

I did a number of tests, and failover of VMs from one host to the other worked great when restarting one host or the other. Live migration and moving storage ownership between the two were working. One host was missing a few Windows updates, so I installed them to bring it completely up to date. Now, after a restart (and I’ve tried several restarts), the cluster doesn’t come up immediately; it takes about 5 minutes of thinking and trying before it comes up when logged into Windows.

Here were the updates done:

KB3042058

KB890830

KB3080446

KB3088195

KB3058163

KB3093983

KB3097966

 

Should I just try removing one at a time and seeing which culprit caused the issue? Or is there a better way at handling this so that I can still have the windows updates?

The Critical error in the system log is:

Microsoft-Windows-FailoverClustering

1146

Critical

Task Category: Resource Control Manager

The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.  Please determine which resource and resource DLL is causing the issue and verify it is functioning properly.

SERVER 2012 FAILOVER CLUSTER – HYPER-V – MPIO for Dell SAS SAN Configuration


I have a Server 2012 Failover Cluster for Virtual Machines configured and seems to be working fine when testing failovers, etc. This is on two hosts.

My backend SAN is a Dell MD3200 SAN (firmware fully up to date). HBAs are Dell SAS 6Gbps Dual Port with each port filled and connected to both controllers.

I can’t find any clear documentation on how to properly set up MPIO on the Dell SAS HBAs. There’s plenty of documentation on how to do it with iSCSI, but I can’t find anything for SAS.

On one server, where I installed the Dell Modular Disk Storage Array, the two LUNs show up as one disk in Device Manager, and when I right-click > Properties there’s an MPIO tab. On the 2nd server, each LUN shows up twice, and when I right-click > Properties there’s no MPIO tab.

  1. How do I get the 2nd server to show only one device in Device Manager and have the MPIO tab?
  2. What are the optimal settings for the MD3200 in these windows in the attached images? a. Fail Over Only b. Round Robin with subset c. Least Queue Depth d. Weighted Paths.
  3. Do I need to use the MPIO module installed by Server Manager?

FileServer Cluster with 2 HP P4500


Hello Guys,

I have a case and I need a solution from you. I need to deploy two clusters: one is a Hyper-V cluster and the second is a File Server cluster (with SOFS).

I know how to deploy and configure a cluster, but I don't know which hardware I need to use and what the best solution is for me in this case.

I need to use 2 x HP P4500, use their local hard drives as shared disks, and build my Scale-Out File Server on them as storage for Hyper-V virtualization. Both clusters must be highly available: if one File Server node or one Hyper-V node goes down, the second File Server node and Hyper-V node must keep working without downtime.

For now I don't have any hardware, but I can test any software solution you suggest in my virtual lab.

Thank you in advance, and sorry for my "Good" English :)

SQL 2012 Failover Cluster - unable to start because of 'Network Name' failed.


Hi all,

Running a 2012R2 Failover Cluster with SQL 2012. I'm unable to start the SQL 2012 Cluster Role because of the following error;

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      1069
Description:
Cluster resource 'SQL Network Name (SCSQLCL01)' of type 'Network Name' in clustered role 'SQL Server (VMM)' failed.

Failover cluster manager shows the following;

 

Observations thus far;

  • Passes all cluster validation tests (no issues)
  • Am sometimes seeing Kerberos errors in the log for both cluster members, but it's not consistent and I cannot pin down the cause;

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server scsqlcl01-2$. The target name used was MSServerClusterMgmtAPI/SCSQLCL01CORE.service.local.

  • The cluster computer object has been granted permissions on the cluster
  • All computer objects are created, and DNS entries are present.
  • It sometimes "just works". It comes online without a hitch and I can communicate with the cluster name using the SQL instance no problems

Any help would be appreciated.

Thanks.

 

 

 

 

Should you Optimize the Quorum Disk? (Defragment and Optimize Drives - Disks to Select)


Should you Optimize the Quorum Disk? 

Scenario:

Failover Cluster Server Windows 2012 R1 - 2 nodes - 1 x Quorum Disk.

Regards,

K


Kathy

Senior IT Support Analyst



The wrong diskette is in the drive


Hello all,

I have a strange issue occurring on my Windows 2008 R2 Enterprise 2-node cluster. Both machines are physical.

One of the LUNs doesn't work as intended on one of those nodes.

To explain it better: on the node that is working OK, I can see the disk's information as intended (see next image).

And when I move this resource to the other node, this is what you see...

In Windows Explorer I can see the disk and navigate through it. However, when I run a validation report it fails on this disk with the error:

“The wrong diskette is in the drive. Insert  (Volume Serial Number: ) into drive . (Exception from HRESULT: 0x80070022)”

Can anyone help me with this?

Regards,

Arestas



Cluster CSV had errors, failed over, I see two volumes appear under CSV one is "unknown"


So, we've been having issues with one of our clusters. Yesterday in the evening when no one was working it seems like a bunch of VMs went down. I found some errors in a couple event logs that show it seems the CSV failed but I can't find any indication as to why. My storage appliance has no record of any problems at that time, and I can't find any other possible reasons apart from a problem within the cluster.

All six nodes are running up to date Server 2012 R2, and are Managed by SCVMM 2012 R2 running off a virtual machine hosted by another cluster. My storage is a Tegile ZEBI unit, and I've thin provisioned 20TB of disk space. Disk is accessed by iSCSI on separate NICs and separate switches from other normal cluster/VM traffic.

Below are the errors, and a screenshot of an "unknown" volume listed under my CSV, seems odd? In cluster Failover manager, under storage\Disks after selecting my CSV, in the bottom pane I see two volumes listed:

In cluster manager, I found this error:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          2014-12-16 6:41:33 PM
Event ID:      5120
Task Category: Cluster Shared Volume
Level:         Error
Keywords:      
User:          SYSTEM
Computer:      CLUSTERHOST4.DOMAIN.INTERNAL
Description:
Cluster Shared Volume 'Volume 1' ('CSV') has entered a paused state because of '(c000000b5)'. All I/O will temporarily be queued until a path to the volume is reestablished.


I went to the node who owned the CSV, and in the event log I found this error:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          2014-12-16 6:48:22 PM
Event ID:      1230
Task Category: Resource Control Manager
Level:         Error
Keywords:      
User:          SYSTEM
Computer:      CLUSTERHOST1.DOMAIN.INTERNAL
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'CSV' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.


Then this error:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          2014-12-16 6:59:39 PM
Event ID:      1146
Task Category: Resource Control Manager
Level:         Critical
Keywords:      
User:          SYSTEM
Computer:      CLUSTERHOST1.DOMAIN.INTERNAL
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.


Then this error:

Log Name:      System
Source:        Microsoft-Windows-Ntfs
Date:          2014-12-16 7:01:03 PM
Event ID:      140
Task Category: None
Level:         Warning
Keywords:      (8)
User:          SYSTEM
Computer:      CLUSTERHOST1.DOMAIN.INTERNAL
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: VirtualMachines, DeviceName: \Device\HarddiskVolume7.
(A device which does not exist was specified.)




Event 6, SMBWitnessClient critical event through DirectAccess


I have a Server 2012 R2 Failover Cluster with 2 nodes and a few clustered disk volumes for shares. I also use a Disk Witness for quorum. Everything has been working fine, except that when I connect through DirectAccess my event log gets hammered with the Event 6, SMBWitnessClient critical event.

Witness Client failed to find a Witness Server for NetName "my cluster role dns record" with Error (element not found).

Any ideas why I see this through DA?  It doesn't seem to cause issues and I can still connect, just the never ending events.

Production Client Access Point of SQL instance is updated with Backup IP and Production IP


Hi All, 

I have a two-node Windows 2012 Enterprise failover cluster with one SQL 2008 R2 instance.

There are three networks configured in the cluster 

1. Production

2. Backup

3. Heartbeat

SQL instance is accessible through the production IP and production Name (SQLProd).

As part of backing up the database through the BUR network, I have:

1. Added a client access point with the name SQLBkp and a backup network IP.

2. Tried accessing the instance through the BUR IP/name, but this failed.

3. Opened the properties of the production name (SQLProd) and added the BUR IP on the dependency tab.

4. After configuring the dependency, the instance was accessible through both the production and BUR IPs.

5. The issue now is that in DNS the production name (SQLProd) is registered with both the production IP and the BUR IP.

6. As per the blog below, I verified the cluster parameter "RegisterAllProvidersIP" of the network name resources, but it is already set to 0.

http://blogs.msdn.com/b/sambetts/archive/2014/02/04/multi-subnet-clustered-sql-registerallprovidersip-sharepoint-2013.aspx

Please suggest how to fix this issue.

Thanks in advance for your time.

Shaji 

New SMB shared folder cannot be created. The object already exists.

Hi, I tried to create a new shared folder on a drive and got the message "New SMB shared folder cannot be created. The object already exists." This is a Windows 2008 cluster. The log has these entries:
[RCM] rcm::RcmApi::CreateResource: ERROR_OBJECT_ALREADY_EXISTS(5010)' because of 'Resource already exists.'
[API] s_ApiCreateResource: ERROR_OBJECT_ALREADY_EXISTS(5010)' because of 'g_rcm->rcmApi->CreateResource( pGroup, name, lpszResourceType, dwFlags )'

Any help is much appreciated.

Disk not show for cluster


Hi,

I am trying to create a SQL cluster at the guest level. I have two Hyper-V servers connected directly to EMC storage through fiber.

I have created three LUNs and mapped them to both Hyper-V hosts.

I can see the LUNs presented to both Hyper-V hosts. When I create a VHDX on the first Hyper-V host and store it on the LUN shared between both, I can see the VHDX only on one server; on the other server I can only see that the free space is decreasing.

My question is: how do I make this VHDX visible to both Hyper-V servers so I can attach it to the virtual machines to create the cluster?

Multi-site cluster with one node in each of two sites


I think I know what the answer will be, but my question is: which is the best quorum configuration? I have a 2-node cluster with one node in each of two sites and a replicated SAN between the sites. I do not have a third site, nor are we using Azure.

I'm leaning towards Node and Disk Majority.  Being an old school cluster admin, in the old days there were only quorum drives.  How is Node and Disk Majority better than No Majority: Disk Only given that we could have a shared volume via SAN replication?


Thanks,
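The difference between the two quorum modes in the question can be sketched as simple vote arithmetic. This is a simplified static model under stated assumptions: every node and the disk witness carry one vote, a node counted as up can reach the disk when the disk is online, and dynamic quorum adjustments are ignored. The function names are illustrative only.

```python
def node_and_disk_majority(nodes_up, total_nodes, disk_online):
    """Quorum holds when more than half of all votes (nodes + 1 disk witness) are present."""
    total_votes = total_nodes + 1
    votes_present = nodes_up + (1 if disk_online else 0)
    return votes_present > total_votes // 2

def disk_only(nodes_up, disk_online):
    """Quorum holds only while the disk is online and at least one node is up."""
    return disk_online and nodes_up >= 1

# Compare both modes for a 2-node cluster across node and disk failures.
for nodes_up in (2, 1):
    for disk_online in (True, False):
        print(
            f"nodes_up={nodes_up} disk_online={disk_online} "
            f"node+disk={node_and_disk_majority(nodes_up, 2, disk_online)} "
            f"disk_only={disk_only(nodes_up, disk_online)}"
        )
```

In this model, Node and Disk Majority keeps quorum with both nodes up even if the disk is lost, whereas No Majority: Disk Only loses quorum whenever the disk is unavailable, which makes the disk a single point of failure.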

KB3093571 doesn't do what it was originally stated to do and now KB looks like it is pulled?

I downloaded Hotfix KB3093571 last week hoping to gain the ability to replicate VMs that were utilizing shared VHDX.  After installing the hotfix I still receive an error message via Failover Cluster manager that I cannot replicate VMs that are using shared VHDX.  I thought that is what this hotfix was supposed to enable, and now the original KB article looks like it has been pulled.  Can anyone provide some insight as to what is happening here? 

Cluster Aware Updating - Possible owners causes drain failure


Hello all,

I have a question regarding the possible owners that a VM can have and Cluster Aware Updating. We have several servers that cannot be moved between Hyper-V hosts, because high availability is configured in the application the VM is running, or because moving is not supported for the application (Lync HA, Exchange DAG, etc.). We configured the possible owner of these VMs through SCVMM to be only a certain node. However, we want to use Cluster Aware Updating, and when we run it, the process fails. Of course there is no way to drain a role that is fixed on a cluster node, but we shut these servers down and had hoped that the roles would then not need to be drained. This was not the case, and the update process failed.

The workaround was to set the possible owners to several other hosts as well (the servers were shut down, so no problem). This was only for a couple of VMs on this cluster, but we have clusters that have many more of these types of servers, so we want to see if there is another solution.

Is there any way to make sure that Cluster Aware Updating does not require servers that are shut down to be drained to another node? Or any other brilliant idea? :-) I found out that you can do a "forced" drain through PowerShell, but I do not think that is a good solution, and we cannot apply it via Cluster Aware Updating. Anyone?


Resource Cluster Exceed Threshold


We are facing an issue on a Hyper-V 2012 R2 Failover Cluster.

A component on the server did not respond in a timely fashion. This caused the cluster resource 'Virtual Machine Configuration "ABC"(resource type 'Virtual Machine Configuration', DLL 'vmclusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource.

Event IDs 1230 and 1146 are reported.

Any ideas, please?

WSFC - Unable to successfully cleanup. An error occurred while creating cluster


Hello everyone,

I am trying to set up Windows Server 2012 clustering for SQL Server 2012 clustering. I created 3 VMs (1 DC and 2 server nodes), and within the DC I installed an iSCSI target and created an iSCSI SAN.

I provided the minimum required NICs and made the setup for 2-node clustering. I am trying to create the cluster with the administrator account only.

Cluster validation is successful and shows the setup is ready for clustering, but while creating the cluster I encounter an error at Forming cluster....Unable to successfully cleanup....An error occurred while creating cluster......to troubleshoot run validate the cluster

But the validation shows the setup is ready for clustering. In some blogs I have seen advice to create a cluster object and to give admin privileges to the account trying to create the cluster, but I am already using the administrator account.

If anyone has an idea about this, please help me.

Thanks in advance,

Bhargava K


Hyper-V Error


Hi,

I got a lot of errors in Hyper-V Cluster. Please help.

Event ID: 1254

Clustered role 'Archibus' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. 

No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

Event ID: 1205

The Cluster service failed to bring clustered role 'Archibus' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Event ID : 1069

Cluster resource 'Virtual Machine Archibus' of type 'Virtual Machine' in clustered role 'Archibus' failed. The error code was '0x2' ('The system cannot find the file specified.').

Regards,

Ruel
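The wording of event 1254 (a configured number of failover attempts within a failover period) can be pictured as a count of failures inside a trailing time window. The sketch below is a rough model under stated assumptions, not the Cluster service's actual algorithm. The function name, the parameter names, the sample dates and the choice of a strict "greater than" comparison are all hypothetical.

```python
from datetime import datetime, timedelta

def exceeds_failover_threshold(failure_times, threshold, period_hours, now):
    """True when more than `threshold` failures fall inside the trailing period.

    Assumption: the role is left failed once the count goes above the
    threshold; the real boundary behaviour may differ.
    """
    window_start = now - timedelta(hours=period_hours)
    recent = [t for t in failure_times if window_start <= t <= now]
    return len(recent) > threshold

# Made-up sample data: three failures within the last hour.
now = datetime(2015, 1, 1, 12, 0)
failures = [now - timedelta(minutes=m) for m in (5, 20, 45)]

print(exceeds_failover_threshold(failures, threshold=2, period_hours=6, now=now))
print(exceeds_failover_threshold(failures[:2], threshold=2, period_hours=6, now=now))
```

In other words, event 1254 is a consequence of the repeated failures: the underlying cause is the error in event 1069 ('The system cannot find the file specified.'), which has to be resolved before the role is brought online manually.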


How to upgrade hardware on Hyper-v cluster nodes

I have a Windows 2012 R2 cluster with four nodes running Hyper-V. I would like to upgrade the four nodes to new hardware. Can I remove a node from the cluster, rebuild the server on new hardware and then join the node back to the original cluster?