Network Uptime

We upgraded the firmware on some network devices during last month's maintenance window. Before that, they had some impressive uptime:

Firewall Uptime

Switch Uptime

The devices are configured with HA redundancy, so the rolling firmware upgrades went beautifully with minimal downtime during the route convergence and no manual intervention.

Tuesday, June 08 2010   |   Jeff Graves   |   Comments

High CPU on Cisco 4500 with MSFT NLB multicast cluster

Recently, we were alerted to higher-than-normal CPU utilization on some of our core Cisco Catalyst 4507 switches running IOS 12.2. Using Cisco's CPU troubleshooting doc, I was able to narrow down the source to the Cat4K Mgmt LoPri process. From there, issuing a "sh platform health" command showed it was the K2CpuMan Review process, meaning packets were being forwarded by the CPU. To find out which queue, we issued the "sh platform cpu packet statistics" command. That showed the L3 Fwd Queue was much higher than normal.

By creating a CPU span and monitoring the traffic with Network Monitor 3.3, we could see that all of the traffic destined for VIPs in our 2008 NLB clusters was hitting the CPU. I checked the configuration to ensure it matched the Catalyst and MSFT NLB example on Cisco's site, which it did. We were using the multicast NLB configuration as explained in the document. I set up a test NLB cluster to play with the settings and figure out why cluster-bound packets were hitting the CPU. What I found was in relation to this section:

However, since the incoming packets have a unicast destination IP address and multicast destination MAC the Cisco device ignores this entry and process-switches each cluster-bound packets. In order to avoid this process switching, insert a static mac-address-table entry as given below in order to switch cluster-bound packets in hardware.

mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4

Note: For Cisco Catalyst 6000/6500 series switches, you must add the disable-snooping parameter. For example:

mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4 disable-snooping

The disable-snooping parameter is essential and applicable only for Cisco Catalyst 6000/6500 series switches. Without this statement, the behavior is not affected.

I double and triple checked that our switches had the static MAC entry in their CAM tables, and they did. So, I reconfigured my test cluster from the ground up and found that cluster-bound packets only hit the CPU AFTER this command was entered. When I removed this command from our production switches, CPU dropped 30-40% instantly. This seems to contradict what Cisco has posted in their example.

There was no adverse effect or downtime from removing this command. Both cluster nodes are connected locally to the switch, however, and this command may be necessary if an NLB node is connected to a down-level switch. Furthermore, "sh int stats" shows that no packets are switched by the "processor."

Wednesday, May 26 2010   |   Jeff Graves   |   Comments

SCVMM 2008 R2 Installation defaults and Self Service Portal

SCVMM is an excellent product for managing your Hyper-V environment. The Self Service Portal (SSP), a component of the SCVMM install, allows end users to manage and deploy VMs remotely. However, be sure to read the fine print when installing.

During the installation process, you will be prompted to select what ports the VMM Agent will use when communicating with the Hyper-V host. The default ports WinRM and BITS use are 80 and 443 respectively. If you plan on running the Self Service Portal from the same host system, you will either need to change the ports the VMM Agent uses or change the ports the Self Service Portal uses.

Since browsers and IIS always default to 80 and 443 for HTTP and HTTPS, I would recommend making the change to the VMM Agent. Port 8080 for the VMM Agent control port (WinRM) and 8443 for the VMM Agent data port (BITS) are nice alternatives. Note that using a different IP for the SSP is NOT an option, as WinRM and BITS will self-configure to listen on all IP addresses thereby hijacking the ports.

A quick note: changing the default ports is recommended if you are planning on running ANY website on the same box. For instance, we had initially installed SCVMM on the same box running Operations Manager. It wasn't until our VM migrations began failing that we realized the default installation of SCVMM was conflicting with the Operations Manager Console, which was also running on the same system.

Lastly, you may not even receive an error of any type when having this issue - rather, the SSP simply won't install. You may see behavior similar to this:

http://social.technet.microsoft.com/Forums/en-US/virtualmachinemanager/thread/bba52f08-7b95-4a74-9c9b-ceaf0499e29c/#1dbef478-f896-48e4-af4e-b455d120c10b

Thursday, April 08 2010   |   Jeff Graves   |   Comments

Error 0x800423f3 backing up Hyper-V VM with DPM 2007

One error you may receive while backing up a Hyper-V VM with DPM 2007 is the generic "DPM encountered a retryable VSS error. (ID 30112 Details: Unknown error (0x800423f3) (0x800423F3))." There are a couple of different things that could cause this error. The two most common are:

1. You are running a Windows Server 2008 SP1 Hyper-V host and do not have the appropriate pre-requisites installed. Specifically, the hotfix described in KB959962.

http://technet.microsoft.com/en-us/library/dd347840.aspx

2. There is a VSS error of some kind inside the VM causing the Hyper-V VSS writer to fail.

One of the most common VSS errors I have seen inside a Server 2008 VM is event ID 8193:

Log Name:      Application
Source:        VSS
Date:          <DateTime>
Event ID:      8193
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      <ComputerName>
Description:
Volume Shadow Copy Service error: Unexpected error calling routine ConvertStringSidToSid.  hr = 0x80070539.

Operation:
   OnIdentify event
   Gathering Writer Data
Context:
   Execution Context: Shadow Copy Optimization Writer
   Writer Class Id: {4dc3bdd4-ab48-4d07-adb0-3bee2926fd7f}
   Writer Name: Shadow Copy Optimization Writer
   Writer Instance ID: {3586f039-f2f9-4dcb-a46e-3aaa20f1a2fa}
 
This error can be solved by following the instructions in this blog post. Specifically, perform these steps outlined in KB947242:
  • Delete unresolvable SIDs in the 'Administrators' group on the VM.
  • Open regedit and locate 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList'
  • Under the ProfileList subkey, delete any subkey that is named SID.bak

This has resolved the issue in most cases where I have seen that DPM error occur. Some other suggested troubleshooting tips that have solved this problem for me in the past:

  • Re-install the Integration Components and reboot the VM
  • Resolve issues for any VSS writers not listed as stable from the "vssadmin list writers" command on the host or inside the VM. You can restart the following services to resolve some problems
    • System Writer - Cryptographic Services service (doesn't affect the system)
    • IIS Metabase Writer - IIS Administrative service (will reset all of IIS)
    • SqlServerWriter - SQL VSS service (doesn't affect SQL)
    • WMI Writer - Windows Management Instrumentation service (WMI will be unavailable during the service restart)
    • BITS Writer - BITS service (BITS will be unavailable during the service restart)
  • Re-register VSS components as described in KB940032
  • Ensure there is sufficient space inside the VM for shadow copies
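
To spot writers that are not stable without reading the whole listing by eye, something like the following can help. This is only a rough sketch in Python: it assumes the text printed by "vssadmin list writers" contains "Writer name: '...'" and "State: [n] ..." lines, and the function name unstable_writers is my own invention.

```python
import re


def unstable_writers(vssadmin_output):
    """Return (writer name, state) pairs for writers whose state is not Stable.

    Assumes the "Writer name: '...'" / "State: [n] ..." layout; adjust the
    patterns if your output differs.
    """
    results = []
    current = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        name_match = re.match(r"Writer name: '(.*)'", line)
        if name_match:
            current = name_match.group(1)
            continue
        state_match = re.match(r"State: \[\d+\] (.*)", line)
        if state_match and current is not None:
            state = state_match.group(1).strip()
            if state.lower() != "stable":
                results.append((current, state))
            current = None
    return results
```

Save the command output to a text file (on the host or inside the VM), read it into a string, and pass it to unstable_writers to get the short list of writers that need attention.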
     
Thursday, April 08 2010   |   Jeff Graves   |   Comments

Using RPC custom port range with Windows Firewall

I ran into an interesting issue today. We use a dedicated port range for RPC connections through firewalls per this Microsoft article. Doing so allows RPC to work through dedicated hardware firewalls. We also enable the local Windows firewall on several boxes, as this protects systems that are not behind a dedicated piece of hardware and shields them from other systems behind the same dedicated firewall.

While using Shavlik NetChk Configure to scan systems for compliance, I noticed some inconsistencies which I traced back to a firewall issue on the server being scanned. The scans perform some of the checks over RPC. I confirmed that Remote Administration had been enabled using this command:

netsh firewall set service REMOTEADMIN enable

However, netstat would show the connection in a SYN_SENT state on a port in the dedicated RPC range. Buried in this TechNet article, I found the reason:

Remote Administration

Adds TCP ports 135 and 445 to the exceptions list. Also adds Svchost.exe and Lsass.exe to the exceptions list to allow hosted services to open additional, dynamically-assigned ports, typically in the range of 1024 to 1034. This setting allows a computer to be remotely managed with administrative tools, such as the Microsoft Management Console (MMC) and Windows Management Instrumentation (WMI). It also allows a computer to receive unsolicited incoming Distributed Component Object Model (DCOM) and remote procedure call (RPC) traffic.

It seems that when setting a custom range of ports for RPC via the HKLM\Software\Microsoft\RPC\Internet key, it "breaks" the Remote Administration firewall rule in the Windows Firewall. This was tested on a Server 2003 R2 SP2 system, but I suspect similar issues would apply to Server 2008.
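
When chasing this down, it helps to confirm that the port stuck in SYN_SENT really falls inside the custom RPC range rather than the default dynamic range the rule expects. Here is a small Python sketch of that check. It assumes the range is written as strings such as "5000-5100" (which is how I understand the values under that key to be stored), and the range and function names below are made up for illustration.

```python
def parse_port_ranges(entries):
    """Turn strings such as '5000-5100' or '5200' into (low, high) tuples."""
    ranges = []
    for entry in entries:
        text = entry.strip()
        if not text:
            continue
        low, _, high = text.partition("-")
        low_port = int(low)
        # A single port with no dash is treated as a one-port range.
        high_port = int(high) if high else low_port
        ranges.append((low_port, high_port))
    return ranges


def port_in_ranges(port, ranges):
    """True if the port falls inside any of the (low, high) ranges."""
    return any(low <= port <= high for low, high in ranges)
```

For example, with a hypothetical custom range of 5000-5100, port_in_ranges(5050, parse_port_ranges(["5000-5100"])) is True, while a port such as 1030 from the default range is not.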

Wednesday, January 06 2010   |   Jeff Graves   |   Comments

Correcting SFC (System File Checker) errors

We recently began using Microsoft's built-in SFC (System File Checker) as part of our FIM (File Integrity Monitoring) solution for PCI (Payment Card Industry) compliance. This great feature will compute hashes of core system files and compare those against originals looking for differences. If any are found, it can automatically replace those files with the original. The best part is that it incorporates all system updates into these checks so you can rest easy knowing that the checks are being performed against the latest, patched system files.

In most cases, this runs without intervention, but every now and then it needs a little help correcting the problems it encounters. If a scan (via sfc /scannow) reports unfixable errors (e.g. "Windows Resource Protection found corrupt files but was unable to fix some of them."), you can use the log file at C:\Windows\Logs\CBS\CBS.log to determine which file(s) could not be fixed. Microsoft's KB928228 article has great instructions on how to analyze this file. The basic gist is to run the following command and view the details only:

findstr /c:"[SR]" %windir%\logs\cbs\cbs.log >sfcdetails.txt

Search the resulting file for the phrase "cannot repair" - this should give you the file(s) that SFC is having problems replacing. To fix this, replace these file(s) manually with trusted versions (either from source media or from other working systems with same edition, bitness, and patch-level). It is probably best to review the text in the CBS.log file surrounding that entry to be certain you are replacing with the appropriate versions.
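
If you have many systems to check, the search can be scripted. Below is a minimal Python sketch that keeps only the [SR] lines mentioning "cannot repair" (compared case-insensitively, which is my own choice). The function name find_unrepairable is hypothetical, and it assumes you already produced sfcdetails.txt with the findstr command above.

```python
def find_unrepairable(lines):
    """Return the [SR] log lines that mention 'cannot repair'."""
    hits = []
    for line in lines:
        # Only SFC entries are tagged [SR]; ignore everything else in the log.
        if "[SR]" in line and "cannot repair" in line.lower():
            hits.append(line.rstrip("\r\n"))
    return hits
```

To use it, open sfcdetails.txt in text mode and pass the file object (or a list of its lines) to find_unrepairable, then review the surrounding entries in CBS.log for each hit before replacing anything.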

In very rare cases, you may not find the phrase "cannot repair" in the log file. In fact, you will find an entry to the contrary: "Verify and Repair Transaction completed. All files and registry keys listed in this transaction have been successfully repaired" at the end of the log file, but the SFC program will still report that it found unfixable files. In these cases, I have found that renaming the file(s) specified in the logs and re-running SFC will correct the issue. You may need to take ownership, change permissions, or boot into safe mode to rename the suspect file(s) depending upon the system file in question.

Tuesday, December 29 2009   |   Jeff Graves   |   Comments

Configure source ip for Nessus daemon on Windows

Nessus from Tenable Network Security is an invaluable tool for vulnerability scanning. As a Windows-only shop, we were very pleased that Nessus would run on a Windows platform. For our configuration, we have a server sitting outside of our firewall with multiple public IP addresses. We configured firewall policies for the system's primary IP address to allow it the necessary access into our environment, and to allow access from our management subnet to the device. That means we needed a different IP address to use for scanning so that scans are subject to the standard rules that apply to all external traffic.

In *nix environments, the Nessus daemon has a command line switch that forces the scanner to use a specific source IP for scans (this is different than the "listen address" which is used by remote clients to connect to the scanner - that setting can be configured in nessusd.conf). Unfortunately, the nessus-service.exe called by the Windows Service does not pass command line parameters to the nessusd process.

Not to worry, our old friend srvany comes to the rescue (note that srvany only works on Windows 2000/2003/XP). Perform the following steps:

  1. Stop the Nessus service
  2. Copy the srvany.exe executable to C:\Program Files\Tenable\Nessus
  3. Modify the ImagePath value under HKLM\SYSTEM\CurrentControlSet\Services\Tenable Nessus to C:\Program Files\Tenable\Nessus\srvany.exe
  4. Add a Parameters key under HKLM\SYSTEM\CurrentControlSet\Services\Tenable Nessus
  5. Add a REG_SZ value named Application with the following value (replace <ip_address> with the IP you want the scanner to use for scans):
    C:\Program Files\Tenable\Nessus\nessusd.exe -S <ip_address>
  6. Start the Nessus service.

Happy scanning!

Tuesday, November 17 2009   |   Jeff Graves   |   Comments

Sound Blaster Wireless Music Device

It's been quite some time since this device was released.

Sound Blaster Wireless Music

I picked one up back in 2003 and it has been a little workhorse ever since. I love the fact that it comes with a remote with a screen that you can use to flip through albums and songs. Unfortunately, it was discontinued before its time. Information about it is scarce, but here are a couple of tricks in case you still have yours or pick up a used one on eBay.

First, upgrade to the latest firmware. This is an unreleased version that was posted by a Creative tech in the forums.

Second, you can run the Music Server as a service using srvany. Call the service "SBWMSvr" when installing it. You will need to add an additional string value under HKLM\SYSTEM\CurrentControlSet\Services\SBWMSvr\Parameters called AppDirectory pointing to "C:\Program Files\Creative\Shared Files" and set the service to run as an administrative account in order for it to work. This eliminates the need for you to log in to the system running the application (e.g. if you have a home server). Be sure to remove the link from your startup menu.

Thursday, October 01 2009   |   Jeff Graves   |   Comments

DPM Daily Maintenance Script

We recently completed a project to move over 300 servers from our old backup infrastructure to a brand new disk-based DPM 2007 solution. We have been very pleased with DPM 2007 thus far, but are finding that it requires a fair amount of hand holding in the mornings to kick off failed jobs, increase disk allocations, and perform consistency checks. Unfortunately, the DPM console can only be loaded on the DPM server itself, and it cannot connect to a remote DPM server. That means logging in via RDP to each DPM server and addressing the alerts. After a few weeks of doing this by hand, we added them to our SCOM 2007 server, which helped consolidate the alerts to a single interface, but we found we could not modify disk allocations via SCOM.

So I sat down and hashed out the DPM Daily Maintenance Script. This PowerShell script queries the database for alerts and addresses the four most common: replica disk threshold exceeded, recovery point volume threshold exceeded, replica is inconsistent, and recovery point creation failed. The script takes 4 optional parameters:

replicaIncreaseRatio - Ratio by which to increase the existing replica disk size (e.g. 1.1 increases by 10%; this is the default if nothing is specified)
scIncreaseRatio - Ratio by which to increase the existing recovery point volume size (e.g. 1.1 increases by 10%; this is the default if nothing is specified)
replicaIncreaseSize - Fixed value to increase replica disk (e.g. 1GB)
scIncreaseSize - Fixed value to increase recovery point volume (e.g. 1GB)

The script will first query the database for alerts, and then sorts them alphabetically and by alert type. This means that if a replica became inconsistent because the replica disk threshold was exceeded or if a recovery point creation failed because the recovery point volume threshold was exceeded, the script will increase the size of the volume before re-running the job. Also, for replica disks, the script will actually query the original datasource and resize the replica disk to the current workload's size plus the ratio or fixed amount specified in the script. This ensures that the replica disk is extended to the proper amount during the first pass in cases where a large amount of data is added to the workload.
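
To make the sizing rule concrete, here is the arithmetic as a tiny Python sketch. This is not the PowerShell from the script itself, just an illustration of "workload size plus the ratio or fixed amount"; the function name and the choice to let a fixed amount take precedence over the ratio are assumptions of mine.

```python
def target_replica_size(datasource_bytes, ratio=1.1, fixed_bytes=None):
    """Size the replica from the current workload size, not the old replica size.

    If a fixed amount is given it is added to the workload size; otherwise
    the workload size is multiplied by the ratio (1.1 means 10% headroom).
    """
    if fixed_bytes is not None:
        return datasource_bytes + fixed_bytes
    return int(datasource_bytes * ratio)
```

So a workload that has grown to 200 units with a ratio of 1.5 would get a 300-unit replica in one pass, regardless of how small the replica was before.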

We have been running this script on 6 DPM servers for about 6 weeks now and I have to say it has virtually eliminated the daily maintenance (I was on vacation for 2 weeks during that time and DPM happily hummed along without any intervention, self-healing twice per day). We still use SCOM to monitor the alerts and manually check for replicas that are constantly becoming inconsistent or recovery point creations that are consistently failing, and address those by hand. We have set up a scheduled task that runs twice per day using the following command line:

C:\Windows\system32\windowspowershell\v1.0\powershell.exe -PSConsoleFile "C:\Program Files\Microsoft DPM\DPM\bin\dpmshell.psc1" -command ".'C:\admin\DailyMaintenance.ps1'" >> C:\admin\DailyMaintenance.log

DailyMaintenance.zip

There are a few 3rd party products that can help with these same alerts, and Microsoft is working on making our lives easier with DPM v3, but in the meantime, this should take some of the burden off of the sys admins.

Thursday, September 17 2009   |   Jeff Graves   |   Comments

Partition Alignment

Squeezing every ounce of performance out of your disk array is critical in IO intensive applications. Most times, this is simply an after-thought. However, doing a little leg-work during the implementation phase can go a long way to increasing the performance of your application. Aligning partitions is a great idea for SQL and virtualized environments - these are the places you will see the most benefit.

The concept of aligning partitions is actually quite simple and applies to SANs and really any disk array alike. If you are using RAID in any capacity, then aligning disk partitions will help increase performance. It is best illustrated by the following graphics, borrowed from http://www.vmware.com/pdf/esx3_partition_align.pdf (this is a great read; it is specific to VMware environments, but the same concepts apply).

Using unaligned partitions in a virtual environment, you can see that a read could ultimately result in 3 disk accesses to the underlying disk subsystem:

 

By aligning partitions properly, that same read results in just 1 disk access:

While these graphics are Virtual Machine and VMware specific, the same is true for Hyper-V and SQL (except remove the middle layer for SQL). In order for partition alignment to work properly, you need to ensure that the lowest level of the disk sub-system has the highest segment size (also referred to as stripe size). Depending upon your RAID controller or SAN, this could default to as low as 4K or as high as 1024K. I won't cover what differences in segment sizes mean for performance, that's an entirely different discussion, but generally speaking defaults are usually 64K or 128K. The basic idea behind a proper stripe size is that you want to size it so that most of your reads/writes can happen in 1 operation.

From there, you need to ensure that your block or file allocation unit size is set properly - ideally smaller than or the same size as the segment size, and such that the segment size is a multiple of it. Lastly, you should then set the offset to the same as the segment size. By default, Windows 2003 will offset by 31.5K, Windows 2008 by 1024K, and VMware VMFS defaults to 128.
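
The arithmetic is easy to sanity-check. The following Python sketch flags a layout where the partition offset does not land on a segment boundary, or where the allocation unit does not divide evenly into the segment. It is only an illustration: the function name is made up, the "divides evenly" rule is my reading of the guidance above, and it treats any whole multiple of the segment size as an aligned offset (which is broader than "same as the segment size").

```python
KIB = 1024


def check_layout(offset_bytes, segment_bytes, alloc_unit_bytes):
    """Return a list of alignment problems (an empty list means none found)."""
    problems = []
    if offset_bytes % segment_bytes != 0:
        problems.append("offset is not a multiple of the segment size")
    if alloc_unit_bytes > segment_bytes:
        problems.append("allocation unit is larger than the segment size")
    elif segment_bytes % alloc_unit_bytes != 0:
        problems.append("allocation unit does not divide evenly into the segment size")
    return problems
```

With a 64K segment, the Windows 2003 default offset of 31.5K (32256 bytes) is reported as unaligned, while the Windows 2008 default of 1024K passes because 1024K is a whole multiple of 64K.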

Setting the segment size may or may not be an online operation - whether it can be done to an already configured array or has to be done during the initial configuration depends entirely on your RAID controller or SAN. Changing the offset and/or block size of a partition, however, is NOT an online operation. This means that all data will have to be removed from the partition, the offset configured, and the partition recreated. Prior to Windows 2008, this cannot be done to system partitions, so for Windows 2003 you would have to attach the virtual hard disk to another system, set the offset and format the partition, and then perform the Windows installation.

The following links provide detailed information about aligning partitions in both VMware and Windows. Consult your SAN or RAID controller documentation for setting or finding out the segment size.

Recommendations for Aligning VMFS Partitions

Disk Partition Alignment Best Practices for SQL Server

Monday, September 14 2009   |   Jeff Graves   |   Comments
