We upgraded the firmware on some network devices during last month's maintenance window. Before that, they had some impressive uptime:
The devices are configured with HA redundancy, so the rolling firmware upgrades went beautifully with minimal downtime during the route convergence and no manual intervention.
Recently, we were alerted to higher than normal CPU on some of our core Cisco Catalyst 4507 switches running IOS 12.2. Using Cisco's CPU troubleshooting doc, I was able to narrow down the source to the Cat4K Mgmt LoPri process. From there, issuing a "sh platform health" command showed it was the K2CpuMan Review process, meaning packets were being forwarded by the CPU. To find out which queue, we issued the "sh platform cpu packet statistics" command. That showed the L3 Fwd Queue was much higher than normal.
By creating a CPU span and monitoring the traffic with Network Monitor 3.3, we could see that all of the traffic destined for VIPs in our 2008 NLB clusters was hitting the CPU. I checked the configuration to ensure it matched the Catalyst and MSFT NLB example on Cisco's site, which it did. We were using the multicast NLB configuration as explained in the document. I set up a test NLB cluster to play with the settings and figure out why cluster-bound packets were hitting the CPU. What I found was in relation to this section:
However, since the incoming packets have a unicast destination IP address and multicast destination MAC the Cisco device ignores this entry and process-switches each cluster-bound packets. In order to avoid this process switching, insert a static mac-address-table entry as given below in order to switch cluster-bound packets in hardware.
mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4
Note: For Cisco Catalyst 6000/6500 series switches, you must add the disable-snooping parameter. For example:
mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4 disable-snooping
The disable-snooping parameter is essential and applicable only for Cisco Catalyst 6000/6500 series switches. Without this statement, the behavior is not affected.
I double and triple checked that our switches had the static MAC entry in the CAM tables, and they did. So, I reconfigured my test cluster from the ground up and found that cluster-bound packets only hit the CPU AFTER this command was entered. When I removed this command from our production switches, CPU dropped 30-40% instantly. This seems to contradict what Cisco has posted in their example.
There was no adverse effect or downtime from removing this command. Both cluster nodes are connected locally to the switch, however, and this command may be necessary if an NLB node is connected to a down-level switch. Furthermore, a "sh int stats" shows that no packets are switched by the "processor."
SCVMM is an excellent product for managing your Hyper-V environment. The Self Service Portal (SSP), a component of the SCVMM install, allows end users to manage and deploy VMs remotely. However, be sure to read the fine print when installing.
During the installation process, you will be prompted to select what ports the VMM Agent will use when communicating with the Hyper-V host. The default ports WinRM and BITS use are 80 and 443 respectively. If you plan on running the Self Service Portal from the same host system, you will either need to change the ports the VMM Agent uses or change the ports the Self Service Portal uses.
Since browsers and IIS always default to 80 and 443 for HTTP and HTTPS, I would recommend making the change to the VMM Agent. Port 8080 for the VMM Agent control port (WinRM) and 8443 for the VMM Agent data port (BITS) are nice alternatives. Note that using a different IP for the SSP is NOT an option, as WinRM and BITS will self-configure to listen on all IP addresses thereby hijacking the ports.
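Before picking alternative ports, it can help to confirm what is already taken on the host. The following is a minimal Python sketch, not part of the SCVMM tooling: the function names are made up for illustration, and it assumes that a failed bind means the port is already claimed (on some systems a failure can also mean you lack permission to bind a low port).

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    Assumption: a bind failure is treated as "port already in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def ports_in_use(ports, host="0.0.0.0"):
    """Return the subset of ports that could not be bound."""
    return [port for port in ports if not port_is_free(port, host)]


# Candidate ports from the text: the defaults and the suggested alternatives.
CANDIDATE_PORTS = [80, 443, 8080, 8443]
```

Calling ports_in_use(CANDIDATE_PORTS) on the server before installing gives a quick list of which of those four ports something else already holds.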
A quick note: changing the default ports is recommended if you are planning on running ANY website on the same box. For instance, we had initially installed SCVMM on the same box running Operations Manager. It wasn't until our VM migrations began failing that we realized the default installation of SCVMM was conflicting with the Operations Manager Console, which was also running on the same system.
Lastly, you may not even receive an error of any type when having this issue - rather, the SSP simply won't install. You may see behavior similar to this:
One error you may receive while backing up a Hyper-V VM with DPM 2007 is the generic "DPM encountered a retryable VSS error. (ID 30112 Details: Unknown error (0x800423f3) (0x800423F3))." There are a couple of different things that could cause this error. The two most common are:
1. You are running a Windows Server 2008 SP1 Hyper-V host and do not have the appropriate pre-requisites installed. Specifically, the hotfix described in KB959962.
http://technet.microsoft.com/en-us/library/dd347840.aspx
2. There is a VSS error of some kind inside the VM causing the Hyper-V VSS writer to fail.
One of the most common VSS errors I have seen inside a Server 2008 VM is Event ID 8193:
Log Name: Application
Source: VSS
Date: <DateTime>
Event ID: 8193
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: <ComputerName>
Description:
Volume Shadow Copy Service error: Unexpected error calling routine ConvertStringSidToSid. hr = 0x80070539.
This has resolved the issue in most cases where I have seen that DPM error occur. Some other suggested troubleshooting tips that have solved this problem for me in the past:
I ran into an interesting issue today. We use a dedicated port range for RPC connections through firewalls per this Microsoft article. Doing so allows RPC to work through dedicated hardware firewalls. We also enable the local Windows firewall on several boxes, as this protects them from any systems not behind a dedicated piece of hardware and from other systems behind the same dedicated firewalls.
While using Shavlik NetChk Configure to scan systems for compliance, I noticed some inconsistencies which I traced back to a firewall issue on the server being scanned. The scans perform some of the checks over RPC. I confirmed that Remote Administration had been enabled using this command:
netsh firewall set service REMOTEADMIN enable
However, netstat would show the connection in a SYN_SENT state on a port in the dedicated RPC range. Buried in this TechNet article, I found the reason:
Remote Administration: Adds TCP ports 135 and 445 to the exceptions list. Also adds Svchost.exe and Lsass.exe to the exceptions list to allow hosted services to open additional, dynamically-assigned ports, typically in the range of 1024 to 1034. This setting allows a computer to be remotely managed with administrative tools, such as the Microsoft Management Console (MMC) and Windows Management Instrumentation (WMI). It also allows a computer to receive unsolicited incoming Distributed Component Object Model (DCOM) and remote procedure call (RPC) traffic.
It seems that when setting a custom range of ports for RPC via the HKLM\Software\Microsoft\RPC\Internet key, it "breaks" the Remote Administration firewall rule in the Windows Firewall. This was tested on a Server 2003 R2 SP2 system, but I suspect similar issues would apply to Server 2008.
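The post stops at the diagnosis, so what follows is only one possible workaround, sketched as an assumption rather than something I verified: open each port of the custom RPC range explicitly in the Windows Firewall. The Python sketch below expands range strings in the "5000-5100" style used by the Ports value under that registry key and prints one legacy "netsh firewall add portopening" command per port. The function names, the rule label, and the example range are all hypothetical.

```python
def parse_port_ranges(entries):
    """Expand entries such as "5000-5100" or "5200" into a sorted port list."""
    ports = set()
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        low, _, high = entry.partition("-")
        start = int(low)
        end = int(high) if high else start
        if not 1 <= start <= end <= 65535:
            raise ValueError("invalid port range: %r" % entry)
        ports.update(range(start, end + 1))
    return sorted(ports)


def firewall_commands(ports, label="RPC dynamic"):
    """Build one 'netsh firewall add portopening' command per TCP port.

    The label is a hypothetical rule name chosen for this sketch.
    """
    return [
        'netsh firewall add portopening TCP %d "%s %d"' % (port, label, port)
        for port in ports
    ]


# Example range only; substitute the range configured on your own servers.
for command in firewall_commands(parse_port_ranges(["5000-5002"])):
    print(command)
```

Review the generated commands before running them; a wide range produces one firewall exception per port.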
We recently began using Microsoft's built-in SFC (System File Checker) as part of our FIM (File Integrity Monitoring) solution for PCI (Payment Card Industry) compliance. This great feature will compute hashes of core system files and compare those against originals looking for differences. If any are found, it can automatically replace those files with the original. The best part is that it incorporates all system updates into these checks so you can rest easy knowing that the checks are being performed against the latest, patched system files.
In most cases, this runs without intervention, but every now and then it needs a little help correcting the problems it encounters. If running a scan (via sfc /scannow) indicates there were unfixable errors found (e.g., "Windows Resource Protection found corrupt files but was unable to fix some of them."), you can use the log file under C:\Windows\Logs\CBS\CBS.log to determine which file(s) could not be fixed. Microsoft's KB928228 article has great instructions on how to analyze this file. The basic gist is to run the following command and view the details only:
findstr /c:"[SR]" %windir%\logs\cbs\cbs.log >sfcdetails.txt
Search the resulting file for the phrase "cannot repair" - this should give you the file(s) that SFC is having problems replacing. To fix this, replace these file(s) manually with trusted versions (either from source media or from other working systems with same edition, bitness, and patch-level). It is probably best to review the text in the CBS.log file surrounding that entry to be certain you are replacing with the appropriate versions.
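If you check many servers, the search for "cannot repair" can be scripted. This is a minimal Python sketch under a couple of assumptions: the function name is made up, the match is a plain case-insensitive substring test, and the caller supplies the lines (for example from sfcdetails.txt opened with whatever encoding your system wrote it in).

```python
def unrepaired_entries(lines, phrase="cannot repair"):
    """Return the lines that contain the phrase, ignoring case."""
    needle = phrase.lower()
    return [line.rstrip("\r\n") for line in lines if needle in line.lower()]


# Usage (file name taken from the findstr command above):
#     with open("sfcdetails.txt", errors="replace") as handle:
#         for entry in unrepaired_entries(handle):
#             print(entry)
```

An empty result does not prove the system is clean; it only means the phrase was not logged.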
In very rare cases, you may not find the phrase "cannot repair" in the log file. In fact, you will find an entry to the contrary: "Verify and Repair Transaction completed. All files and registry keys listed in this transaction have been successfully repaired" at the end of the log file, but the SFC program will still report that it found unfixable files. In these cases, I have found that renaming the file(s) specified in the logs and re-running SFC will correct the issue. You may need to take ownership, change permissions, or boot into safe mode to rename the suspect file(s) depending upon the system file in question.
Nessus from Tenable Network Security is an invaluable tool for vulnerability scanning. As a Windows-only shop, we were very pleased that Nessus would run on a Windows platform. For our configuration, we have a server sitting outside of our firewall with multiple public IP addresses. We configured firewall policies for the system's primary IP address to allow it the necessary access into our environment and to allow access from our management subnet to the device. That means we needed a different IP address to use for scanning so that scans are subject to the standard rules that apply to all external traffic.
In *nix environments, the Nessus daemon has a command line switch that forces the scanner to use a specific source IP for scans (this is different than the "listen address" which is used by remote clients to connect to the scanner - that setting can be configured in nessusd.conf). Unfortunately, the nessus-service.exe called by the Windows Service does not pass command line parameters to the nessusd process.
Not to worry, our old friend srvany comes to the rescue (note that srvany only works on Windows 2000/2003/XP). Perform the following steps:
Happy scanning!
It's been quite some time since this device was released.

I picked one up back in 2003 and it has been a little work-horse ever since. I love the fact it comes with a remote with a screen that you can use to flip through albums and songs. Unfortunately, it was discontinued before its time. Information is scarce out there about it, but here are a couple of tricks in case you still have yours or pick up a used one on eBay.
First, upgrade to the latest firmware. This is an unreleased version that was posted by a Creative tech in the forums.
Second, you can run the Music Server as a service using srvany. Call the service "SBWMSvr" when installing it. You will need to add an additional string value under HKLM\SYSTEM\CurrentControlSet\Services\SBWMSvr\Parameters called AppDirectory pointing to "C:\Program Files\Creative\Shared Files" and set the service to run as an administrative account in order for it to work. This eliminates the need for you to log in to the system running the application (i.e., if you have a home server). Be sure to remove the link from your startup menu.
We recently completed a project to move over 300 servers from our old backup infrastructure to a brand new disk-based DPM 2007 solution. We have been very pleased with DPM 2007 thus far, but are finding that it requires a fair amount of hand holding in the mornings to kick off failed jobs, increase disk allocations, and perform consistency checks. Unfortunately, the DPM console can only be loaded on the DPM server itself, and it cannot connect to a remote DPM server. That means logging in via RDP to each DPM server and addressing the alerts. After a few weeks of doing this by hand, we added them to our SCOM 2007 server, which helped consolidate the alerts to a single interface, but we found we could not modify disk allocations via SCOM.
So I sat down and hashed out a DPM Daily Maintenance Script. This PowerShell script queries the database for alerts and addresses the four most common: Replica disk threshold exceeded, Recovery Point Volume threshold exceeded, Replica is inconsistent, and Recovery Point creation failed. The script takes 4 optional parameters:
replicaIncreaseRatio - Multiplier applied to the existing replica disk size (e.g., 1.1 increases it by 10%; this is the default if nothing is specified)
scIncreaseRatio - Multiplier applied to the existing recovery point volume size (e.g., 1.1 increases it by 10%; this is the default if nothing is specified)
replicaIncreaseSize - Fixed value to increase replica disk (e.g., 1GB)
scIncreaseSize - Fixed value to increase recovery point volume (e.g., 1GB)
The script will first query the database for alerts, and then sorts them alphabetically and by alert type. This means that if a replica became inconsistent because the replica disk threshold was exceeded or if a recovery point creation failed because the recovery point volume threshold was exceeded, the script will increase the size of the volume before re-running the job. Also, for replica disks, the script will actually query the original datasource and resize the replica disk to the current workload's size plus the ratio or fixed amount specified in the script. This ensures that the replica disk is extended to the proper amount during the first pass in cases where a large amount of data is added to the workload.
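To make the sizing rule concrete, here is a small Python sketch of the arithmetic only. It is not the PowerShell script itself. It assumes that a fixed increase, when supplied, takes precedence over the ratio, and that a volume is never shrunk below its current size; neither detail is spelled out above, and the function name is hypothetical.

```python
GB = 1024 ** 3


def new_volume_size(current_size, workload_size=None, ratio=1.1, fixed_increase=None):
    """Return the target size in bytes for a replica or recovery point volume.

    current_size   -- size of the existing volume in bytes
    workload_size  -- current size of the protected data source, if known
                      (used for replica disks, as described in the text)
    ratio          -- multiplier, 1.1 means grow by 10% (the stated default)
    fixed_increase -- fixed number of bytes to add instead of using the ratio
                      (assumption: this wins over ratio when both are given)
    """
    # Assumption: start from whichever is larger so the volume never shrinks.
    base = current_size if workload_size is None else max(current_size, workload_size)
    if fixed_increase is not None:
        return base + fixed_increase
    return int(round(base * ratio))
```

For example, a 100 GB replica protecting a data source that has grown to 150 GB would be resized to roughly 165 GB with the default ratio, which is why a single pass is usually enough after a large data load.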
We have been running this script on 6 DPM servers for about 6 weeks now and I have to say it has virtually eliminated the daily maintenance (I was on vacation for 2 weeks during that time and DPM happily hummed along without any intervention, self-healing twice per day). We still use SCOM to monitor the alerts, and we manually check for replicas that are constantly becoming inconsistent or recovery point creations that are consistently failing and address those by hand. We have set up a scheduled task that runs twice per day using the following command line:
C:\Windows\system32\windowspowershell\v1.0\powershell.exe -PSConsoleFile "C:\Program Files\Microsoft DPM\DPM\bin\dpmshell.psc1" -command ".'C:\admin\DailyMaintenance.ps1'" >> C:\admin\DailyMaintenance.log
There are a few 3rd party products that can help with these same alerts, and Microsoft is working on making our lives easier with DPM v3, but in the meantime, this should take some of the burden off of the sys admins.
Squeezing every ounce of performance out of your disk array is critical in IO intensive applications. Most times, this is simply an after-thought. However, doing a little leg-work during the implementation phase can go a long way to increasing the performance of your application. Aligning partitions is a great idea for SQL and virtualized environments - these are the places you will see the most benefit.
The concept of aligning partitions is actually quite simple and applies to SANs and really any disk array alike. If you are using RAID in any capacity, then aligning disk partitions will help increase performance. It is best illustrated by the following graphics, borrowed from http://www.vmware.com/pdf/esx3_partition_align.pdf (this is a great read; it is specific to VMware environments, but the same concepts apply).
Using unaligned partitions in a virtual environment, you can see that a read could ultimately result in 3 disk accesses to the underlying disk subsystem:
By aligning partitions properly, that same read results in just 1 disk access:
While these graphics are Virtual Machine and VMware specific, the same is true for Hyper-V and SQL (except remove the middle layer for SQL). In order for partition alignment to work properly, you need to ensure that the lowest level of the disk sub-system has the highest segment size (also referred to as stripe size). Depending upon your RAID controller or SAN, this could default to as low as 4K or as high as 1024K. I won't cover what differences in segment sizes mean for performance - that's an entirely different discussion - but generally speaking defaults are usually 64K or 128K. The basic idea behind a proper stripe size is that you want to size it so that most of your reads/writes can happen in 1 operation.
From there, you need to ensure that your block or file allocation unit size is set properly - ideally the same size as or smaller than the segment size, with the segment size being a multiple of it. Lastly, you should then set the offset to the same as the segment size. By default, Windows 2003 will offset by 31.5K, Windows 2008 by 1024K, and VMware VMFS defaults to 128.
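The arithmetic behind alignment is easy to check for yourself. This minimal Python sketch models a single layer only (one partition on one striped array), so it illustrates the principle rather than the multi-layer VMware graphics; the function names and the 64K segment size in the example are assumptions chosen for illustration.

```python
KB = 1024


def is_aligned(offset_bytes, segment_bytes):
    """True when the partition offset falls exactly on a segment boundary."""
    return offset_bytes % segment_bytes == 0


def segments_touched(offset_bytes, length_bytes, segment_bytes):
    """Count the segments an I/O of length_bytes starting at offset_bytes spans."""
    first = offset_bytes // segment_bytes
    last = (offset_bytes + length_bytes - 1) // segment_bytes
    return last - first + 1


segment = 64 * KB                 # assumed segment size
win2003_offset = 63 * 512         # 31.5K, the Windows 2003 default
win2008_offset = 1024 * KB        # the Windows 2008 default

print(is_aligned(win2003_offset, segment))                  # False
print(is_aligned(win2008_offset, segment))                  # True
print(segments_touched(win2003_offset, 64 * KB, segment))   # 2
print(segments_touched(win2008_offset, 64 * KB, segment))   # 1
```

With the 31.5K offset, a 64K read at the start of the partition straddles two segments, while the same read at a 1024K offset stays within one.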
Setting the segment size may or may not be an online operation - whether it can be done to an already configured array or only during the initial configuration depends entirely on your RAID controller or SAN. Changing the offset and/or block size of a partition, however, is NOT an online operation. This means that all data will have to be removed from the partition, the offset configured, and the partition recreated. Prior to Windows 2008, this cannot be done to system partitions, so for Windows 2003, you would have to attach the virtual hard disk to another system, set the offset and format the partition, and then perform the Windows installation.
The following links provide detailed information about aligning partitions in both VMware and Windows. Consult your SAN or RAID controller documentation for setting or finding out the segment size.
You can subscribe to Jeff Graves' feed via RSS to receive updates when new entries are posted.