Many people have been asking what they need to check to make sure SCCM is healthy.
If I were you, these are the things I would check.
This list might not be complete, but I have tried to include everything I could think of.
Administrative Activities:
Daily Administrative Task:
Daily Site Monitoring Tasks:
To best maintain your system, perform the following monitoring tasks on a daily
basis. If there is any indication of a problem, isolate and repair the problem
to ensure that the site remains healthy.
Daily site monitoring tasks include:
- Checking SCCM site database status.
- Checking site server status.
- Checking site systems status.
- Checking client status.
- Checking the operating system event log.
- Checking the SQL Server error log.
- Checking system performance.
- Checking SCCM system folders.
Check SCCM Site Database Status
Use the SQL Server DBCC command to check the health of the SCCM site database. Use
any other tools available to test the health of the SCCM site database.
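If you want to script this check, here is a minimal sketch in Python that only builds the T-SQL statement to run. DBCC CHECKDB and its NO_INFOMSGS option are standard SQL Server features; the database name SMS_ABC and the helper name are assumptions made for illustration, so substitute your own site database name.

```python
def build_checkdb_statement(database_name: str) -> str:
    """Return a DBCC CHECKDB statement for the given database."""
    # Escape closing brackets so the name is safe inside [ ] delimiters.
    quoted = "[" + database_name.replace("]", "]]") + "]"
    # NO_INFOMSGS suppresses informational output so that only errors remain.
    return f"DBCC CHECKDB ({quoted}) WITH NO_INFOMSGS;"


# "SMS_ABC" is a hypothetical site database name.
statement = build_checkdb_statement("SMS_ABC")
print(statement)
```

You can paste the resulting statement into a query window or hand it to whatever SQL client you already use. On a large database, consider running it outside peak hours.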
Check Site Server Status
View site status summary information in the SCCM Administrator console, or create
reports that summarize the server activity and status (such as the Clients that
Received a Specific Advertised Program report). If necessary, check status
messages of individual components. If status messages indicate that a problem
exists, view the relevant log files for further details. Isolate and fix
conditions that generate errors or warnings. If appropriate, reconfigure the
status system so that only relevant and helpful messages are recorded.
Check the status of items such as:
- Site components and services. Check whether any site server component or
service is experiencing problems.
- Packages and advertisements. Check the status of packages and advertisements in
your site. Check package and advertisement status messages to ensure that
package source files reach distribution points, and that advertised
programs reach clients. Check status messages that are returned from
clients to see whether the clients run programs successfully.
- Site-to-site communication. Check communication between the site and its parent and
child sites (if they exist). Check status messages and, if necessary,
check log files of the Replication Manager, Scheduler, and Senders on the
site to determine whether the site is having communication problems.
Check Site Systems Status
Check the state of site systems throughout the SCCM hierarchy. Use the status system
and, if necessary, use log files to determine if site systems are having
problems, such as:
- Low level of available disk space.
- SCCM components that cannot connect with a site system.
Check Client Status
Check the state of clients in the hierarchy. Run queries on status messages to detect
any problems that clients might be having, such as:
- Client components are experiencing problems.
- Clients are failing to install.
- Clients are not reporting software inventory or hardware inventory.
- Clients are not reporting heartbeat discovery data regularly (or have not
reported it for the past x days).
- The client count is unexpectedly increasing or decreasing at a fast rate.
You can monitor a client's status only if it creates status messages, and these
status messages reach the site server. To detect clients from which you are
missing status messages, you need to run a query that returns all clients that
have not reported a status message within the last <n> days. In this
query, <n> is the length of time within which you would expect to receive a status
message from that client (taking into account the frequency of hardware or software
inventory and the regular time it takes for status messages to reach the site
server).
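The logic behind such a query can be sketched in a few lines of Python. This is only an illustration of the "nothing heard within the last n days" filter. The client names, the timestamps, and the function name are hypothetical, and in practice the data would come from your own status message query or report.

```python
from datetime import datetime, timedelta


def clients_missing_status(last_status, now, max_age_days):
    """Return the clients whose last status message is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, seen in last_status.items() if seen < cutoff)


# Hypothetical sample data: client name -> time of the last status message.
sample = {
    "PC-001": datetime(2024, 1, 10),
    "PC-002": datetime(2024, 1, 2),
    "PC-003": datetime(2023, 12, 20),
}

stale = clients_missing_status(sample, now=datetime(2024, 1, 12), max_age_days=7)
print(stale)
```

Pick the number of days so that it comfortably exceeds your inventory schedule, otherwise healthy clients will show up as missing.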
Check the Operating System Event Log
On key servers, check the application, system, and security event logs. You
can access those through the Event Viewer administrative tool. Look for
messages that indicate error conditions or developing problems. Isolate and
repair the conditions that generate error or warning messages.
By default, a newly installed SCCM site server is configured to write status
messages to the event log. This helps you identify any developing problems with
SCCM.
Note: When SCCM is configured to write status messages to Windows event logs, SCCM
error status messages are written as Information events, not Error events.
Save instances of the most recent event log files for future comparison. When you
can compare current log files with previous log files, it is easier to detect
problems that are developing. After saving the log files, you can clear them
from the event log so it is easier to detect new problems.
Check the SQL Server Error Log
Check the SQL Server error log in SQL Server Management Studio (or in SQL
Enterprise Manager on older SQL Server versions). Look
for messages that indicate error conditions. Isolate and repair the conditions
that generate error or warning messages.
Check System Performance
To check whether the site server and component servers have sufficient resources
and that SCCM site services are running optimally, you must monitor site server
and component server performance. Use performance-monitoring tools such as
System Monitor in the Performance console. Check the status of critical
components on the site server, on the computer running SQL Server, and on other
SCCM site systems.
SCCM installs many performance monitor counters, but you can add, remove, and
configure counters as needed. You can also use the SQL Server performance
counters.
Save performance log files for future comparison. It is easier to detect performance
trends, and to identify potential bottlenecks, when you compare current
performance log files with previous ones.
Check Status Filter Rules
Check whether it is possible to reduce the amount of traffic
generated by status messages being replicated throughout the hierarchy. If the
site is currently healthy, it might be possible to add status filter rules that
prevent replication of status messages that are not necessary.
Check the Inboxes to Monitor
The following ConfigMgr inboxes should be checked on a regular
basis to ensure that your site(s) function as expected.
Auth\Dataldr.Box: A backlog of files can indicate problems accessing the site database.
Auth\Dataldr.Box\Process: A backlog of files can indicate problems accessing the site database.
Auth\Ddm.box\Bad_DDRs: A backlog of files can indicate a network corruption problem or a problem with the DDM.
Auth\Sinv.Box: A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.
Auth\Sinv.Box\Orphans: A backlog of files can indicate problems with specific clients, with management points, or with the network that could cause data corruption.
Compsumm.Box: A backlog of files can indicate that the Component Status Summarizer cannot process the volume of messages.
Dataldr.Box: A backlog of files can indicate problems accessing the SCCM site database.
Dataldr.Box\Badmifs: A backlog of files can indicate a bad custom MIF file or that a client computer cannot transfer the file correctly.
Ddm.Box: A backlog of files can indicate that a bad DDR is preventing other DDRs from being processed.
Ddm.Box\Bad_DDRs: A backlog of files can indicate a network corruption problem or a problem with the DDM.
OfferSum.Box: A backlog of files can indicate a performance problem that is caused by a large number of messages.
Policypv.Box: A backlog of files in the policypv.box folder indicates that the policy provider component is not running.
Replmgr.Box\Ready: A backlog of files can indicate that the Scheduler is backlogged or is already processing files of the same priority.
Schedule.Box: A backlog of files can indicate that the Sender cannot connect to or cannot transfer data to another site.
Schedule.Box\Outboxes: A backlog of .srq files indicates that the sender cannot process the number of jobs scheduled for that sender or that the sender cannot connect to or transfer data to another site.
Schedule.Box\Tosend: A backlog of files can indicate that many send requests are not completed or that the Scheduler has not yet deleted the files.
Sinv.Box: A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.
Sinv.Box\BadSinv: A backlog of files can indicate problems with specific clients, with management points, or with the network, causing data corruption.
SiteStat.Box: A backlog of files can indicate a performance problem. Examine status messages for the Site System Status Summarizer for possible problems.
Statmgr.Box\Futureq: A backlog of files can indicate that some site systems' clocks are not synchronized with the site server.
Statmgr.Box\Queue: A backlog of files can indicate a problem with the Status Manager or that the component is trying to process too many messages.
Statmgr.Box\Retry: A backlog of files can indicate problems with the connection to the computer that is running SQL Server.
Statmgr.Box\Statmsgs: A backlog of files can indicate a problem with the Status Manager or that the Status Manager is trying to process too many messages.
Swmproc.Box: A backlog of .sum and .sur files can indicate that the Software Metering Processor component cannot connect to the SCCM database.
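Checking these folders by hand every day gets tedious, so here is a rough Python sketch that counts the files in each inbox and reports the ones above a threshold. The function name, the threshold, and the shortened folder list are my own assumptions for illustration; what counts as a backlog depends on the size of your site.

```python
from pathlib import Path

# A few of the inboxes listed above, relative to the inboxes root folder.
# Extend this list with the remaining folders as needed.
MONITORED_INBOXES = [
    "auth/dataldr.box",
    "auth/ddm.box/bad_ddrs",
    "dataldr.box/badmifs",
    "schedule.box/outboxes",
    "statmgr.box/queue",
]


def count_backlog(inboxes_root, relative_paths, threshold):
    """Return {inbox: file count} for inboxes holding more than threshold files."""
    backlog = {}
    for rel in relative_paths:
        folder = Path(inboxes_root) / rel
        if not folder.is_dir():
            # Skip inboxes that do not exist on this site server.
            continue
        count = sum(1 for entry in folder.iterdir() if entry.is_file())
        if count > threshold:
            backlog[rel] = count
    return backlog
```

Call it with the path to the inboxes folder of your own installation, for example count_backlog(inboxes_path, MONITORED_INBOXES, threshold=500), where inboxes_path and the value 500 are placeholders. Only files directly inside each folder are counted, not files in subfolders.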
Check Daily Maintenance Task
Check that the daily maintenance tasks, if any are configured, have run. You can
use smsdbmon.log for more details.
Weekly Administrative Task
Weekly Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks on a weekly
basis. If there is any indication of a problem, isolate and repair the problem
to ensure that the site remains healthy.
Weekly site monitoring tasks include:
- Checking SCCM site database available space.
- Checking available disk space.
Check SCCM Site Database Available Space
To find the amount of space used by the database, run the
SQL Server stored procedure sp_spaceused against the SCCM site database. For
more details about space used, see the SQL Server Help. Check the tempdb
database at peak usage, when several instances of the SCCM Administrator console
are using the database and the site is actively processing objects.
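When run without arguments, sp_spaceused reports database_size and unallocated space as text values such as "2048.00 MB". The Python sketch below turns two such values into a rough free-space percentage. The function names and sample figures are hypothetical, and because database_size also includes the log files, treat the result as an approximation rather than an exact measure.

```python
def parse_mb(value: str) -> float:
    """Convert a string such as '2048.00 MB' into a number of megabytes."""
    number, unit = value.split()
    if unit.upper() != "MB":
        raise ValueError(f"unexpected unit in {value!r}")
    return float(number)


def unallocated_percent(database_size: str, unallocated_space: str) -> float:
    """Return unallocated space as a percentage of the total database size."""
    total = parse_mb(database_size)
    free = parse_mb(unallocated_space)
    return 100.0 * free / total


# Hypothetical values copied from an sp_spaceused result.
print(unallocated_percent("2048.00 MB", "512.00 MB"))
```

Recording this figure every week makes it easier to see how quickly the database is growing.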
Check Available Disk Space
Check the amount of available disk space on the site server, the SCCM site database
server, and other SCCM servers. Ensure that the amount of free disk space is
sufficient for SCCM and SQL Server to perform properly during regular and
increased activity load.
To use the Status System to view information about site system disk space:
- In the SCCM Administrator console, navigate to Site System Status:
Configuration Manager > Site Database > System Status > Site Status > <site name> > Site System Status
- In the details pane, view status information of site systems such as free
disk space.
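Outside the console, free disk space can also be checked with a small script run on each server. The following Python sketch uses shutil.disk_usage from the standard library; the function name and the 15 percent threshold are assumptions for illustration, not an SCCM recommendation.

```python
import shutil


def low_disk_volumes(paths, min_free_fraction=0.15):
    """Return {path: free space in percent} for volumes below the threshold."""
    low = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            # Skip volumes that report no capacity at all.
            continue
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            low[path] = round(free_fraction * 100, 1)
    return low


# Check the volume that holds the current directory.
print(low_disk_volumes(["."]))
```

Pass the drive roots that hold the SCCM installation, the package source files, and the SQL Server data files.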
Weekly Site Maintenance Tasks
To best maintain your system, perform the maintenance tasks in this section on a
weekly basis. You can automate some tasks by scheduling predefined maintenance
tasks or custom maintenance tasks, as appropriate, to run on a weekly basis.
Weekly site maintenance tasks are:
- Weekly automated tasks.
- Delete unnecessary files.
- Delete unnecessary SCCM objects.
- Produce and distribute end-user reports.
- Run disk defragmentation tools.
- Back up application, security, and system event logs.
Weekly Automated Tasks
The following predefined maintenance tasks should be scheduled to run on a weekly
basis. For more information about these tasks, see the "Predefined Site
Maintenance Tasks" section earlier in this chapter.
- Rebuild indexes
- Monitor keys
- Delete aged inventory history
- Delete aged discovery data
- Delete aged collected files
- Delete aged software metering data
- Delete aged software metering summary data
- Summarize software metering data
- Summarize software metering periodic usage data
Delete Unnecessary Files
If Management Information Format (MIF) files or IDMIFs are used to extend hardware
inventory in your site, then any MIF files that are not valid are placed in the
SCCM\inboxes\dataldr.box\BADMIFS folder and SCCM never removes them. You must
empty this folder manually. If a large number of MIFs are placed in the BADMIFS
folder, it is likely that a MIF generating tool is producing the MIFs with an
incorrect format. Investigate and repair the cause of the bad MIFs.
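Emptying the folder can be scripted. Here is a cautious Python sketch that removes files older than a given number of days and defaults to a dry run, so nothing is deleted until you ask for it. The function name and the retention period are assumptions for illustration. Look at a few of the bad MIFs before purging them, because they are your evidence of what is going wrong.

```python
import time
from pathlib import Path


def purge_old_files(folder, max_age_days, dry_run=True, now=None):
    """Return the names of files older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    matched = []
    for entry in Path(folder).iterdir():
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            if not dry_run:
                entry.unlink()
            matched.append(entry.name)
    return sorted(matched)
```

Run it first with the default dry_run=True to see what would be removed, then again with dry_run=False once the list looks right.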
Delete Unnecessary SCCM Objects
Delete objects such as collections, queries, and packages that are no longer needed at
the site. Deleting unnecessary objects saves disk space, reduces intersite
replications, and increases performance.
Caution: When deleting a collection, any advertisements to that collection are also
deleted.
Run Disk Defragmentation Tools
Over time, disk volumes on SCCM site systems become fragmented. Site operations such
as distributing large software packages might significantly increase
fragmentation on site servers and distribution points. As fragmentation
increases, disk operations take longer and overall site performance
decreases.
Run disk defragmentation tools on the SCCM site server and all other site systems
to maintain the performance level of disk operations.
Back Up Application, Security, and System Event Logs
Windows event logs can get full, and by default, new items will start to overwrite
older items. To diagnose problems, and for other reasons, it might be necessary
to refer to an older event log. It is recommended that you back up Windows
event logs, and store the backups in a safe and accessible location. If
necessary, increase the default log file size to accommodate larger amounts of
data.
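On Windows versions that include the wevtutil command-line tool, the export can be scripted with its epl (export-log) command. The Python sketch below only builds the command lines, so you can review them before running anything. The backup folder, the file naming scheme, and the function name are assumptions for illustration.

```python
from datetime import date
from pathlib import PureWindowsPath


def export_commands(backup_dir, day, logs=("Application", "Security", "System")):
    """Build one 'wevtutil epl' command per event log, with dated file names."""
    commands = []
    for log in logs:
        target = PureWindowsPath(backup_dir) / f"{log}-{day.isoformat()}.evtx"
        commands.append(["wevtutil", "epl", log, str(target)])
    return commands


# D:\EventLogBackups is a hypothetical backup location.
for command in export_commands(r"D:\EventLogBackups", date(2024, 1, 12)):
    print(" ".join(command))
```

Each command list can be passed to subprocess.run on the server itself, from an elevated session. Exporting the Security log in particular requires administrative rights.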
Periodic Administrative Task
Periodic Site Maintenance Tasks
To best maintain your system,
perform the following tasks periodically. Use the predefined maintenance tasks
when appropriate.
Periodic site maintenance tasks include:
- Backing up account data.
- Changing accounts and passwords.
- Checking network performance.
- Reviewing the security plan.
- Reviewing the maintenance plan.
- Performing recovery tests.
Back Up Account Data
To properly recover a site server, you must have information about the accounts
that SCCM used before the site failed. Account data is stored in domain controllers.
Use Microsoft tools, such as the NTBackup.exe tool that comes with Windows Server,
or third-party tools to back up account data as follows:
- If there are multiple domain controllers in your infrastructure that contain
the SCCM account database, you need to periodically back up the account
database. (If Active Directory® directory service is implemented in your
organization, then such a task might be included in the Active Directory
maintenance plan.)
- If the account database is stored on a single domain controller, then back up
the account database frequently. Depending on the frequency of changes to
account data, you might need to add this task to the site's daily or
weekly maintenance tasks.
- If the account data is stored on member servers, then regularly back up the
whole operating system that contains the account data, using software that
backs up account lists and the account database.
- Whenever there is a change to the password of the Client Push Installation account
or to the site system connection accounts, you should note that change.
For security reasons, SCCM encrypts the Client Push Installation
account and the site system connection accounts. You need to be able to
retrieve these accounts' passwords so that you can re-enter them during a
site recovery operation.
- In between account database backups, document any changes to accounts. Write
down and save any changes made to SCCM accounts and share rights so that
you can apply those changes again after recovering the site.
Change Accounts and Passwords
To maintain the level of security in your hierarchy, you must periodically change
the passwords and the accounts that SCCM sites use. Report any changes to the
security staff so that security administrators know that these changes are
planned and authorized.
To develop an effective security maintenance plan for your SCCM hierarchy, you
must thoroughly understand how security is deployed in your hierarchy and make
the following decisions:
- Which accounts need to be changed, and for which accounts is it sufficient to
change only the password.
- How often to change passwords and accounts.
- How to change passwords and accounts (such as by running SCCM site reset).
- Which accounts cannot be configured by the administrator (either the account
name cannot be changed, or the password cannot be manually modified).
Check Network Performance
Check the available bandwidth and error rates on the networks used by the SCCM
hierarchy. Use Network Monitor to capture and analyze network frames so you can
diagnose network problems and look for optimization opportunities.
Review the Security Plan
SCCM evolves with time. User roles change, and people might no longer need access to
some or any of the SCCM functions. Although most changes in access permission
should be implemented after role or staff changes, you should also periodically
review the access for all users or groups to identify and delete unauthorized
access permissions.
The security plan implemented for the SCCM hierarchy in your organization needs to support
the risk assessment of your organization. As your organization changes,
policies can become ineffective.
Review security-related settings such as:
- Who has access to SQL Server and to the SCCM site database.
- Who can download from SCCM distribution points.
- Which accounts have permissions within SCCM security.
Periodically, re-evaluate the risk assessment of your organization, and then review and
update the security plan accordingly.
Review the Maintenance Plan
Use the maintenance plan document to review the SCCM maintenance plan. SCCM evolves
with time, and it might be necessary to adjust the maintenance plan to
accommodate growth, development, and other changes in your organization.
If there were any changes in your organization's security strategy, backup and
recovery strategy, or any other strategy that affects SCCM, then determine if
the maintenance plan needs to be adjusted to reflect these changes.
Review the configuration of maintenance tasks. Check the amount of data in the site database
and evaluate the usefulness of that data against the amount of space that it
occupies in the database. If necessary, adjust the settings that determine the
number of days that data is retained in the database.
Update the maintenance plan document to reflect any changes to the maintenance plan,
and then distribute it to all SCCM administrators that are using it.
Perform Recovery Tests in a Test Lab
The best way to be
fully prepared for a site recovery operation is to ensure that the recovery
plan is adequate and that administrators are familiar with the recovery
process. After you develop a recovery plan for your site, it is recommended
that you perform periodic recovery tests in a test lab.
A recovery test should follow the recovery plan developed for the production
environment. Plan to perform a recovery test of the central site, and of any
other systems deployed in your hierarchy. A recovery test should test all
phases of recovery, including:
- Backing up a site.
- Archiving the backup snapshot.
- Simulating a site failure, such as by turning a server off.
- Recovering the failed site.
- Verifying the success of the recovery operation.
You might schedule periodic recovery tests. Company policy might require that
new administrators always perform a recovery test. It is strongly
recommended that you always include a recovery test when testing major
changes to the hierarchy.
For example, before upgrading site server operating systems, you should
probably first test the upgrade in the test lab. After completing the
upgrade in the test lab, you should perform a recovery test to identify
any issues or adjustments to the recovery plan associated with the
operating system upgrade. This ensures that if you upgrade the servers in
the production environment, you will still be able to successfully recover
a failed site.
Include a recovery test in every major deployment test, such as:
- A major operating system upgrade (not a service pack).
- A major change to the networking infrastructure.
- New equipment deployment or building relocation.
- An SCCM major version site upgrade.
Periodic Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks periodically.
If there is any indication of a problem, isolate and repair the problem to
ensure that the site remains healthy.
Periodic site monitoring tasks include:
- Checking hardware.
- Checking the site's overall health.
- Checking the backup snapshot.
Check Hardware
Even high-quality hardware occasionally fails. Sometimes, it fails gradually, so
there might be early signs. Replacing hardware before it completely fails is a
key step in preventing site failure. Both Windows and SCCM provide performance
counters, which you can use to monitor the performance and state of the
hardware used in the site.
As soon as you notice any signs of hardware-related unreliable behavior of an SCCM
server, replace the hardware. To properly replace server hardware, you must use
the Recovery Expert. For more information about swapping the computer of SCCM
servers, see the "Swapping the Computer of a Site Server" section
later in this chapter.
Check Site's Overall Health
It is recommended that you periodically perform a more thorough health check, as
follows:
- Ensure that all SCCM services are running.
- Review the Status Message System for Critical status.
- Ensure that all the latest service packs are installed.
- Ensure that the latest critical security patches are installed.
- Examine the System and Application event logs for errors. Note: When
SCCM is configured to write status messages to the system's event log,
SCCM error status messages are written as Information events, not Error
events.
- Run a query to determine if discovery data is being updated correctly in the
SCCM site database. The query should list all installed clients in which
System Resource - Agent Time is not within the heartbeat interval. It is
expected that some clients might be offline, but in other cases, it might
indicate a problem.
- Run a query to determine if software inventory data is being updated correctly
in the SCCM site database. The query should list all installed clients in
which Last Software Scan - Last Inventory Collection is not within the
software inventory interval. It is expected that some clients might be
offline, but in other cases, it might indicate a problem.
- Run a query to determine if hardware inventory data is being updated correctly
in the SCCM site database. The query should list all installed clients in
which Workstation Status - Last Hardware Scan is not within the hardware
inventory interval. It is expected that some clients might be offline, but
in other cases, it might indicate a problem.
If any of these tests fail, you need to diagnose the problem and repair it.
Check the Backup Snapshot
At the end of every site backup cycle you should check the validity of the backup
snapshot. Periodically, you should perform a more thorough check to ensure that
the site's backup snapshots can be successfully used for recovery.
Restore a recent backup snapshot to a disk and examine file continuity, file size, and
other file properties to ensure that they do not seem corrupted. Check critical
files by restoring these files to their respective applications to ensure that
the application can use the restored file.
cheers
Happy Learning.....!