Batch schedule monitoring

Admin007

Batch schedule monitoring

Post by Admin007 » 10 Dec 2009 7:43

Does anyone have any suggestions on how our computer operators can do a better job of identifying long-running jobs in the Control-M/EM?

We have ~10,000 jobs that process daily, and only about 100 are executing at any given time. Even so, our operators often miss long-running jobs (a job that would normally run for 5 minutes still executing after 5 hours), and this causes issues for the batch schedule.

I've advised 'drilling down' and checking the statistics and filtering only on executing jobs but they don't always adhere to this practice. I am reluctant to add alert shouts to jobs for execution time because you can end up w/ way too many shout messages and they would also be ignored.

Any input would be appreciated! Thanks!

Dilbert
Nouveau
Posts: 185
Joined: 10 Jan 2007 12:00
Location: Zagreb, Croatia

Batch Impact Manager

Post by Dilbert » 11 Dec 2009 8:38

In your case you should purchase BMC CONTROL-M Batch Impact Manager (BIM), an add-on for the CONTROL-M solution, which provides exactly what you want. When defining your critical jobs, you can define their deadline and SLA; if those jobs exceed the SLA, you will get a notification. You can use the web-based GUI to track those jobs (services, in BIM terminology), or you can use the CONTROL-M GUI and Alerts. That way, your staff can effectively monitor those long-running jobs.

For more information, check BMC documentation.

Admin007

Batch schedule monitoring

Post by Admin007 » 11 Dec 2009 2:55

Dilbert, we do have BIM (I neglected to mention that earlier). We have many legs of our schedule defined in the critical path. However, as I mentioned, due to the volume of jobs we process on a regular basis, we do not define ALL jobs as critical. Therefore, many jobs are not defined in BIM, and those are the ones our operators miss as 'long running'. Some of those jobs, although not critical in the business sense, can have repercussions for other parts of the batch schedule. So, any ideas for how to monitor those jobs outside BIM more effectively?

gglau
Nouveau
Posts: 317
Joined: 13 Jun 2007 12:00

Post by gglau » 11 Dec 2009 4:39

What do you expect an operator to do? Drill down on an executing job to find out when it started, then look at the job statistics and decide it has been running too long? Isn't that the equivalent of a shout on EXECTIME? How long is too long before action is called for? With EXECTIME, you make that decision.

A shout on EXECTIME allows a maximum of +900%, which is 10 times a job's average, or a maximum of +999 minutes (about 16.5 hours) above average. At such a high value there should not be too many shouts; in fact, you may want to use a smaller value so that shouts actually happen.

It is unrealistic to rely on an operator to sniff out long-running jobs manually. Add a shout on EXECTIME to every job, and you can give the operators a stick if they don't act on the long-running alerts.
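For example, a job's PostProc section could carry a shout like the following (the 200% threshold and the message text here are illustrative, not from the original posts; the syntax mirrors the SHOUT statement shown later in this thread):

```
SHOUT WHEN EXECTIME >200% DEST ECS MESSAGE %%JOBNAME exceeded twice its average run time
```

A job whose elapsed time passes twice its statistical average would then raise an alert in the ECS Alerts window without interrupting the job itself.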

ejtoedtli
Nouveau
Posts: 51
Joined: 19 Nov 2008 12:00
Location: Portland, Or. - U.S.A.

Post by ejtoedtli » 11 Dec 2009 6:21

We only have a few jobs that we are concerned about running too long. On those I have an EXECTIME shout coded.

th_alejandro
Nouveau
Posts: 188
Joined: 26 Nov 2008 12:00
Location: Bogotá

Options

Post by th_alejandro » 22 Dec 2009 12:25

Use EXECTIME or LATETIME to control a job's execution time or its late start. That way you can generate an ECS alert to the operator while the job keeps running...

nicolas_mulot
Nouveau
Posts: 149
Joined: 07 Jan 2010 12:00

Post by nicolas_mulot » 08 Jan 2010 7:59

Admin007,
All the suggestions come down to specifying shouts, and I don't see any other way to dynamically "mark" jobs that run for too long. I understand that if you send messages to the alert window, they could be lost in the traffic.

I tested a little trick which works and might help you.

Instead of specifying ECS as the destination, route the message to a script by defining a new shout destination of type "program"; let's call it JOBTOOLONG.

Your selected jobs should then include the following PostProc parameter:
SHOUT WHEN EXECTIME >100% DEST JOBTOOLONG MESSAGE %%ORDERID

The effect of the script is to mark the job definition (in the AJF) so that jobs which run for too long can be filtered. The mark is a modification performed by ctmpsm, as part of the script. The lightest modification I could imagine, which does not affect the production flow, is the addition of a standardized OUT condition (the same for all jobs):
For example:
ctmpsm -updateajf <the_OrderId> CONDADDOUT JOB_IS_TOO_LONG ODAT -
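A minimal sketch of such a shout-destination script (the function name and the dry-run fallback are assumptions for illustration; ctmpsm is only invoked when it is actually on the PATH):

```shell
#!/bin/sh
# Sketch of the "JOBTOOLONG" shout-destination program described above.
# Assumption: Control-M delivers the shout message (%%ORDERID) to the
# destination program as positional parameter 2.

mark_job_too_long() {
    orderid="$2"        # $1 is the destination name, $2 the message (%%ORDERID)
    if [ -z "$orderid" ]; then
        echo "usage: mark_job_too_long <dest> <orderid>" >&2
        return 1
    fi
    # Add the standardized OUT condition to the job's AJF entry so the
    # "run too long" viewpoint can filter on it.
    cmd="ctmpsm -updateajf $orderid CONDADDOUT JOB_IS_TOO_LONG ODAT -"
    if command -v ctmpsm >/dev/null 2>&1; then
        $cmd
    else
        # Outside a Control-M environment, just show what would run.
        echo "DRY-RUN: $cmd"
    fi
}

mark_job_too_long JOBTOOLONG 00042    # example invocation with a dummy order ID
```

On a machine without Control-M installed, the example invocation prints the ctmpsm command it would have run instead of executing it.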

Apart from that, define a viewpoint which includes:
1) STATUS Executing
2) OUT CONDITION JOB_IS_TOO_LONG ****

This viewpoint should always be empty; as soon as a job lasts for too long, it will pop up in that viewpoint.

The time limits, the definition of the viewpoint (whether you want to see only running jobs or keep them when finished), and the suggested mark are up to you.
Please note that although the shout message consists of the simple %%ORDERID, this value is received by the script as positional parameter number 2.

Cheers
Nicolas Mulot

gglau
Nouveau
Posts: 317
Joined: 13 Jun 2007 12:00

Post by gglau » 11 Jan 2010 2:25

This is a marvelous idea.

After a job in the Run-Too-Long viewpoint has turned green or red, there should be an SOP to remove the added condition to get it out of the viewpoint.

nicolas_mulot
Nouveau
Posts: 149
Joined: 07 Jan 2010 12:00

Post by nicolas_mulot » 11 Jan 2010 10:42

Well, the idea is to focus on exceptional events. Once notified, the operator should act on the job, probably using a more standard viewpoint.

Since this particular viewpoint is only intended to catch attention, the best approach is to exclude ended jobs (OK or NOTOK) from the viewpoint and leave the modified job order as is.

Cheers
Nicolas Mulot
