CMS dashboard task monitoring

The CMS experiment is entering a phase change in which physicists are turning more intensely to physics analysis and away from detector construction. This brings many challenging issues for the monitoring of user analysis. Physicists must be able to monitor the execution status and the application- and Grid-level messages of their tasks, which may run at any site within the CMS Virtual Organisation.

The CMS Dashboard Task Monitoring project provides this information to individual analysis users by collecting and exposing a user-centric set of information about their submitted tasks, including the reasons for failure, the distribution by site and over time, and the consumed time and processing efficiency. The development was user-driven: physicists were invited to test the prototype in order to assemble further requirements and identify weaknesses in the application.

5.1 Introduction

The Experiment Dashboard[1] is a monitoring system developed for the LHC experiments in order to provide a view of the Grid infrastructure from the perspective of the Virtual Organisation. The CMS Dashboard provides a reliable monitoring system that enables a transparent view of the experiment's activities across different middleware implementations and combines the Grid monitoring data with information that is specific to the experiment.

The scientists must be able to monitor the execution status and the application- and Grid-level messages of their tasks, which may run at any site on the distributed WLCG infrastructure. The existing CMS monitoring systems provide this type of information, but they are not focused on the user's perspective.

The CMS Dashboard Task Monitoring project addresses this gap by collecting and exposing a user-centric set of information regarding submitted tasks. It provides a clear and precise view of the status of the task, including the job distribution by site and over time, the reasons for failure, and advanced graphical plots, giving a more usable and attractive interface to the analysis and production user. The development was user-driven: physicists were invited to test the prototype in order to assemble further requirements and identify weaknesses in the application.

This chapter discusses the development of the CMS Dashboard Task Monitoring application, which was performed by the author. The first section describes the concept of the Experiment Dashboard monitoring system and its framework in detail. The next sections provide an overview of the CMS Dashboard Task Monitoring application and its features. The two final sections focus on the known issues and future work, and draw some conclusions.

5.2 Design

The following sections discuss the requirements that shaped the design of the CMS Dashboard Task Monitoring application.

5.2.1 Objectives

Most of the CMS analysis users interact with the Grid via the CMS Remote Analysis Builder (CRAB). User analysis jobs can be submitted either directly to the WLCG infrastructure or via the CRAB analysis server, which operates on behalf of the user. In the first case, the support team does not have access to the log files of the user's job or to the CRAB working directory, which keeps track of the task generation.

To understand the root cause of a problem with a particular user's task, the support team needs a monitoring system capable of providing complete information about the task processing. To serve the needs of the analysis community and of the analysis support team, the CMS Dashboard Task Monitoring[1] application has been developed on top of the CMS job monitoring repository.

5.2.2 Use Cases

A use case analysis was carried out based upon the feedback received by the CMS physicist community. The main use cases are described in Appendix XXX and illustrated in Figure 1.

With the major use cases established it is possible to extract the key requirements that the application has to fulfil. The following points represent the baseline requirements divided into principal areas.

5.2.3 Requirements

Assumptions

  1. Users have a grid certificate.
  2. Users are members of the CMS VO.
  3. Users have submitted jobs to the Grid within the last month.

User Interface

  1. Users control the application via a web interface using a browser.
  2. The application will be focused on the CMS analysis user's perspective.
  3. Easy to understand how the tool works and how to navigate through it.
  4. Compatible with all recent browsers and operating systems.
  5. Simple, clean and intuitive in layout containing no unnecessary information.
  6. All of the Grids and the job submission systems that CMS uses will be supported.
  7. The user will have access to very detailed information about the job processing, including every resubmission that he/she may have performed for each individual job.
  8. The application will offer task meta-information.
  9. The application will offer consumed time information and processing efficiency.
  10. Individual jobs within a task can be selected.
  11. Fast with very low latency.
  12. Update in 'real-time' from the worker nodes where the jobs are running.
  13. The user will be able to bookmark his/her favourite tasks for later use or to share them among his/her colleagues.
  14. Offer a wide selection of advanced graphical plots that will visually assist the user.
  15. The application will be built on top of the CMS Dashboard Job Monitoring Data Repository.
  16. Exceptions should be caught by the application and informative error messages will be provided to the users.
  17. Verbose logging should be available to identify any problems.
  18. Quick access to the application's manual, help and the meanings of the error exit codes should be provided.

Developer's Requirements

  1. Variable level of logging will be built in from the start.
  2. Logging will write to stdout and to a file to ease debugging.
  3. Low coupling between the components is required.
  4. Minimum version of Python that is supported is determined by that installed on lxplus.cern.ch (currently 2.3).

5.2.4 Architecture

The CMS Dashboard Task Monitoring application is part of the Experiment Dashboard system[2] which is widely used by the four LHC experiments. The framework of the system consists of the following components (Figure 2):

  1. The Data Access Layer (DAO) is responsible for the management of the persistent data (stored in an RDBMS). Each component in this layer provides query/update capabilities for a subset of the stored data.
  2. The Web Application is the HTTP entry point to the available data. It exposes the data to the users in different formats and inserts new records or updates existing ones. It makes heavy use of the DAO layer.
  3. The Collectors layer listens to messages/events coming from the Messaging Infrastructure, analyses the data and passes it on to the DAO layer for storage.
  4. The Information Sources layer sits close to the services/applications being monitored and listens for interesting events.
  5. The Messaging System is an external component used by the Dashboard to communicate with the Information Sources.

The Controller is the main piece of the web application (Figure 3). It receives all client requests and decides what to do with them. For each client request there should be a corresponding Action, which will normally involve some interaction with the model of the application (some business logic that might involve accessing or updating persistent data).

A client request might involve producing some output. This output is identified by its mime/type and will have a View associated with it. The Action will put any data that it collected/produced in a shared area (the ActionContext) so that it can later be taken by the View to produce the output to the client.

All the relationships between client requests, Actions, Views and their associated mime/types are defined in a single configuration file, the ActionMapping file. A widely used format for data retrieval is HTML, but information can also be retrieved in XML, CSV or image formats, allowing any third-party application to use the system. The sequence of actions of the Web Application is illustrated in Figure 4.
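As a rough sketch of this request flow, the Controller, Action, View and ActionMapping roles can be modelled in a few lines of Python. All class names and the mapping below are illustrative stand-ins, not the actual Dashboard code.

```python
class ActionContext(dict):
    """Shared area where an Action stores data for the View."""

class TaskSummaryAction(object):
    def execute(self, ctx, request):
        # In the real application this would query the DAO layer.
        ctx["summaries"] = {"pending": 2, "running": 5}

class XmlView(object):
    mime_type = "text/xml"
    def render(self, ctx):
        items = "".join("<%s>%s</%s>" % (k, v, k)
                        for k, v in sorted(ctx["summaries"].items()))
        return "<summaries>%s</summaries>" % items

# The ActionMapping file is modelled here as a plain dictionary:
# request name -> (Action class, {mime/type: View class}).
ACTION_MAPPING = {
    "taskstable": (TaskSummaryAction, {"text/xml": XmlView}),
}

def controller(request_name, accept, request=None):
    """Dispatch a client request to its Action, then render via the View."""
    action_cls, views = ACTION_MAPPING[request_name]
    ctx = ActionContext()
    action_cls().execute(ctx, request)
    return views[accept]().render(ctx)
```

The key design point is that the Action never produces output itself; it only fills the ActionContext, so the same Action can serve HTML, XML or CSV simply by mapping additional Views.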

The Dashboard Task Monitoring application is built on top of the Dashboard Job Monitoring system, which uses multiple sources of information[3]. There are two main architectural principles of the Dashboard Job Monitoring system:

  1. Monitoring should not be intrusive to the information source. Thus, it does not poll information from the primary monitoring sources on a regular basis, to avoid adding load on the services responsible for the job processing.
  2. The Dashboard uses a message-oriented architecture. There is no synchronous connection to the primary information producer. The job submission tools as well as the jobs themselves are instrumented to report in real time important events to the MonALISA[6] servers. The Dashboard Collectors regularly consume information published by the MonALISA servers. At the time when the development of the Dashboard started in the summer of 2005, no messaging system was provided as a standard component of the Grid Middleware stack. The MonALISA system was selected to be used as a messaging system for the Dashboard. Currently, the Dashboard development team is integrating the Dashboard with the Messaging System for the Grid (MSG) [4].
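The second principle can be illustrated with a minimal, hypothetical collector: it consumes job-status events that were already published to the messaging layer and hands them to a DAO for storage, without ever contacting the job-processing services directly. All names here are illustrative.

```python
class InMemoryDao(object):
    """Stand-in for the Dashboard DAO layer."""
    def __init__(self):
        self.status = {}

    def update_job(self, job_id, status):
        self.status[job_id] = status

def collect(messages, dao):
    """Consume job-status events and pass them to the DAO for storage."""
    for msg in messages:
        dao.update_job(msg["job_id"], msg["status"])

# Events as they might arrive from the messaging layer (made-up sample):
dao = InMemoryDao()
collect([{"job_id": "j1", "status": "R"},   # j1 starts running
         {"job_id": "j1", "status": "S"},   # j1 succeeds
         {"job_id": "j2", "status": "F"}],  # j2 fails
        dao)
```

Because the collector only reads from the message stream, a slow or restarted collector delays the monitoring view but never the job processing itself.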

The data collectors gather both Grid-related information as well as information specific to the application which is run by the users (Figure 5). The Grid-related information is obtained in the XML format from the Logging and Bookkeeping Database using the Imperial College Real Time Monitoring publisher (ICRTM)[5]. The application-specific information is gathered throughout a job's lifetime via the MonALISA monitoring system.

The job submission tools of the CMS experiment and the job wrappers generated by these tools are instrumented to report meta-information about a user's tasks and the progress of a user's job to the MonALISA server. The Dashboard then presents all this information in a coherent way, as if all of it came from one source [7].

5.3 Implementation

The Python language was chosen for the development of the CMS Dashboard Task Monitoring due to the power, flexibility and speed of development that it offers. It is also widely used within the High Energy Physics community. Apache 2.0.52 (as of November 2009) was chosen to provide the client interface as it has a history of being flexible, secure and performant. The Dojo JavaScript toolkit was used to connect the web interface with the database. Finally, the GraphTool Python library was used for the creation of all the plots.

The major components that were identified in the requirements are illustrated in Figure 6 and are discussed in more detail in the following sections. The client revolves around the concept of a task which coordinates all of the actions required to satisfy the user requirements.

The relation between the Action and the View Python classes and their generated output files is illustrated in Figure 7. All the Action classes access the database to collect the data; if any calculation on the results is needed, they forward the data to the appropriate View class for the calculation, and the data is then returned to the user in the appropriate output format. There are also 40 Action and View Python classes and 20 output image files for the 20 available plots generated by the application. These Python classes are not shown in Figure 7 for clarity.

5.3.1 CMS Dashboard Database Schema

The CMS Dashboard Task Monitoring application is built on top of the CMS Dashboard Job Processing Data Repository. To ensure a clear design and maintainability of the application, the actual monitoring queries are decoupled from the internal implementation of the data storage.

The CMS Dashboard Task Monitoring application comes with a Data Access Object (DAO) implementation that represents the data access interface. Access to the database is done using a connection pool to reduce the overhead of creating new connections and therefore, the load on the server is reduced and the performance is increased. A flowchart illustrating all the major paths for a client request is shown in Figure 8.
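The pool-backed DAO pattern can be sketched as follows. The real application uses Oracle connection pools; sqlite3, the queue-based pool and the tiny schema below are illustrative stand-ins only.

```python
import sqlite3
from queue import Queue

class ConnectionPool(object):
    """Very small pool: pre-create connections, hand them out, take them back."""
    def __init__(self, size=1):
        self._pool = Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self):
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

class TaskDao(object):
    """Data Access Object: callers see query methods, never raw connections."""
    def __init__(self, pool):
        self._pool = pool

    def count_tasks(self):
        conn = self._pool.acquire()
        try:
            return conn.execute("select count(*) from task").fetchone()[0]
        finally:
            self._pool.release(conn)

# Illustrative setup: one pooled connection holding a toy 'task' table.
pool = ConnectionPool(size=1)
conn = pool.acquire()
conn.execute('create table task ("TaskId" integer)')
conn.execute('insert into task values (1)')
pool.release(conn)
dao = TaskDao(pool)
# dao.count_tasks() -> 1
```

Reusing pooled connections avoids the per-request connection setup cost, which is the load reduction described above.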

Figure 9 illustrates the entity relationship diagram between the most important tables of the database used by the CMS Dashboard Task Monitoring application. The job table contains information regarding the job itself such as the number of events to be analysed, the task that it belongs, the site that the job is running and various submission timestamps. The task table contains task-specific information such as the task creation timestamp, the name of the task, the submission method used, the user that has submitted this task, the input collection and the target Computing Element (CE). The site table contains site-specific information such as the site name, the country that the site belongs to, the Computing Elements of the site and the nodes of the site.

The connection to the database is defined in a single configuration file, the dashboard-dao-oracle-job.cfg as illustrated in the following sample listing.

### ORACLE SPECIFIC CONFIGURATION
[oracle]
# Home of the oracle libraries
oracle_home = /var/www/tmp
# Connection parameters
# You can either specify a set of 'user', 'password', 'host', 'port', 'sid'
# or set the full connection string in the 'connect_string' property
user = ⟨username⟩
password =
host = ⟨hostname⟩
port = ⟨port⟩
sid = ⟨sid⟩
connect_string = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=⟨hostname⟩)(PORT=⟨port⟩)))(CONNECT_DATA=(SID=⟨sid⟩)))
# Pool configuration parameters
pool_min_size = 1
pool_max_size = 2
pool_increment = 1
pool_mon_interval = 600
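A configuration file in this format can be read with Python's standard configparser, as in the sketch below. The actual Dashboard uses its own Config class (visible in the imports of section 5.3.4); the values here are sample placeholders, not real credentials.

```python
from configparser import ConfigParser

# Illustrative in-memory copy of a dashboard-dao-oracle-job.cfg-style file.
cfg_text = """
[oracle]
user = someuser
host = somehost
port = 10121
sid = somesid
pool_min_size = 1
pool_max_size = 2
pool_increment = 1
pool_mon_interval = 600
"""

cfg = ConfigParser()
cfg.read_string(cfg_text)

# Typed access to the pool parameters:
min_size = cfg.getint("oracle", "pool_min_size")
max_size = cfg.getint("oracle", "pool_max_size")
```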

5.3.2 SQL Queries

In this section, the most important SQL queries of the application are presented. The first SQL query fetches the list of all users that have submitted jobs during the last month.

select distinct users."GridName"
from users, task
where users."UserId" = task."UserId"
  and task."TaskCreatedTimeStamp" > sysdate - 31
  and task."TaskTypeId" in (select "TaskTypeId" from task_type
                            where "Type" in ('analysis', 'JobRobot', 'AnaStep09'))
order by users."GridName"
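The bind variables appearing in these queries (such as :gridName and :startDate) are supplied from Python as named parameters. The following sketch uses sqlite3 and a toy two-table schema in place of the Oracle repository; the DN value is made up.

```python
import sqlite3

# Toy stand-in for the users/task tables of the repository.
conn = sqlite3.connect(":memory:")
conn.execute('create table users ("UserId" int, "GridName" text)')
conn.execute('create table task ("UserId" int, "TaskCreatedTimeStamp" text)')
conn.execute("insert into users values (1, '/DC=ch/CN=alice')")
conn.execute("insert into task values (1, '2009-11-01')")

# Named bind parameters are passed as a dictionary; the driver, not string
# concatenation, substitutes the values.
rows = conn.execute(
    'select distinct users."GridName" from users, task '
    'where users."UserId" = task."UserId" '
    'and task."TaskCreatedTimeStamp" > :startDate',
    {"startDate": "2009-10-01"}).fetchall()
```

Binding also lets the database reuse the parsed statement across calls, which matters for queries executed on every page load.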

The second SQL query fetches all the submitted tasks of the user during a selected period of time.

SELECT "TaskId" AS taskid, "TaskMonitorId" AS taskmonid,
       "InputCollection" AS inputcollection, "TaskCreatedTimeStamp",
       MAX(decode(status, 'P', jobsInState, 0)) AS pending,
       MAX(decode(status, 'R', jobsInState, 0)) AS running,
       MAX(decode(status, 'S', jobsInState, 0)) AS success,
       MAX(decode(status, 'F', jobsInState, 0)) AS failed,
       MAX(decode(status, 'U', jobsInState, 0)) AS terminated,
       SUM(jobsInState) AS numofjobs
FROM (
  SELECT "TaskId", "TaskMonitorId", "InputCollection", "TaskCreatedTimeStamp",
         status, COUNT(status) AS jobsInState
  FROM (
    SELECT JS."TaskId", TK."TaskMonitorId", "InputCollection",
           "TaskCreatedTimeStamp", JS.status
    FROM (
      SELECT "TaskId", "TaskMonitorId", "InputCollection", "TaskCreatedTimeStamp"
      FROM task T, input_collection
      WHERE T."TaskCreatedTimeStamp" > :startDate
        AND T."TaskTypeId" IN (SELECT "TaskTypeId" FROM task_type
                               WHERE "Type" IN ('analysis', 'JobRobot', 'AnaStep09'))
        AND T."UserId" IN (SELECT "UserId" FROM users WHERE "GridName" = :gridName)
        AND "INPUT_COLLECTION"."InputCollectionId" = T."InputCollectionId"
    ) TK JOIN (
      SELECT "TaskId", "EventRange", "JobId", "DboardFirstInfoTimeStamp",
             job_status("DboardJobEndId", "DboardStatusId", "DboardGridEndId") AS status,
             ROW_NUMBER() OVER (PARTITION BY "TaskId", "EventRange"
                                ORDER BY "DboardFirstInfoTimeStamp" DESC) AS n
      FROM job
      WHERE job."NextJobId" IS NULL
        AND job."TaskId" IN (
          SELECT "TaskId" FROM task T
          WHERE T."TaskCreatedTimeStamp" > :startDate
            AND T."TaskTypeId" IN (SELECT "TaskTypeId" FROM task_type
                                   WHERE "Type" IN ('analysis', 'JobRobot', 'AnaStep09'))
            AND T."UserId" IN (SELECT "UserId" FROM users WHERE "GridName" = :gridName)
        )
    ) JS ON (JS."TaskId" = TK."TaskId")
    WHERE JS.n <= 1
  ) GROUP BY "TaskId", "TaskMonitorId", "InputCollection", "TaskCreatedTimeStamp", status
) GROUP BY "TaskId", "TaskMonitorId", "InputCollection", "TaskCreatedTimeStamp"
ORDER BY "TaskCreatedTimeStamp"
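The ROW_NUMBER() OVER (PARTITION BY "TaskId", "EventRange" ORDER BY "DboardFirstInfoTimeStamp" DESC) construct in this query keeps only the most recent attempt for each job slot, so resubmitted jobs are not double-counted. In plain Python, with made-up sample rows, the same selection looks like this:

```python
def latest_attempts(rows):
    """Keep, per (TaskId, EventRange) slot, only the newest attempt."""
    newest = {}
    for row in rows:
        key = (row["TaskId"], row["EventRange"])
        if (key not in newest or
                row["DboardFirstInfoTimeStamp"] > newest[key]["DboardFirstInfoTimeStamp"]):
            newest[key] = row
    return list(newest.values())

# Slot 1-100 was resubmitted: the failed first attempt must be superseded.
rows = [
    {"TaskId": 1, "EventRange": "1-100",   "DboardFirstInfoTimeStamp": 10, "status": "F"},
    {"TaskId": 1, "EventRange": "1-100",   "DboardFirstInfoTimeStamp": 20, "status": "S"},
    {"TaskId": 1, "EventRange": "101-200", "DboardFirstInfoTimeStamp": 15, "status": "R"},
]
# latest_attempts(rows) keeps the S and R rows only
```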

The third query fetches all the jobs of a selected task.

SELECT "TaskJobId", "EventRange", "Site", "started", "finished", "submitted",
       "resubmissions", "SchedulerJobId", status, "GridEndId", "GridEndReason",
       "JobExecExitCode", "AppGenericStatusReasonValue"
FROM (
  SELECT "TaskJobId", "EventRange", site."VOName" AS "Site",
         job_status("DboardJobEndId", "DboardStatusId", "DboardGridEndId") AS status,
         "SubmittedTimeStamp" AS "submitted",
         "StartedRunningTimeStamp" AS "started",
         "FinishedTimeStamp" AS "finished",
         job_resubmission("TaskJobId") AS "resubmissions",
         "SchedulerJobId",
         ROW_NUMBER() OVER (PARTITION BY "TaskId", "EventRange"
                            ORDER BY "DboardFirstInfoTimeStamp" DESC) AS n,
         "DboardGridEndId", "DboardGridEndId" AS "GridEndId",
         "JobExecExitCode", "AppGenericStatusReasonValue",
         generic_status_reason."GenericStatusReasonValue" AS "GridEndReason"
  FROM job, long_ce, short_ce, site, generic_status_reason,
       grid_status_reason, app_generic_status_reason
  WHERE job."NextJobId" IS NULL
    AND job."TaskId" = (SELECT "TaskId" FROM task WHERE "TaskMonitorId" = :taskMonId)
    AND job."LongCEId" = long_ce."LongCEId"
    AND short_ce."ShortCEId" = long_ce."ShortCEId"
    AND grid_status_reason."GridStatusReasonId" = job."GridStatusReasonId"
    AND grid_status_reason."GenericStatusReasonId" = generic_status_reason."GenericStatusReasonId"
    AND app_generic_status_reason."AppGenericErrorCode" = nvl(job."JobExecExitCode", -1)
    AND site."SiteId" = job."SiteId"
  ORDER BY TO_NUMBER("EventRange")
)

The fourth SQL query fetches task meta-information such as the task creation time, the version of the application used, the number of events per job and the input collection data.

select task."TaskId", task."TaskMonitorId", task."TaskCreatedTimeStamp",
       task_type."Type" as "TaskType", submission_tool_ver."SubToolVersion",
       application."Application", application."ApplicationVersion",
       task."NEventsPerJob", appl_exec."Executable",
       input_collection."InputCollection", submission_tool."SubmissionTool",
       submission_ui."DisplayName" as "SubmissionUI", "SubmissionType",
       "TargetCE", scheduler."SchedulerName" as "SchedulerName"
from task, task_type, task_status, submission_tool_ver, application,
     appl_exec, input_collection, submission_tool, submission_ui, scheduler
where task."TaskMonitorId" = :taskMonId
  and task_type."TaskTypeId" = task."TaskTypeId"
  and task."DefaultSchedulerId" = scheduler."SchedulerId"
  and task_status."TaskStatusId" = task."TaskStatusId"
  and application."ApplicationId" = task."ApplicationId"
  and appl_exec."ApplExecId" = task."ApplExecId"
  and input_collection."InputCollectionId" = task."InputCollectionId"
  and submission_tool."SubmissionToolId" = task."SubmissionToolId"
  and submission_ui."SubmissionUIId" = task."SubmissionUIId"
  and submission_tool_ver."SubToolVerId" = task."SubToolVerId"

The final SQL query fetches all the resubmission history for a selected job.

select "JobExecExitCode" as "JobExitCode",
       app_generic_status_reason."AppGenericStatusReasonValue" as "JobExitReason",
       "DboardGridEndId" as "GridEndId",
       "GenericStatusReasonValue" as "GridEndReason",
       "VOName" as "Site", "AppStatusReason",
       "SubmittedTimeStamp" as "submitted",
       "StartedRunningTimeStamp" as "started",
       "FinishedTimeStamp" as "finished",
       "EventRange", "SchedulerJobId"
from (
  select "JobExecExitCode", "DboardGridEndId", "GenericStatusReasonValue",
         "VOName", "SubmittedTimeStamp", "StartedRunningTimeStamp",
         "FinishedTimeStamp", "EventRange", "SchedulerJobId",
         replace("AppStatusReason", '''') as "AppStatusReason"
  from job, long_ce, short_ce, site, generic_status_reason,
       grid_status_reason, app_status_reason
  where "TaskJobId" = :taskJobId
    and job."LongCEId" = long_ce."LongCEId"
    and short_ce."ShortCEId" = long_ce."ShortCEId"
    and site."SiteId" = short_ce."SiteId"
    and app_status_reason."AppStatusReasonId" = job."JobExecExitReasonId"
    and grid_status_reason."GridStatusReasonId" = job."GridStatusReasonId"
    and grid_status_reason."GenericStatusReasonId" = generic_status_reason."GenericStatusReasonId"
) all_jobs
left join app_generic_status_reason
  on app_generic_status_reason."AppGenericErrorCode" = nvl(all_jobs."JobExecExitCode", -1)
order by "submitted"

5.3.3 GridSite Authentication

We have integrated the CMS Dashboard Task Monitoring with the GridSite library[xxx] to enable secure access to the information based on X.509 authentication. GridSite was originally a web application developed for managing and formatting the content of the GridPP website. Over the past three years it has grown into a set of extensions to the Apache web server and a toolkit for Grid credentials, GACL access control lists and HTTP(S) protocol operations. The sequence of actions can be seen in Figure 10.

The authentication module was developed after privacy concerns were raised that any user could view everyone's tasks and their progress. Another reason was to personalise the content shown to the user: when the user logs in to the application, the information presented automatically is focused on that user only, rather than on all existing Grid users.

The authentication module is optional and is not used by default; without it, everyone is effectively an administrator. When the module is enabled, the Grid Certificate must be loaded in the user's browser. If the client's Distinguished Name (DN) is available and matches an entry in the ADMIN_USERS table, the user is an administrator and we execute the following query, which fetches the full list of the users on the system.

userQuery = 'select distinct users."GridName" from users, task where users."UserId" = task."UserId" and task."TaskCreatedTimeStamp" > sysdate - 31 and task."TaskTypeId" in (select "TaskTypeId" from task_type where "Type" in (\'analysis\', \'JobRobot\')) order by users."GridName"'

Otherwise, the user is not an administrator and authentication is enforced. We execute the following query so that the user only sees his or her own jobs.

userQuery = 'select distinct users."GridName" from users, task where users."GridCertificateSubject" = :clientDNstring and users."UserId" = task."UserId" and task."TaskCreatedTimeStamp" > sysdate - 31 and task."TaskTypeId" in (select "TaskTypeId" from task_type where "Type" in (\'analysis\', \'JobRobot\'))'

5.3.4 Advanced Graphical Plots

Graphical plots were developed to give the physicist a more usable and attractive user interface and to visually represent the data contained in an analysis operation. The GraphTool Python library was used to create the plots. The sequence of actions for the generation of a graphical plot is illustrated in Figure 11.

The following code is from the GraphicalOverviewPyPlot Python class, which creates a simple graphical overview plot. We have patched and extended the library to support custom colouring of the legends via the 'color_override' option. The patches are available in Appendix XXX.

Implementation of GraphicalOverviewPyPlot (license: Apache License 2.0):

import os, time
from mod_python import util
from dashboard.common import log as logging
from dashboard.common import xml
from dashboard.common.Config import Config
from dashboard.http.View import View
from graphtool.graphs.graph import Grapher
from graphtool.graphs.common_graphs import PieGraph
from dashboard.common.InternalException import InternalException
from dashboard.http.actions.job.argument_filtering import filter_job_arguments

class GraphicalOverviewPyPlot(View):
    """version: $Id: GraphicalOverviewPyPlot.py,v 1.1.2.7 2009/01/29 19:56:33 ekaravak Exp $"""

    _logger = logging.getLogger("dashboard.http.views.job.task.GraphicalOverviewPyPlot")

    def __init__(self, attributes):
        super(GraphicalOverviewPyPlot, self).__init__(attributes)

    def generate(self, actionCtx, request):
        # Get the summaries collected by the Action
        summaries = actionCtx.get("summaries")
        parameters = filter_job_arguments(request.args)
        data = {'Pending': summaries[0][0]['PENDING'],
                'Running': summaries[0][0]['RUNNING'],
                'Successful': summaries[0][0]['SUCCESS'],
                'Failed': summaries[0][0]['FAILED'],
                'Unknown': summaries[0][0]['TERMINATED']}
        metadata = {'title': 'Graphical Overview',
                    'color_override': {'Pending': '#FEFE98', 'Running': '#CCCCFE',
                                       'Successful': '#98CB98', 'Failed': '#FF0000',
                                       'Unknown': '#DDFEAA'},
                    'title_size': 10, 'text_size': 8}
        pieJobs = PieGraph()
        file = request
        # Return the plot to the request
        self._logger.debug('Returning the plot to the request')
        pieJobs(data, file, metadata)
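The data and metadata dictionaries built inside generate() can be factored into a small helper, shown here as an illustrative refactoring (this helper is not part of the actual class):

```python
# Status -> colour mapping used via the patched 'color_override' option.
STATUS_COLOURS = {'Pending': '#FEFE98', 'Running': '#CCCCFE',
                  'Successful': '#98CB98', 'Failed': '#FF0000',
                  'Unknown': '#DDFEAA'}

def pie_input(summary):
    """Map one summary row (as produced by the Action) to PieGraph input."""
    data = {'Pending': summary['PENDING'], 'Running': summary['RUNNING'],
            'Successful': summary['SUCCESS'], 'Failed': summary['FAILED'],
            'Unknown': summary['TERMINATED']}
    metadata = {'title': 'Graphical Overview',
                'color_override': STATUS_COLOURS,
                'title_size': 10, 'text_size': 8}
    return data, metadata

# Illustrative summary row:
data, metadata = pie_input({'PENDING': 1, 'RUNNING': 2, 'SUCCESS': 3,
                            'FAILED': 4, 'TERMINATED': 0})
```

Keeping the status-to-colour mapping in one place ensures every plot in the application uses the same colour for the same job state.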

The application offers a wide variety of graphical plots; these plots are illustrated in the next section.

5.3.5 User Interface and Monitoring Features

Task Monitoring provides monitoring functionality regardless of the job submission method or the middleware flavour, and it works transparently across various Grid infrastructures; this is why it is so heavily used by analysis users [8][9]. It is easy to understand and to navigate, and its layout is simple, clean and intuitive, containing no unnecessary information, as illustrated in Figure 12.

A snapshot of the user interface can be seen in Figure 12. The user can also retrieve the result of this table in the XML format by using the following command:

curl -H 'Accept: text/xml' 'http://dashboard02.cern.ch/dashboard/request.py/taskstable?typeofrequest=A&timerange=last3Days&usergridname=USERNAME' > /tmp/actions.xml

The XML output is hard to read because it contains no line breaks. It can be reformatted with the 'xmllint' command:

xmllint --format /tmp/actions.xml
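The same reformatting can also be done without xmllint, using Python's standard library; the XML input below is a toy example, not actual Dashboard output.

```python
from xml.dom.minidom import parseString

# Single-line XML, as returned by the Dashboard HTTP interface.
raw = '<tasks><task id="1"/><task id="2"/></tasks>'

# toprettyxml() inserts the missing line breaks and indentation.
pretty = parseString(raw).toprettyxml(indent="  ")
print(pretty)
```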

Clicking on the information link next to the name of the task provides meta-information such as the input dataset, the versions of the software used by the task and of the submission tool, and the task creation time. Clicking on the number of jobs corresponding to a given status provides detailed information about all the jobs of the selected category (Figure 13).

The user can also retrieve the result of this table in the XML format by using the following command:

curl -H 'Accept: text/xml' 'http://dashboard02.cern.ch/dashboard/request.py/taskjobs?timerange=time_range&what=all&taskmonid=task_name' > /tmp/actions.xml

The available time range periods are 'lastDay', 'last2Days', 'last3Days', 'lastWeek', 'last2Weeks' and 'lastMonth'. The XML output can again be reformatted with the 'xmllint' command:

xmllint --format /tmp/actions.xml

Clicking on any name in the 'Site' column opens the Site Status Board for the CMS sites[10], which provides the 24-hour status availability of the selected site, allowing the user to identify any problematic site and blacklist it from resubmissions (Figure 9).

Also, clicking on the 'Retries' column provides the detailed resubmission history of every job, which can be very useful for debugging. An example can be seen in Figure 10: the job produced an output on the Storage Element (SE) but the stage-out finished with an error (exit code 60307); thus all following resubmissions had no chance to succeed, since the file had already been created on the SE (exit code 60303). Before any further resubmission, the output file generated by the previous attempt should be removed from the SE.

Currently, the strongest point of the application is the failure diagnostics for application failures. It is extremely useful to get not only the exit code of the failed job, which can sometimes be misleading, but also a detailed reason for the failure, e.g. 'Could not save output file A on the storage element B'.

The ideal goal is to reach a point where a user does not have to open the log file and search for what went wrong with the job; the user could get everything from the monitoring tool. An example can be seen in Figure 16.

The application offers a wide variety of graphical plots that visually assist the user in understanding the status of the task. These plots show the distribution by site of successful, failed, running and pending jobs, as well as of the processed events (Figure 12a), and they can help identify any problematic site and blacklist it from further resubmissions (Figure 12b).

They also show the terminated jobs in terms of success or failure over the time range during which the task has been running (Figure 12c). In the case of failure, the distribution by reason is shown, whether Grid-aborted or application-failed jobs (Figure 12d).

Figure 12. Graphical Plots: a) Processed Events over Time, b) Terminated Jobs by Site, c) Terminated Jobs over Time, d) Reason of Failure

Various kinds of consumed-time plots are available, such as the distribution of CPU and Wall Clock time spent on successful and failed jobs and the average efficiency distributed by site (Figure 13). These plots help the user see how the CPU time per event and the efficiency vary depending on the site where the jobs run. The user gets information regarding the time consumed by a specific task or a given job.

For any given task (Figure 14), the following information is available: the average efficiency of the task, the total and average CMSSW CPU and job-wrapper Wall Clock time usage, and the average CPU time spent per event. The average efficiency per task is the ratio of the total CPU time of its jobs to their total Wall Clock time.

At the job level, the user gets information about the efficiency of every single job separately (Figure 15). The processing efficiency per job is the ratio of the job's CPU time to its Wall Clock time.
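Assuming the standard definition of processing efficiency as CPU time divided by Wall Clock time (the formulas themselves are not reproduced here), the two quantities can be sketched as a pair of helpers; the function names and sample numbers are ours.

```python
def job_efficiency(cpu_time, wallclock_time):
    """Processing efficiency of a single job: CPU time / Wall Clock time."""
    return cpu_time / wallclock_time

def task_efficiency(jobs):
    """Average task efficiency: total CPU time over total Wall Clock time,
    where jobs is a list of (cpu_time, wallclock_time) pairs."""
    total_cpu = sum(cpu for cpu, wc in jobs)
    total_wc = sum(wc for cpu, wc in jobs)
    return total_cpu / total_wc
```

A job that spent 50 s of CPU in a 100 s wall-clock slot has efficiency 0.5; a task whose jobs consumed (50, 100) and (30, 100) has average efficiency 80/200 = 0.4.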

A selection of snapshots of the application can be seen in Figure 16.

5.4 Experience of the CMS User Community with Task Monitoring

In the CMS community, the CMS Remote Analysis Builder (CRAB)[12] is used for job submission. CRAB is a Python programme that simplifies the creation and submission of CMS analysis jobs to the Grid environment. CRAB can be used in two ways: i) as a standalone application and ii) with a server.

The standalone mode is suited to small tasks; it submits the jobs directly to the scheduler, and these jobs are under the user's responsibility. In the server mode, suited to larger tasks, the jobs are prepared locally and then passed to a dedicated CRAB Server, which interacts with the scheduler on behalf of the user and performs additional services such as automatic resubmission and output retrieval.

Rather often, Task Monitoring discovers previously undetected problems with the CRAB Server or the Workload Management Systems (WMS). The Dashboard reports a job as 'finished' as soon as the job finishes on the worker node, whereas the job status updates coming through the Grid services can be delayed, quite often due to a component of the CRAB Server or to problems in the WMS or in the Logging and Bookkeeping system (LB). Thus, when users see a large delay between the status shown in CRAB and the status shown in Task Monitoring, they report the problem; after investigation, either the CRAB Server is fixed or the faulty WMS is blacklisted.

We performed a publicity campaign to raise awareness of the Task Monitoring application within the CMS user community, collect feedback, assemble further requirements and identify weaknesses in the application.

The results of our publicity campaign are available in Appendix X along with their feature requests. Two hundred analysis users were contacted via e-mail and fifty of them replied and provided very positive feedback:

  • 'A nice surprise to see this tool live and working!'
  • 'It's easy to navigate and provides useful information regarding failed jobs'.
  • 'The monitoring tool is great! I'll now investigate why most of my jobs are failing'.
  • 'Perfect! It even updates in real-time! It is particularly nice to be able to see how many events were processed'.
  • 'Indeed very helpful tool to monitor the progress of my grid activities'.
  • 'Excellent! It has low latency and excellent plots with clear labels. I'm surprised that it also supports the condor scheduler. Thanks for making this happen'.

According to our web statistics[8][9], more than one hundred distinct analysis users use Task Monitoring for their everyday work, as illustrated in Figure 17. The Dashboard Applications Usage Statistics programme was developed to count the daily number of distinct users of a selected set of CMS Dashboard applications. To count the distinct daily users, the daily access_log file of the Apache HTTP web server is parsed.

The following bash commands, embedded in a Python programme, determine the date of the month and the total number of distinct daily users of the selected applications, counting unique visitor IPs.

# Command to get the date of the month:

getDate = "zgrep +0 /var/log/httpd/access_log.1.gz | awk '{print $4}' | uniq | head -n 1 | cut -c 2-13"

# Commands for the usage of the following applications:

TaskMon = "zcat /var/log/httpd/access_log.1.gz | grep taskmonitoring | awk '{print $1}' | sort | uniq | wc -l"

TaskMonCRAB = "zcat /var/log/httpd/access_log.1.gz | grep taskmon.html | awk '{print $1}' | sort | uniq | wc -l"

The "TaskMon" command counts the total number of distinct users of the application, while the "TaskMonCRAB" command counts the total number of distinct CRAB users accessing the application directly from the CRAB status output. The following cron entry was scheduled to run the programme daily at 06:00 to update the statistics.

0 6 * * * python /usr/share/dashboard-stats/dashb_stats.py >> /var/log/script_output.log 2>&1

The Graphtool library was then used to create the plot of the programme, which is available at http://lxarda18.cern.ch/usage.html.

5.5 Known Issues and Future Work

The overall improvement of the Task Monitoring application strongly depends on the completeness of the job monitoring information in the Dashboard data repository. One of the known issues is the incomplete information regarding the Grid status of the jobs. Currently, only information from the Imperial College Real Time Monitoring (ICRTM) is used for this purpose. Unfortunately, only a fraction of the CMS jobs are monitored by the ICRTM[4]: jobs submitted via Condor_G[11] or Condor glide-ins escape it, and not all Workload Management Systems (WMSs) are monitored by it. A considerable development effort, driven by the Dashboard team, aims to improve this situation by instrumenting the Logging and Bookkeeping (LB) system to publish job status changes to the Messaging System for the Grid (MSG). This information will then be available to all interested clients, including the Dashboard. Condor_G is also being instrumented to report job status changes to the MSG.
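A status-change publication of the kind described above could be built as a small self-describing record per transition. The sketch below only shows how such a record might be assembled; the field names are illustrative assumptions, since the actual MSG message schema is defined by the Dashboard and MSG teams, and the transport to the broker (e.g. over STOMP) is omitted.

```python
import json
import time

def job_status_message(job_id, task_id, site, old_state, new_state):
    """Build a job status-change record of the kind that could be published
    to the MSG broker on every transition. All field names here are
    illustrative assumptions, not the actual MSG schema.
    """
    return json.dumps({
        "jobId": job_id,          # Grid job identifier
        "taskId": task_id,        # task the job belongs to
        "site": site,             # site where the job runs
        "oldState": old_state,    # state before the transition
        "newState": new_state,    # state after the transition
        "timestamp": int(time.time()),  # when the transition was observed
    })

# A consumer such as the Dashboard would parse each record and update
# its data repository:
# record = json.loads(job_status_message("job_1", "task_A",
#                                        "T2_CH_CERN", "running", "finished"))
```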

We plan to develop a search functionality that will allow the user to look up a specific task or job using a search pattern. Finally, we are working on a script that automatically generates a set of commands for processing a set of jobs, such as resubmission, killing, retrieving logging information and retrieving the output. These commands will be provided both in CRAB format and in the formats of the various underlying middleware. The user could select a subset, or all, of the failed jobs of a given task and download a single command file that performs the resubmission automatically. This feature is needed when CRAB's task directory is unavailable for some reason and the user cannot use the CRAB UI to manipulate the jobs of the task. Due to the success of the CMS Dashboard Task Monitoring application within the CMS community, there are plans to adapt it to the ATLAS VO as well.
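The command-generation idea above can be sketched as follows. This is a minimal illustration, not the planned implementation: it assumes a mapping from job number to terminal status and emits a single resubmission command in the CRAB-2-style `crab -resubmit <jobs> -c <taskdir>` syntax, which should be treated as indicative of the CLI rather than authoritative.

```python
def resubmit_command(task_dir, job_statuses):
    """Generate one CRAB resubmission command for the failed jobs of a task.

    `job_statuses` maps job number -> terminal status string, as it might be
    extracted from the Dashboard data repository (an assumption for this
    sketch). Returns None when there is nothing to resubmit.
    """
    failed = sorted(job for job, status in job_statuses.items()
                    if status == "failed")
    if not failed:
        return None
    job_list = ",".join(str(job) for job in failed)
    # CRAB-2-style command line; the real script would also emit the
    # equivalent commands for the underlying middleware.
    return "crab -resubmit %s -c %s" % (job_list, task_dir)
```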

5.6 Summary

While the existing monitoring tools are coupled to a specific middleware, Task Monitoring provides monitoring functionality regardless of the job submission method or the middleware platform, offering a complete and detailed view of the user's tasks including failure diagnostics, processing efficiency and resubmission history.

The monitoring tool has become very popular among the CMS users. According to our web statistics[8][9], more than one hundred distinct analysis users are using it for their everyday work. Close collaboration with several CMS users resulted in the tool being focused on their exact monitoring needs.

References

[1] Experiment Dashboard, http://dashboard.cern.ch

[2] ARDA Dashboard Developer's Guide, http://dashb-build.cern.ch/build/nightly/doc/guides/common/html/dev/index.html

[3] J. Andreeva et al. 'Experiment Dashboard: the monitoring system for the LHC experiments'. In GMW'07: Proceedings of the 2007 workshop on Grid monitoring, ACM

[4] J. Andreeva et al. 'New Job Monitoring Strategy on the WLCG Scope'. In CHEP'09: Proceedings of the 2009 International Conference on Computing in High Energy and Nuclear Physics, IOP

[5] Imperial College Real Time Monitoring (ICRTM), http://gridportal.hep.ph.ic.ac.uk/rtm/

[6] Monitoring Agents Using a Large Integrated Services Architecture (MonALISA), http://monalisa.cern.ch/monalisa.html

[7] P. Saiz et al. 'Grid Reliability'. In CHEP'07: Proceedings of the 2007 International Conference on Computing in High Energy and Nuclear Physics, IOP

[8] Dashboard Production Server Statistics, http://lxarda18.cern.ch/awstats/awstats.pl?config=lxarda18.cern.ch

[9] Dashboard Application Usage Statistics, http://lxarda18.cern.ch/usage.html

[10] Dashboard Site Status for the CMS Sites, http://dashb-ssb.cern.ch/ssb.html

[11] Condor-G, http://www.cs.wisc.edu/condor/condorg/

[12] CMS Remote Analysis Builder (CRAB), https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideCrab
