Record of Loss of Data on Shifts

From NPDGamma Wiki

Dear Collaborators: As we know, our 1E-8 asymmetry goal for hydrogen will require several weeks of data taking. Because the number of new shift workers is poised to increase rapidly starting in May, the odds are strong that the efficiency of data taking will be reduced through mistakes by inexperienced shift workers. To speed the collaboration's learning curve toward high-efficiency shift work, I have started this list of examples of loss of data, starting with myself. I ask other people who take shifts and experience events that lead to loss of data to record their experiences here, analyze the event, and offer suggestions for changes that can eliminate or minimize the failure mode.

Use previous entries as a template for making new entries to this list. Also, please list events with the most recent at the top.

June 2012

shifts: 6-16-12 to 6-17-12: various

submitted by: M. Snow

loss of data event: decision to stop taking data because of low/unstable beam power

description: Near the end of the run SNS decided to come up to only about 1/2 power. Shift workers cited this as a justification for not taking data. The first reaction of any red-blooded physicist is to recoil in horror at the idea that no useful information can be obtained in this situation.

analysis: The assertion that no useful data can be taken at half power is at least debatable. It is true that it is unlikely that such data could be analyzed for a hydrogen asymmetry. However there could be some quite useful information to be derived from continuous operation at a power significantly lower than normal. For example: such data could be useful to confirm our understanding of various backgrounds in the detector and also possibly our understanding of noise in the detector asymmetries.

suggestions for changes: It is likely that such events will also occur in the future. We should think about what program of measurements would be useful in such situations and have a plan in place to take advantage of them.

ACTION:

May 2012

Owl Shift: 5-31-12

shift: 5-31-12: midnight-8AM shift (Z. Tang)

submitted by: M. Snow

loss of data event: magnetic field out of range, ~3.5 hours of data lost

description: B field value changed around 4:45am from -9.40 Gauss to -9.42 Gauss. Then the current in the hydrogen DAQ display changed from a normal value of 22.94 A to 23.02 A on display (22.94 A on hydrogen DAQ computer). This changed the field down to -9.402 and 9.368. The lost data results from the delay in noticing the field change combined with the response time for changing the configuration.

analysis: ??? (why did the field change?)

suggestions for changes: Obviously need to understand and fix whatever caused the field to change. More vigilance on the magnetic field during shifts can help response time. Adding the B field to the data stream would also be a positive thing to do since manual observation of field values makes it hard to choose which runs might be bad.

ACTION: B field values are now being read into the data stream.
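
With the field in the data stream, runs can be screened automatically rather than by manual observation. The following is only a minimal sketch of that idea, not the collaboration's analysis code: the tolerance value, the run numbers, and the function name `runs_out_of_range` are all assumptions made for illustration. Only the nominal value of -9.40 Gauss comes from the description above.

```python
from typing import Dict, List

NOMINAL_FIELD_GAUSS = -9.40  # nominal value quoted in the description above
TOLERANCE_GAUSS = 0.01       # ASSUMED tolerance, chosen only for illustration


def runs_out_of_range(
    field_by_run: Dict[int, float],
    nominal: float = NOMINAL_FIELD_GAUSS,
    tolerance: float = TOLERANCE_GAUSS,
) -> List[int]:
    """Return the sorted run numbers whose field reading differs from
    the nominal value by more than the tolerance."""
    return sorted(
        run
        for run, field in field_by_run.items()
        if abs(field - nominal) > tolerance
    )


# Hypothetical run numbers paired with field readings in Gauss.
readings = {1001: -9.40, 1002: -9.402, 1003: -9.42}
print(runs_out_of_range(readings))  # prints [1003]
```

The real acceptance window should be set by whoever understands the spin flipper resonance condition; the sketch only shows how a per-run field value makes bad runs easy to pick out.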

Day Shift: 5-29-12

shift: 5-29-12: 8AM-4PM shift (J. Calarco)

submitted by: M. Snow

loss of data event: hydrogen compressor failure, no loss of data through pure luck, we would have lost ~5 hours of data if the beam had been on

description: At 9AM the compressor for the lower refrigerator on the hydrogen target failed. Experts were called and it was restarted shortly thereafter. However, the compressor stopped again around 11AM. The decision was made to replace the compressor with the backup compressor. As of this writing the compressor appears to be working and the target should be ready to take data when the maintenance break is over around 4PM today, since the recovery time for the temperatures and pressures from these brief events looks to be ~2 hours.

analysis: system for getting experts on the scene worked as designed. The origin of the problem with the compressor is not yet understood: its two shutdowns seem spontaneous. If we had not had a backup compressor we would have had to try to operate the target with only the upper two refrigerators. We do not yet know if it would be possible to take data in such a mode.

suggestions for changes: this event emphasizes the need to have ready backup compressors for the system. We have no backup for the Cryomech compressor. We had been looking into this already a couple of weeks ago in anticipation of this potential issue.

ACTION: get specific, detailed info on a local backup compressor for the Cryomech refrigerator and fix the CVI compressor ASAP

Owl Shift: 5-29-12

shift: 5-29-12: midnight-8AM shift (Z. Tang)

submitted by: M. Snow

loss of data event: loss of magnetic field, lost ~3 hours of data

description: At 5AM the magnetic field went down. Experts were called. A wire connector from the "trim" portion of the B field winding was burned from the power supply's attempt to maintain constant current and the power supply was dead. SNS technicians were called to help install the backup power supply which had been previously identified. It was installed in the morning. Kyle worked on getting the monitoring system for the magnetic field current. The only reason why more data was not lost was because of the scheduled 8AM facility maintenance break: otherwise we would have lost ~6 hours.

analysis: It seems that the connector that burned was not rated for the current in the wire. This event had happened once before, about one year ago, but at that time the connector was replaced by the same type, and since the problem looked to be "solved" for several months there was no effort to properly replace the connector with one that had the correct current rating. So this event was in principle preventable but was clearly a lower priority for fixing in comparison with other issues.

A possible hint of the problem may have been observed a few days before by previous shift workers, who noticed some intermittent noise on the magnetic field (although not enough to cause a problem for data taking). It is possible that this might have been a precursor to the bad connection coming undone.

suggestions for changes: the connector has been replaced with one which is rated to take the current.

ACTION: double-check all the rest of the connectors on the B coils to make sure that all are rated for the currents that they must carry.

Day Shift: 5-28-12

shift: 5-28-12: 8AM-4PM shift (M. Snow)

submitted by: M. Snow, clarifications by Kyle

loss of data event: disk changeover, lost ~0.5 hours of data

description: I successfully followed the directions for changing the disk and restarted the computer. I tried two test runs which were not successful. I called Kyle and he used a special permission to fix the problem.

analysis: Here I append Kyle's description of the problem:

I formatted all of the old 500 GB drives and this also happened once or twice and hasn't happened in ~10 months. I did something (or maybe I didn't?) that fixed it when I formatted the rest, but I don't remember what it was that fixed it or if I even did anything. The issue is that the drive mounted so that only root could touch it. Its permissions were 755 in Linux terms, which means the owner (root) can read, write, and execute, and everyone else can read and execute. I simply changed it so that all users have rwx, or 777. For obvious reasons, data is not being taken by a user with root privileges, so a normal user can't do this without special permission.
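
For shift workers unfamiliar with the octal notation in Kyle's message, the short sketch below decodes the two modes with Python's standard `stat` module. The helper names `describe` and `others_can_write` are made up for this illustration and are not part of any DAQ software.

```python
import stat


def describe(mode: int) -> str:
    """Return the ls-style permission string for a directory with these mode bits."""
    return stat.filemode(stat.S_IFDIR | mode)


def others_can_write(mode: int) -> bool:
    """True if users other than the owner and group have write permission."""
    return bool(mode & stat.S_IWOTH)


print(describe(0o755), others_can_write(0o755))  # prints drwxr-xr-x False
print(describe(0o777), others_can_write(0o777))  # prints drwxrwxrwx True
```

With mode 755 on a root-owned mount point, a non-root account can list and read the disk but cannot create files on it, which is consistent with the failed test runs described above.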

suggestions for changes: Again from Kyle's message:

Here's where special permission comes in... I've written a simple script that runs the command to fix this. I've added permission for shift takers, or rather the account that is always logged in on clover, to run this script and only this script as sudo without a password. This script will need to be run at first boot after changing the drive. I've tested it and it works as intended. I'm going to email Nadia about it so she's aware such a script exists before I add it to the list of things to do when changing a drive.
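
Kyle's script itself is not reproduced on this page. Purely as an illustration of what such a fix amounts to, the sketch below sets mode 777 on a directory and reports the resulting mode bits. The function name `make_world_writable` is hypothetical, and a throwaway temporary directory stands in for the data disk, whose real mount point is not given here. On the real disk the change would need the sudo permission described above.

```python
import os
import stat
import tempfile


def make_world_writable(mount_point: str) -> int:
    """Give owner, group, and others rwx on mount_point and
    return the resulting permission bits."""
    os.chmod(mount_point, 0o777)
    return stat.S_IMODE(os.stat(mount_point).st_mode)


# Demonstration on a temporary directory (a stand-in for the data disk).
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o755)  # reproduce the problem state
    print(oct(make_world_writable(demo_dir)))  # prints 0o777
```

Restricting the sudo permission to one fixed script, as Kyle describes, keeps the shift account from running anything else as root.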

ACTION: Hard disk change procedure changed. Shift takers are now required to run a script that assures that the disk is writable.

Day Shift: 5-27-12

shift: 5-27-12: 8AM-4PM shift (M. Snow)

submitted by: M. Snow

loss of data event: DAQ froze and VME2 crashed, lost ~1.5 hours of data

description: DAQ was frozen and we were not taking data. I called Kyle who verified that VME2 had crashed. He sent Zhaowen out to help since I was not authorized to enter the cave to restart VME2. Zhaowen arrived and fixed the problem.

analysis: If I had access to the cave I could have entered and tried to follow the shift taker instructions to get VME2 going again, but odds are that due to unfamiliarity I would have ended up doing something stupid.

suggestions for changes: none

Day Shift: 5-22-12

shift: 5-22-12: 8AM-4PM shift (H. Nann)

submitted by:

loss of data event: runs not restarted after HV disconnection for LED runs

description:

analysis:

suggestions for changes: One thing that is important for shift workers to know is what is the default run plan for the day.

ACTION: we now have a wiki entry with the day-to-day default run plan

Evening Shift: 5-19-12

shift: 5-19-12: 4PM-midnight shift (A. Barzilov)

submitted by:

loss of data event: DAQ crash, vme3 bad, lost ??? minutes of data

description: runs 80375-80387 DAQ crash, vme3 was bad, Kyle had to be called in to fix it

analysis:

suggestions for changes:

Owl Shift: 5-19-12

submitted by: Z. Tang

loss of data event: runs 80267-80277 vme2 gains were not reset, losing ~70 minutes of data

description:

analysis:

suggestions for changes:

Day Shift: 5-16-12

submitted by: Mike Snow

loss of data event: ~5 hours of lost data on Cl from magnetic field failure

description: Around 11:30-12 we were finishing the 3He polarimetry/spin flipper efficiency measurement and we were inserting the Cl target in the spin flipper for an asymmetry measurement. Unnoticed by the shift supervisor, one of the secondary power supplies for the magnetic field failed. This failure went unnoticed until the end of the shift at 4PM when, as instructed, the shift supervisor was filling out the standard log entries and noticed that the B field computer display was down. The shift supervisor called the expert on call, who revived the B field DAQ, which showed that the field had dropped from ~9.4G to 8.7G, thereby destroying the spin flipper resonance condition and rendering all the data useless. Experts arrived, entered the cave, and fixed the power supply. Data taking was resumed about 1 hour after the event was discovered. The terminals of the power supply, which had never failed before, were later cleaned as described in the shift log.

analysis: The shift supervisor had been instructed on the procedures for shift taking and had been shown the list of items that was on the checklist. However, since the row associated with recording the B field on the shift log paper was narrow and for some reason previous shift workers had not been entering the B field values, the shift supervisor simply didn't notice the B field item at first glance. Only when looking at the list in detail did the shift worker notice the B row needed to be filled in. The shift worker was aware that an incorrect B field would render the data useless but made no effort to proactively ask where the B field info was displayed before being forced to do so by the end-of-shift requirement to record values.

suggestions for changes: There are various ways that this loss of data could have been prevented. One is to require shift workers to record the critical data AT THE BEGINNING of the shift, not at the end as was the practice at the time of my instruction. Another would have been to more prominently emphasize visually the critical importance of the B value on the log entry form. A third might be to consider either removing or deemphasizing the requested data entry in the existing shift list associated with the current in the main magnetic field coils, which is presently being fed into the hydrogen target DAQ. Since more than one power supply creates the field, this number alone does not determine the total field, and its constancy can lull the shift worker into a false sense of security that the field is "OK".

More generally: the list of items to be recorded by a shift worker must include all those parameters which are required to ensure that we are taking valid data and we need to review the list to make sure that we are doing this.

ACTIONS: production run checklist was modified/improved. Shift takers now record production list and hydrogen list at the beginning and the end of the shift
