UDA Disk Systems. - Part 2 Of 2.

CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (11/08/87)
Information From TSO Financial - The Saga Continues...
Chapter 32, Part 2 Of 2 - November 7, 1987


         The last item of the installation was to become knowledgeable of
    the terms of the field service contract that is bundled into the
    purchase. The call window is 9:00 AM to 5:00 PM, Monday to Friday. The
    window can be expanded at additional cost to whatever is required. The
    time lag from placing a call for service to having someone here working
    is to be less than four (4) hours. I am told that these terms can be
    different according to your location and the distance to the nearest SI
    field service office. It needs to be noted here that I will not have DEC
    or other vendors perform the maintenance on these drives for a long time.
    My reason is that the controller is brand new to the market and still
    receiving significant fixes and upgrades. I feel the ONLY way to get
    these fixes in a timely manner is to have the manufacturer perform the
    maintenance. It also helps when you concern yourself with the sparing
    issue at the local office.

         With that done, everyone cleared out and I put the drives to
    immediate use by running a program I wrote some time ago to heavily
    load a disk with I/O. The program is designed such that a 400,000 block
    file is created and random length records, 1 to 32,000 bytes, are
    written then read from the file on a purely random basis. After the
    write is complete, a read of the same information at the same location
    on disk is performed and the information is compared with what was
    written to ensure that there were no errors in writing to the disk. This
    process continues until the program is aborted or more than 5,000,000
    write operations to the disk are completed. Status report lines are
    generated every so often to indicate how the test is progressing.
    Should an error occur, all pertinent information is printed out and the
    operation is retried up to ten (10) times before moving on. The result
    is a disk that physically feels like it's attempting to rip its heart
    out. Two copies, for a total of 800,000 blocks of usage, were run on
    two different unloaded CPUs, one an 8700, the other an 8500. The SPM
    package from DEC accumulated performance statistics during this time
    period and the following results tell the comparison between various
    types of disks. The 'response time' stat has been defined to me as the
    time it takes an I/O request to travel from the VAX bulkhead CI
    connection, through the HSC, out to the drive, have it performed, and
    back to the bulkhead. This therefore includes any time needed by the
    HSC to set up the request for the disk. The tests on the SI93C drives
    were done on a VAX 8350, with the attached processor enabled.
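
         The test loop described above can be sketched roughly as follows.
    This is a minimal Python illustration, not the original program: the
    function name, its parameters, the 512 byte block size and the status
    report interval are all assumptions made for the example. Note also
    that on a modern operating system the read-back may be satisfied from
    cache rather than from the disk surface.

```python
import os
import random

BLOCK_SIZE = 512      # bytes per disk block (assumed for this sketch)
MAX_RECORD = 32_000   # largest record size, from the description above
MAX_RETRIES = 10      # retries after a failed operation, from the description


def stress_file(path, file_blocks=400_000, max_writes=5_000_000,
                report_every=100_000, seed=None):
    """Write random-length records at random offsets, read each one back
    and compare it. Returns (writes_attempted, failures_seen)."""
    rng = random.Random(seed)
    file_size = file_blocks * BLOCK_SIZE
    writes = failures = 0

    # Create the test file at full size so every offset inside it is valid.
    with open(path, "wb") as f:
        f.truncate(file_size)

    with open(path, "r+b") as f:
        while writes < max_writes:
            length = rng.randint(1, min(MAX_RECORD, file_size))
            offset = rng.randint(0, file_size - length)
            data = os.urandom(length)

            # One initial attempt plus up to MAX_RETRIES retries.
            for attempt in range(1 + MAX_RETRIES):
                try:
                    f.seek(offset)
                    f.write(data)
                    f.flush()
                    os.fsync(f.fileno())      # push the write out to the device
                    f.seek(offset)
                    if f.read(length) == data:
                        break                  # verified, move on
                    reason = "data mismatch"
                except OSError as exc:
                    reason = str(exc)
                failures += 1
                print(f"error: {reason}; offset={offset} "
                      f"length={length} attempt={attempt}")

            writes += 1
            if writes % report_every == 0:
                print(f"{writes} writes done, {failures} failures so far")

    return writes, failures
```

         A call such as stress_file("scratch.dat", file_blocks=2000,
    max_writes=1000) gives a quick trial run. The defaults mirror the sizes
    quoted above and would run for a very long time.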


                 +---------- Shadow Disk Statistics ---------+
                 !                    Serv    Resp           !
                 !            Rate    Time    Time   Queue   ! 
                 !            (/s)    (ms)    (ms)   Length  !
                 !           ------  ------  ------  ------  !
                 ! RA$81       14.9      67     131     2.0  ! 
                 ! SI$83C      16.6      60     117     1.9  !
                 ! SI$93C      17.5      56     101     1.8  !
                 +-------------------------------------------+

                 +-------- Separate Disk Statistics ---------+
                 !                    Serv    Resp           !
                 !            Rate    Time    Time   Queue   !
                 !            (/s)    (ms)    (ms)   Length  !
                 !           ------  ------  ------  ------  !
                 ! RA$81       16.8      60     117     2.0  !
                 ! SI$83C      19.2      52     100     1.9  !
                 ! SI$93C      20.9      48      88     1.8  !
                 +-------------------------------------------+

                                    Table B
               Comparison Of Disk Performance Using Test Program


         For both the tables shown above, it needs to be noted that there
    was always an I/O request that was waiting to execute on the drive and
    therefore no 'idle' time as far as the drive was concerned. This is
    shown in the 'Queue Length' value for each test. The other item to note
    here is that the 'Response Time' values for the shadow disks are larger
    than the values for the separate disks. The basis for this, I feel, is
    that while the shadow disks were on different HSC requestor cards, the
    test program performs a read after EVERY write. The result is that both
    members of the shadow set have to complete the previous write operation
    and then the HSC does a comparison between them to see which could
    provide the information faster on the following read. The increase that
    is shown, therefore is the time to completely update both shadow
    members and do the comparison. I also feel that these numbers represent
    the worst case scenario that you should encounter, unless both shadow
    members are on the same requestor. The point to remember with shadow
    sets is that the 'win' situation is with I/O that is largely read
    requests instead of write requests, and the test program is exactly
    opposite this.
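
         The figures in Table B can be cross-checked with a little
    arithmetic. The sketch below assumes that the 'Queue Length' column
    counts all requests outstanding at the drive (in service plus waiting),
    in which case Little's Law says it should roughly equal the request
    rate multiplied by the response time. That reading of the SPM column is
    an assumption, and the function names are made up for this example. The
    same data also gives the relative increase in response time for the
    shadow sets.

```python
# Values copied from Table B:
# name: (rate per second, service time ms, response time ms, queue length)
SHADOW = {
    "RA$81":  (14.9, 67, 131, 2.0),
    "SI$83C": (16.6, 60, 117, 1.9),
    "SI$93C": (17.5, 56, 101, 1.8),
}
SEPARATE = {
    "RA$81":  (16.8, 60, 117, 2.0),
    "SI$83C": (19.2, 52, 100, 1.9),
    "SI$93C": (20.9, 48, 88, 1.8),
}


def implied_queue_length(rate_per_s, response_ms):
    """Little's Law: outstanding requests = arrival rate * time in system."""
    return rate_per_s * response_ms / 1000.0


def shadow_penalty_percent(name):
    """Relative increase in response time, shadow set versus separate disk."""
    shadow_ms = SHADOW[name][2]
    separate_ms = SEPARATE[name][2]
    return 100.0 * (shadow_ms - separate_ms) / separate_ms


for label, table in (("shadow", SHADOW), ("separate", SEPARATE)):
    for name, (rate, _service, response, reported) in table.items():
        implied = implied_queue_length(rate, response)
        print(f"{label:8} {name:7} implied={implied:.2f} reported={reported:.1f}")

for name in SHADOW:
    print(f"{name:7} shadow response penalty: {shadow_penalty_percent(name):.0f}%")
```

         Under that assumption the implied queue lengths all land within
    about 0.05 of the reported values, and the shadow sets show a response
    time increase of roughly 12 to 17 percent over the separate disks.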

         The problems, other than the ones listed above, that I have had to
    date are as follows, with any updates that I know of at the time of this
    writing.

         Error messages on the wrong HSC channel. This problem existed if
    you had the drives dual ported between two HSC's and the drive was
    'selected' by one of them. If the drive reported any errors, both HSC's
    received the message packet. The problem existed for the HSC which did
    not have the drive selected, so the drive was not in its list of 'known
    drives'. If the errors happened frequently enough, the opposing HSC
    invoked ILEXER to find the problem, but ILEXER could not find the drive.
    This continued until the HSC would declare the drive inoperable or
    crash. This has since been corrected and new firmware is being
    distributed to existing sites and installed on new systems. I have not
    had the firmware in long enough to determine for myself whether it is
    truly fixed. Concurrent with the firmware update, there may be a need to
    have the disk(s) reformatted to provide more space for diagnostic
    testing on the disk. SI has just announced and provided to the field
    offices a box that will enable them to format and exercise a FUJI/NEC
    drive fully, without any HSC or host support needed.

         The third screw back on the rack slides which hold the SI
    controller in the cabinet can push the metal standoff inwards and the
    result is a C-Mod card that is bent. This was the case on the SI83C
    drives and was remedied by removing the screw completely. The problem
    is a screw that is about 1/16th of an inch too long and the metal
    standoff between C-Mod cards being perfectly located to coincide with
    this screw.

         The SI controller box has only one power switch in its current
    form. The implication here is that should one C-Mod card fail in the
    controller cabinet, the entire cabinet has to be powered down to fix
    the problem, which would also cause up to four drives to be unavailable
    for use during the outage. The space inside the cabinet is laid out in
    such a way that any work done to one C-Mod card would require that all
    drives with controllers in the cabinet be turned over to field service.

         The attachment of the SDI and drive cables to the back of the
    controller cabinet is done in a very compact way, which can cause
    headaches for the field service representative. Similar cables for two
    drives are directly over one another and the cable hold downs have
    screws on the top. The result is cramped space between the top and
    bottom cables and the back of the cabinet when it is pulled out on the
    slide racks. The cable from the SI83C drives to the SI controller
    cabinet is approximately three (3) inches wide and is installed in such
    a manner that it covers over an air vent in the power supply that is
    four (4) inches wide. The result is an air vent with only one (1) inch
    of effective air flow on a power supply.

         The failure rate to date has been one C-Mod card, one power supply,
    one FUJI drive and two NEC drives. The C-Mod card and power supply
    failures I would attribute to the power failures we have had recently
    (four in five months) and are minor compared to the five HSC requestor
    cards, two HSC CPU cards, three HSC CI Link cards and various DEC HDA
    and VAX 8XXX problems I have had in the same period. The FUJI HDA
    failure happened immediately after delivery and the initial power up,
    and I consider it the result of shipping damage. The two NEC HDA
    replacements I attribute to power failures also. These are located in
    another computer room I have in Wilmington, Delaware. On a GOOD week we
    ONLY have one power failure, enough said.

         There appears to be a 'latching' problem between the 'C-MOD' card
    in the SI controller and the FUJI/NEC drives. The problem comes up when
    you power down a drive and power it up, or power a controller cabinet
    down then up. The write protect and drive ready signals can be
    latched in the wrong state and require several more power down/up
    cycles to clear them.

         Overall I have to say that I am very pleased with the disk drives,
    their performance and the support of my local field service office. SI
    appears to have a product on the market that provides an alternative to
    the equipment that is offered by DEC and the pricing is very
    attractive. Of all the issues that I have talked about here, the only
    one that could change due to your getting the SI disk drives is the
    support of the local field service office. I view this as an important
    issue and one that needs constant monitoring and changes. I feel it is
    the user's responsibility to watch the equipment on a day to day basis
    and to notify the local office of any problems. It is also the user's
    responsibility to press any issues that arise over questionable or
    incomplete support that you may be receiving. If the field service
    office is not supporting you to the extent that you feel is needed,
    take it up with your salesman, and let them work the issue for you. If
    they cannot resolve the issue, call the west coast, but only after all
    other avenues have been tried.


    NOTE***
         All comments, statements and facts here are my own, and not those
    of my employers, National Teachers Life Insurance, Teachers Service
    Organization (TSO) or any of their subsidiaries. All rights to this article
    are reserved. This article is not meant to be a 'Sales' pitch of the 
    product. I have no connections with SI, short of HEAVILY using their 
    equipment. Any electronic reprint of this article MUST completely contain 
    this NOTE. NO PERMISSION IS GIVEN TO REPRINTING THIS ARTICLE OR ANY 
    PARTS OF IT ON PAPER, OR SIMILAR SUBSTANCES.

    Paul D. Clayton
    Manager Of Systems
    TSO Financial Corp.
    Horsham, Pa. USA 19044
    Address - CLAYTON%XRT@CIS.UPENN.EDU