Ron Crandall's BTI Stories

Ron Crandall was a key employee at BTI from 1973 to 1993. As a hardworking engineer and top manager who was plugged into the grapevine, Ron witnessed a lot. He has been instrumental in helping me organize this website and has provided a lot of the material for it.

This page is a collection of various stories Ron has passed on to me via email, starting in 2007, even before I had the website set up. Because they were ad hoc and not written with the intent of being a coherent story, the presentation here is disjointed.

Ron has given me permission to post these extracts here, and I have the feeling he will be adding more later. Ron has a great memory and I feel that if I manage to strike the piñata right, more stories will fall out. Here's to hoping for that event.

A few hilarious, but possibly libelous, stories have been removed to protect the innocent (and the not so innocent).

On the idea of getting a usable BTI-8000 boot image

Even if I could find a backup tape or something similar, they are all encrypted. The algorithm isn't extremely strong, but it might be tough just to reincarnate the code... then the cryptanalysis to rediscover the key might add a little more work factor.

Tom Poulter has two actual machines, disks and some manuals. The machines should be working, given enough power (they were power hungry!). You would need both machines, as the only likely way into a machine's privileged accounts is through the remote diagnostic facility, which only runs on an 8000. Think I could find my old passwords (to get into the RDF)... but not sure.

In the event that the machines couldn't be made to work... The disk drives are 49MB 11 platter removables, I forget the actual designation. The format of the track is unique to the 8000. Each track is split into 4 disk blocks. Each block is 4140 bytes. And the blocks are composed of 460 byte pieces, 9 for the data and a tenth that is the XOR of the nine. Each piece has a sync field in front of it. Don't know if you could make sense of the data without one of the controllers even if you had another machine to read it on. The first 9 pieces are in order and not scrambled or anything unusual, so possible. Error recovery was done in software, but the tenth piece was generated and written by hardware/firmware so we didn't pay the overhead on every write.
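The piece layout Ron describes (nine 460-byte data pieces plus a tenth that is their XOR) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the controller's actual logic: the function names are made up, and sync fields and the on-disk encoding are ignored entirely.

```python
from functools import reduce

PIECE_SIZE = 460   # bytes per piece, per the description above
DATA_PIECES = 9    # nine data pieces per block (9 * 460 = 4140 bytes)


def xor_pieces(pieces):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pieces))


def encode_block(data):
    """Split a 4140-byte block into nine pieces and append the XOR piece."""
    if len(data) != PIECE_SIZE * DATA_PIECES:
        raise ValueError("block must be exactly 4140 bytes")
    pieces = [data[i:i + PIECE_SIZE] for i in range(0, len(data), PIECE_SIZE)]
    return pieces + [xor_pieces(pieces)]


def rebuild_piece(pieces, lost_index):
    """Recover one unreadable piece by XOR-ing the other nine together."""
    survivors = [p for i, p in enumerate(pieces) if i != lost_index]
    return xor_pieces(survivors)
```

Because XOR is its own inverse, any single piece (data or parity) can be rebuilt from the remaining nine, which is presumably what the software error recovery relied on. Two lost pieces in the same block cannot be recovered this way.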

Ron answers a bunch of my questions

My questions have a purplish background; Ron's replies are greenish.

Q1:

At first I said the eight CPU registers were all general purpose. But then I found that R7 was used for subroutine linkage. Were there other designated uses for certain registers? I don't mean designated by software convention, but that the instruction set itself treated a given register differently than others.

R7 was indeed used for stack linkage. In ALL of the code that we wrote R6 was used as a stack pointer. However, I don't think that the instruction set required that.

A few special purpose instructions, such as CMOVE (character move), had dedicated functions for R0, R1, and R2 (I think source address, destination address, count). PRMUT might have had a dedicated R0. I suspect there might have been a few others in the instruction set, but we were quite rigid in our register use, so that added pragmatic register assignments to help cloud the memory.

Q2:

A brochure from 1978, before the 8000 was real, claimed that all the OS was written in Pascal. My recollection was that just about everything was written in either assembly language or DRAGON. I was a hardware guy, so I didn't know first hand, but it is what I heard. Was that true? Was the claim in 78 the intent, or just marketing saying things that sounded cool?

Marketing was like any other marketing department; playing fast and loose with the truth. At the time, Pascal was the only language that we had an organized compiler project for (except, of course, basicx). As it turns out, the Pascal developer left the company and was never replaced. That was also the case with the Fortran compiler! There was some pressure within the software group to abandon all of our assembly code and rewrite everything in Pascal. At the time, it was not at all clear that ANY language would evolve as the systems programming language. C was just another contender along with BCPL and some other oddballs.

Glenn Andert and some other people from whatever school he was from, were the big proponents of a 'high level' language for systems development. They were NOT however in the OS or utilities group. All of us there, including the department head (George 'Lew' Louis) were assembly language fans. I believe all of those people were in the database project group, and they expended thousands of man hours developing their own proprietary language, Dragon. That compiler wasted an incredible amount of paper.

Q3:

The same brochure lists these languages: cobol 74, fortran 77, assembler, BASIC-X, PASCAL-X, RPG-II, DBMS-X

Were these all eventually available? Any others that weren't in that list? I recall someone working on C in 85/86, but it got canceled when BTI went through one of its contractions. I also have a vague recollection that the guy in charge of Pascal was named Bob. Odd what echoes are left in my head.

Perhaps Bob Pariseau. A really great guy.

Cobol: I seem to recollect now that we did go to RM for compilers, but I don't think any of these projects were completed.

Fortran: I think it was Lou DeMartini that worked on the Fortran compiler. The project died when he left BTI. He had started the project in assembly, believed that crap about Pascal and started recoding everything. Then, of course, couldn't test anything because we never came close to having a Pascal compiler.

I never knew of an RPG project.

The Basic-X interpreter/compiler was written in assembly and virtually all of the customers for the 8000 ported their BTI5000 programs over.

DBMS did exist, it was written in Dragon.

Q4:

The brochure from 78 talks about the 8000, but it certainly wasn't available for a long time after that. Do you know when it first shipped? 1982 maybe? When I was there in 85/86, it seemed like we shipped one or two machines a month, and the sales curve was flat. Do you know what the total installed base was for the 8000?

The first BTI 8000 was shipped in 1981. We had ghastly problems maintaining it. Two problems ganged up on us. One, we didn't yet have the remote diagnostic facility working. Once we had the RDF working properly, we could dial up a sick system and often restart it so that customers didn't lose any data (or even their sessions in most cases. Users just saw a long pause). The second problem was some kind of firmware issue that caused bus errors where the system bus just froze up. This hurt us really bad because disk transfers would just finish off with one bits.

Both issues were fixed in the same release, probably fall 1981.

Total PAID BTI 8000s probably amounted to about 25 or 30. There were 19 customer systems in 1995. The highest serial number was xxx67, but there were some gaps that I can't believe were actually filled with systems. In addition to R&D systems (named for the seven deadly sins... envy, greed, sloth, etc. I don't think we used them all up) there were 2 Field Service systems (running RDF and various F.S. related programs, problem tracking, etc), two systems in the UK office, and one or two admin systems that did a lot of BTI business applications.

Q5:

How much were you involved with the BTI-3000, 4000, 5000? I've also seen the BTI-6000 referenced. One BTI customer told me a bit about this series. The short was that the 3000 and 4000 were repackaged HP 21MX CPUs, with the BASIC dialect enhanced by BTI to be more suitable for businesses, and faster. Still, I didn't understand how a 3000 and a 4000 differed. The 5000 used a CPU of BTI's own design (although he didn't say if it was faster, slower, more capable). This is kind of open ended, but do you have sketch of the sequence?

BTI was incorporated in 1968 to sell time sharing on a HP 2000A. The software design of the 2000A had some real, serious deficiencies. Most notably, it wasn't 'crashproof'. A software or hardware crash would generally result in the loss of most of the data stored on the disk, unless an arcane recovery process to recover vital disk structure information stored in core memory was successful. The two software developers (at HP, Mike Green did the scheduler and time share part of the software, and Jerry Smith did the Basic interpreter) maintained that it wasn't their responsibility to recover from such issues: just fix the hardware. I hired on at HP in 1969 and worked on several upgrades to the HP 2000A (known as the 2000B and 2000C). Notably, a more forgiving organization was never even mentioned in requirements documents and meetings.

The 2000A was released in 1968. Tom Poulter, Steve Porter, and Paul Schmidt worked at HP at the time and thought they'd like to run their own business. So they bought one and set it up. They had a bank of modems and were eating beans for a while. Meanwhile, Poulter and Porter were rewriting code as fast as they could to deal with the reliability and maintainability issues. Note that HP, in common with most vendors at the time, did not in any way inhibit the use of the source code.

Also, along the way, P & P (Schmidt was concentrating on selling timeshare services and never really bought in to the idea of selling hardware. He was a major thorn in the side of everyone for a long time) needed to add disk storage space for the ts business. Since HP was only selling fixed head devices, and they were outrageously expensive, Porter developed a disk controller for an Iomega 2.5 MB single platter moving head disk. They were able to put that up on their ts system and continue to expand their ts business.

In about 1971, Poulter and Porter realized that the money had all been earned by the seller of the computer. Selling timeshare services at $5 an hour was a hell of a way to make a living. So they went looking for a customer. I believe that they hired a marketing representative in Arizona who actually made a few sales for them.

The first was Arne Cantrell who wanted to make software for automobile dealers. John Matthews (of Matthews Chevrolet in Tucson AZ) was willing to foot the bill for the hardware and software development. So, P&P bought the needed hardware and did a lot of software work to productize the stuff they had been working on. That system was delivered sometime in 1971 and was the first BTI 3000, not counting the one in house for the ts service.

More 3000s were sold over the next year or two. I hired on at BTI in Jan 1973 and continued to work on the software for a few years before the 8000 development got underway. Reynolds and Reynolds (no, they don't make an aluminum cigar) bought the automobile dealer business from Arne Cantrell. They wanted cost reduced versions of the machine. So, the BTI 4000 was born. It had more parts of BTI manufacture, it was more modular so that it didn't require a relay rack any more. It was made of sheet metal modules that stacked to form the system tower. Each module could be boxed and shipped easily and the installation was much quicker and easier. We shipped probably a thousand of these.

Along the way, we started investigating using HP's newest CPU in our machine. Well, this was actually a continuing process, I think the first machines shipped with a 2100 cpu. Can't remember all of the various model numbers. In any event, one of the versions had a gratuitous floating point bug, so we started to write our own microcode. (As a side note, we informed HP that they had a serious firmware bug and offered to supply details, but wanted some minor consideration. HP was notorious for ugly sales contracts and we wanted some relief from something. Wouldn't have cost them much. I guess HP R&D convinced HP management that we were blowing smoke and we got the brush off. 6 months later, HP spent millions to replace all of the microcode. The problem showed up as an anomaly in one of our standard tests, so they should certainly have seen it. Oh well.)

In addition to the floating point firmware bug, the 21mx from HP had some ugly design flaws. I was slammed hard by one. The gist of the story is that the logic of the machine (not including memory) was distributed on 5 boards (a1,a2...a5). I was chasing an intermittent problem and I was able to identify a signal that wasn't being latched correctly on the clock edge. It turned out that the signal's origin was on a1; it went to the main board (might have been wire wrapped or a similar non printed backplane) and came back on a3 where it went through some combinatorial logic before going back out on the backplane to a5. There, you guessed it, some more combinatorial logic. And the result of that went back to a1 and showed up at the D input to a FF; too late to be reliably caught on the clock edge.

I seem to remember a clock interval of 100 nS with a worst case prop delay of 120 nS. Most machines worked fine, but if very many of the parts in a given machine were slow, it would fail. And swapping boards around might totally hide the problem as you matched up different sets of boards.

This was another problem that convinced us we needed to build our own machine.

The 5000 was when we built everything ourselves, including the CPU. I'm not sure where the memory came from. We probably shipped about 2000 of these. In 1974/5 we started working on the BTI 8000. Me, George Lewis, Bill Cargile, and Bill Quackenbush (I think that's the right roster) started preliminary investigations into a new 32 bit computer system. Lew and I had worked on the timesharing system at Oregon State University and we knew what we wanted. We soon disbanded the investigation and went back to other more pressing BTI 4000/5000 work, but we still maintained some level of activity on the 8000. In 1975 (I think) we talked to Jim Meeker about our instruction set and he proceeded to design the CISC that we ultimately built. I spent a lot of this preliminary design time working on a disk file system structure that would meet our goals, primarily crashproof. By 1976 we were assigning some people full time to the 8000. I (and many others) didn't go full time until a re-org in Feb 1977. That re-org had some nasty side effects, but that's another story.

Q6:

I left BTI in 86 and didn't follow it after that. I know there was a tie-in with a BTI in England, which is still there. What did the business model turn into? How long were the old systems supported? Was there any new development, or was it just support?

The UK office took on maintenance of things like tape drives that we used. In the US, we just shipped them back to the mfr. But BTI UK became the official European rep. for the mfr. They continued to build quite a business of electronic equipment maintenance. Poulter sold or donated the shares of BTI UK to the workers there a few years ago.

BTI started shrinking in 1981 when the 8000 didn't sell in the huge numbers we were hoping for and the recession dried up our sales of the 5000. That was particularly ugly, a good example of a contract provision having unintended side effects. Actually, worth describing...

Our contract with R&R called for manufacture of BTI 5000s; at the peak we were building a little more than 140 a month. They got a discount price depending on how many were released to be built (not the problem). They also had a provision that they had to 'release' (ahead of time by a few months, I'm not sure how long) for production some number of machines. They were only allowed to change that number from month to month by some percentage (about 20%, but I don't know the details). We held completed machines that R&R had not yet sold, so we could see how they were doing. When they sent us an address, we shipped the machine. In summer 1980, R&R noticed that they were starting to build up a little inventory of unsold machines. So, they instigated a sales contest to try to move them. Now, apparently, the incentives were pretty impressive, and the sales force started to really move machines. So, we got releases for MORE production. Then, at the end of the sales contest, sales vanished... every customer who had any thought of buying one had been tapped. But, meanwhile, we are building 140 systems a month. R&R is now reducing their releases at the highest rate allowed in the contract (.8 times .8 times .8, etc.). So at some point the new orders from R&R go to zero and we have 750 systems sitting in inventory. Worse, the 8000 is about to enter production and we had high hopes for lots of sales, so were really slow to lay off our experienced mfring employees. This culminated in a massive layoff in Jan 1982, which was followed by lots of desertions as scared people made a move. We had two more minor layoffs that year (I was able to get rid of a particularly annoying jerk of an employee in one of them).
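To get a feel for how slowly a "20% per month" limit winds production down, here is a rough sketch of the arithmetic. The 140-a-month peak and the 20% figure are Ron's approximate recollections, and the function name is made up; the real contract's lead time and the point where orders dropped to zero are not modeled.

```python
def release_schedule(start, max_cut, months):
    """Monthly release counts if the buyer cuts by the maximum allowed each month.

    start   -- releases in the first month (assumed: 140)
    max_cut -- largest allowed month-to-month reduction (assumed: 0.20)
    """
    return [start * (1 - max_cut) ** n for n in range(months)]


if __name__ == "__main__":
    for month, count in enumerate(release_schedule(140, 0.20, 12)):
        print(f"month {month:2d}: about {count:.0f} machines released")
```

Under these assumed numbers the releases go 140, 112, about 90, about 72, and so on, so a full year after sales stop the factory is still committed to roughly ten machines a month. The cumulative total over the wind-down can approach, but never exceed, 140 / 0.20 = 700 machines.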

In March 1986, BTI decided that the 8000 was no longer a viable product and reduced work force to go into maintenance mode.

After this layoff, Poulter guaranteed employees that all future layoffs would have 90 day notices. This seemed to help stop the loss of the best employees and we gradually shrunk down as the systems in the field came off of maintenance contract. In 1993, the company was down to about a dozen (I was terminated in Sep, but continued with some contract work until 2001 or so). BTI tried to introduce several new products with no success (oddly, one that they struggled with in the US succeeded in the UK). For a while, BTI had no office. The final customer systems were maintained by Phil Deal and he ran the RDF machines from his house. The last of the field systems was retired in about 2002 and BTI US was closed down.

On RDF (Remote Diagnostic Facility)

The code for the RDF was written by Guy Lauterbach. His basis was the software debugger and 'control mode', our term for what is now called a shell. He just substituted an encrypted communications link and appropriate commands for the debugger and shell commands.

The RDF firmware (living on the SSU board running a Motorola 8x300 IIRC) was written by Jeff Libby. After Jeff left BTI, I took over the firmware and made some modifications for higher throughput.

The encryption algorithm and the communication protocol were mine... I'm not embarrassed by them. They served their purpose fairly well. Both Jeff and Guy had to implement the corresponding parts of the algorithm in their respective code.

And because of the error detecting and anti-replay protection on the communications packets, the 9600 baud line was actually the equivalent of 4800 baud. People often mistakenly blamed the encryption for that. But the encryption was well within the capability of the machines involved to do on the fly with almost no impact to the transfer rate and it did not expand the data. The real overhead was the checksums and sequence headers. But we were quite confident that...

  1. no one could break into a machine using the RDF link
  2. no one could hijack one of our sessions to break into a customer machine
  3. no one could replay a session or part of a session to spoof us or the machine into accepting illegitimate data
  4. line errors would be corrected by ARQ
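The combination of checksums, sequence headers and ARQ can be illustrated with a toy packet format. Nothing here reflects the real RDF protocol, whose details are not given: the 32-bit sequence number, the CRC-32 trailer and all of the names are assumptions made for the sketch, and the encryption layer is left out entirely. Without that layer a plain CRC only catches line noise, not deliberate forgery.

```python
import struct
import zlib

HEADER = struct.Struct(">I")    # hypothetical 32-bit sequence number
TRAILER = struct.Struct(">I")   # hypothetical CRC-32 over header + payload


def make_packet(seq, payload):
    """Frame a payload as: sequence number, payload, checksum."""
    body = HEADER.pack(seq) + payload
    return body + TRAILER.pack(zlib.crc32(body))


class Receiver:
    """Accepts each sequence number once, in order; anything else is dropped."""

    def __init__(self):
        self.expected = 0

    def accept(self, packet):
        """Return the payload, or None if the packet must be rejected or resent."""
        if len(packet) < HEADER.size + TRAILER.size:
            return None
        body, trailer = packet[:-TRAILER.size], packet[-TRAILER.size:]
        (crc,) = TRAILER.unpack(trailer)
        if crc != zlib.crc32(body):
            return None                  # line error: sender retransmits (ARQ)
        (seq,) = HEADER.unpack(body[:HEADER.size])
        if seq != self.expected:
            return None                  # replayed or out-of-order packet
        self.expected += 1
        return body[HEADER.size:]
```

Every packet carries eight bytes that are not user data, and a damaged packet costs a full retransmission, which is the kind of overhead Ron attributes the halved throughput to.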

The RDF allowed BTI maintenance people to actually log onto a customer machine to any account to which they had the password including as @005 (the 'superuser') account.

Irrelevant aside... the @005 account passwords were unique to each machine. They were often topical, sometimes related to the city or organization where the machine was installed, and very frequently funny as hell, and often not repeatable in polite company.

If the target machine was down, the RDF allowed dumps of registers, memory, etc. Virtually as if you were there.

I can remember sitting in my family room in San Jose, dialed into the BTI RDF machine (an 8000 was dedicated to the job, although it could be and was used for other functions) using the RDF to debug a machine in Australia. I was able to rectify whatever glitch had hit it and get it going.

On the BTI-8000 Backplane Design

During the first two years of design, the plan was to have an asynchronous bus and all design decisions were predicated on this. Each board would have its own clock and the bus arbitration would take care of syncing up. The advantage would be that we didn't need to distribute a system clock, thereby eliminating a single point of failure. Also, each board could run at the maximum speed that it was capable of. And redesigns of boards could easily run at a higher speed.

I'm not sure who was the 'champion' of this approach. Almost certainly Bill Cargile. But the guy who got to try to implement it was Roger Fairfield.

Roughly in the 1976 time frame, Roger could be found in the lab every day trying to make the asynchronous bus work reliably. Small backplane interface test boards were fabricated and long running tests were set up. Every test would fail about once a day.

This testing and groping for solutions went on for most of a year. Finally, Bill Quackenbush revamped the test setup slightly and demonstrated that the design would not work reliably.

Bill set up two adjacent simulated boards with clocks that were running essentially identical frequencies. But the phase between them varied gradually over the course of time... an hour or two? I have no idea how this was done. Perhaps there was a PLL and a phase shifter... I just don't know. Perhaps two xtal oscillators with trimmers...

Ordinary TTL F series flip-flops were being used for the synchronizers. 74fxx parts.... can't remember the exact designation. It was fascinating to watch Bill's demo. On the scope, you could see the two FF outputs as the clock and data transitions crossed one another... Of course, the FF had minimum setup and hold requirements on the latch input relative to the clock. And equally 'of course' those constraints were violated during this crossing. You could see the early clocks cause the output to snap to the desired arbitration state, and the late clocks to the other state as well... but deep in the middle of the violation zone, you'd see a microscopic window where both q and qbar outputs of the FF would go low... and stay low for much longer than the advertised settling time of the FF. I believe the FF has a clock to output prop delay of 5 ns. We saw both outputs low for upwards of 40 ns. It was an eye-opening demonstration and an impressive method to show it as well.

Since Bill Cargile was no longer employed at BTI, it was relatively easy to convince everyone to go to a synchronous backplane. This required revamping all of the bus logic, but since no boards (other than backplane interface test boards) had been built yet, not a big problem.

Another note, the BTI 5000 hardware was emulating not only the peripherals, but also the instruction set. We had several BTI 5000 boxes connected to backplane interface cards. One would simulate the cpu (early, with 'internal' memory, later, when an MMU was available, it would go to the backplane for memory operations). Others would do disk I/O, etc.

Interestingly enough, we never tried a dual cpu emulation... the first dual cpu system operation was live with real cpus. It came up and ran first time.

Deadlocks

And a note on deadlocks. All of us knew how to avoid that particular complication. The callback scheme, in particular, was immune to deadlocks. The OS itself also had subtle 'rules' to prevent deadlock. It's possible that these rules are noted in the listing, but I wouldn't count on it.

The internal rules in the OS for preventing deadlock amounted to placing all locks into categories, ordering the categories, and ensuring that every operation that needed multiple categories used the same ordering.
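The category-ordering rule is a standard technique, and a small Python sketch shows the idea. The category names and ranks are invented for illustration; the actual categories used in the 8000 OS are not recorded here.

```python
import threading
from contextlib import ExitStack

# Hypothetical categories; a lower rank is always acquired first.
RANK = {"directory": 0, "file": 1, "buffer": 2}


class RankedLock:
    """A lock tagged with the category that fixes its place in the ordering."""

    def __init__(self, category):
        self.category = category
        self.rank = RANK[category]
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def acquire_in_order(*locks):
    """Acquire all locks in ascending rank, whatever order the caller lists them."""
    stack = ExitStack()
    try:
        # id() breaks ties so two locks of one category also get a fixed order.
        for lock in sorted(locks, key=lambda lk: (lk.rank, id(lk))):
            stack.enter_context(lock)
    except BaseException:
        stack.close()
        raise
    return stack
```

A caller writes `with acquire_in_order(buffer_lock, directory_lock): ...` and the locks are released in reverse order on exit. Since every thread takes the locks in the same global order, no cycle of waiters can form. The hard part, as the next story shows, is finding categories and an ordering that every operation can actually live with.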

Once I was working on a particularly tricky part of the OS and I was having trouble defining categories and an ordering that would work. I'd been hitting a brick wall for weeks, but then went on vacation. So I was in Eureka, Nevada at a friend's house and one morning I woke up and had the solution.

Lots of inspiration required to make this work.

Wire-Wrapped CPU Boards

The first CPUs were wire wrapped, but I don't know if we actually shipped any that way. I think by the time we got to shipping, we had gotten a multilayer board working. Funny story there too.

Our wizard layout guy was named John Caris. He could take a BTI 5000 schematic and lay it out in 2 weeks. Those cards were about 8" by 10". 80 sq inches. Since the 8000 boards were 460 sq inches, Lew figured that Caris could lay out one in 12 weeks. I don't know what he was smoking to think that the effort was linear in board area, but there you have it.

Well, when the first CPU layout was started, Caris was having a lot of trouble because of the sheer complexity. Also, I'm guessing it didn't have enough layers to even be possible, but don't remember. Well, the project was getting later and later and Caris was busting his butt. So Lew (in another inspired decision, unbelievable) decided that we'd take our other PC layout guy and have him work a night shift so we could be doing layout on this one board 16 (or maybe 20) hours a day.

As you might guess, this didn't work for shit.... Caris would spend half the day tearing out the other guy's stuff so he could route his stuff, and vice-versa. It was a major cock-up.

That's when we decided that the CPUs had to be wire-wrapped to get a working version. They occupied two slots in the backplane, so were a known compromise. But they allowed Quackenbush to debug the cpu design and get working hardware. To add insult to injury, when we did get PC boards, they didn't work because of cross coupling between the bus runs (all the 32 lines for a bus were often run parallel for a ways), a problem not seen in the point to point wire wrapped boards.

On Being a (Text Editor) Curmudgeon

I was very slow to adopt a lot of things. I started using more and more features of the full screen editor when I started doing things like adding comments to someone's sparsely commented code, etc. I'm not sure I ever ended up using the full screen display, at least not as the default. Remember that the line editor did allow one to edit within the line, it wasn't so primitive as to have to replace whole lines.

When Tovar left BTI, we limped along with the bugs in the editor for quite a while. When the succeeding maintainer resigned, I took it over. I went through it and fixed a number of unclean practices (pushing and popping temporary variables rather than preserving a constant stack frame) and all of the problems magically went away.

I'm sure there are lots of stories about me! I was seen as a real impediment to progress in a number of areas. I was just very conservative and reluctant to even try 'radical' new ideas, such as high level languages. Especially true for the BTI 8000, since we designed the instruction set to accomplish the jobs we needed. No compiler could use the instructions... they just didn't fit the paradigm of any known language. Dragon was a force fit of a language to the instruction set started by Glenn Andert (and implemented by Pat Helland, I think).

From the time I started programming computers (in 1965) until 1995, I had only written in either absolute hex code (Alwac IIIe, see Axel Wenner-Gren in Wikipedia and some info from googling) or assembler, barring a little fortran. And of course Basic, so that I could test stuff on the BTI machines. I was expected to know C and do stuff like code reviews in 1995, but I finally started to do serious programming in C only in 1998.

Interesting story about me and text editors....

In 1993, I was laid off from BTI, due to lack of money and also due to lack of work for me to do. However, I did have a contract to do some work. This meant that I had to continue to use the BTI editor no matter what I used in my new job.

In 1995, I went to work at a job where I needed to learn how to use Unix. And I tried to use vi. Well, I'd sorta learn a few things, then either that night, or the next weekend, I'd be bollixing up edits in the BTI editor. And when I went back to vi, I'd be in a similar pickle. The editors were just too similar; I couldn't keep them distinguished.

So I asked about alternatives on Unix and someone suggested emacs. So, I learned that instead. It was so different that I didn't have the adverse cross-training. I've used emacs ever since. Sometimes I'll be working on some minimal implementation where all I've got is vi. I know just enough to be able to do the rudimentary things I need.

Al Zimmerman Lampoons BTI Characters

Do you happen to have the screamingly funny stories that Al Zimmerman wrote? They were a sort of history of the 8000 from his viewpoint, but written in middle earth style. One of them, I never did see... I suspect because it poked fun at me. But many people showed up looking like reasonable players, or at worst, someone who made a reasonable mistake and had to work to fix it.

I dunno if maybe Robert Adams (at Intel) still has copies. He'd be most likely. Next time I talk to Carl First I'll ask him, but I wouldn't think so. And Guy Lauterbach as well.

BTI's Time Off Policy

The BTI vacation policy early on was three weeks of paid vacation (increasing by a half a day for each year of service), along with ten paid holidays and 5 days of sick leave. At some time, the vacation time and sick leave were combined into PPAT (personal paid absence time). So, we got 20 days plus of PPAT, ten paid holidays, and (also later) 4 days of CPC (Christmas plant closing). I was very against the combination of vacation and sick time. I felt that it would encourage people to come to work sick. For whatever reason, that didn't seem to be a problem. Near the end of my tenure (20+ years) I was earning 30 days of PPAT (six weeks) a year. That was the hardest thing to give up when I went to a new job. If you total that all up, it amounts to 44 out of 260 weekdays in a year I had off.

The Epic Data Recovery Episode

One year, about 1985, probably in June or July, I heard that customer service was working a problem at a customer system (DiSalvo Trucking, I think. It was a several hour drive away in the central valley as opposed to a major airline flight). I heard that they'd had a head crash on Monday night and CS had attempted to recover the backup tapes on Tuesday with no luck; a series of errors during the recovery process.

On Wednesday, I heard that they were going to bring the tapes back to our plant for recovery. Some thought that the alignment of the heads on the customer tape units might have been off and we could experiment with various mis-alignments of our engineering units.

We had two different recovery programs... an on-line recovery program, primarily intended to recover from small disasters where only a few files were needed, although it could just as easily recover a whole volume. There was also an off-line recovery program that was a stand-alone program. It was intended for use more as a boot-strap to make a bootable system disk following a disaster. I had written parts of the off-line recovery program and much of the supporting environment (off-line device drivers, etc.) so I understood this program much better. I studied up on these programs while awaiting the tapes.

On Thursday I arrived at work to a stack of tapes and a machine that had disk drives attached that could be sent to the customer. I decided to try the optimistic approach and just put up the first tape and started the on-line recovery operation. It read to near the end of the first tape and crashed in spectacular fashion. Not being familiar with its internals, I puzzled over it for a while, then decided I would continue with the off-line version. So I restarted the recovery with the off-line recovery and waited until the end of the first tape where an error blew up the recovery. In looking at the debris, it seemed that the tape record that was being processed was total garbage - but no error had been reported. A look at the code showed that the recovery operation would indeed fail in the face of corrupted data.

So, using the debugger, I backspaced the tape and re-read the record in question. I thought it possible that the recovery program itself had corrupted the buffer, so wanted to re-read the record. I was surprised to see that the re-read record was perfect in every respect.

So I reassembled the program with a backspace, re-read and compare for every record, with a halt on mis-compare. Needless to say, this made an already agonizingly slow operation even slower. So, using the debugger and analyzing many of the records to verify which of the reads of the record was correct, etc., (at least I'd made the default restart point backspace and do the two reads again) I worked my way through all eight tapes of the main backup. This brought me up to midnight, with the incremental backup tapes still to go. So, at this point, I took time out to make a disk to disk backup of the data already recovered and proceeded to tackle the incremental tapes. This meant that I was starting to tackle the tricky task of combining the incremental data with the full backup at about 4 AM, when I was truly bushed. Naturally, I made some minor brain fart and had to recover some files a second time. Thankfully, I had made the disk to disk backup and could easily retrieve those files.
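The read, backspace, re-read and compare loop Ron added can be sketched as follows. The real change was made in BTI assembly inside the off-line recovery program; this Python version only shows the control flow, and the tape object with its `read_record()` and `backspace()` methods is a made-up stand-in for the drive.

```python
class MiscompareError(Exception):
    """Raised when two reads of the same record disagree."""


class FlakyTape:
    """In-memory stand-in for a drive that silently garbles certain reads."""

    def __init__(self, records, bad_reads=()):
        self.records = list(records)
        self.bad_reads = set(bad_reads)   # read-call numbers that return garbage
        self.position = 0
        self.read_calls = 0

    def read_record(self):
        data = self.records[self.position]
        self.position += 1
        self.read_calls += 1
        if self.read_calls in self.bad_reads:
            return bytes(b ^ 0xFF for b in data)   # corrupted, no error flagged
        return data

    def backspace(self):
        self.position -= 1


def read_verified(tape):
    """Read the next record twice and return it only if both reads agree."""
    first = tape.read_record()
    tape.backspace()
    second = tape.read_record()
    if first != second:
        tape.backspace()   # restart point: the same record will be read again
        raise MiscompareError("reads disagree; inspect the record by hand")
    return first
```

On a mismatch the sketch leaves the tape positioned in front of the suspect record, mirroring the restart point Ron mentions, so a retry repeats both reads. It cannot tell which of the two reads was the good one (that took the debugger and manual analysis), and it would miss a record that came back identically wrong twice.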

Sometime around 7 AM, I had finished the combining and completed a disk to disk backup. I was looking forward to getting home and getting some rest. I was in the process of writing up a note to inform the CS personnel of the status of the project when the first of a wave of concerned management/marketing types descended on my office. I was kept busy for two more hours explaining the situation to a steady stream of visitors.

Worse, when I arrived home, I couldn't just go to bed; it was a Friday that I had taken as vacation and we had to drive to our weekend vacation rental. I was zombified for most of the vacation.

The problem with the tapes? Who knows. It looked to me like the controller would randomly decide that it had an error on a track (I believe that the recording protocol indicated an erasure) and the proper protocol was to use the parity to correct that track from that point on. My guess was that the tape controller would correct the wrong track. No one who looked at the controller design was able to confirm or deny that suspicion... Perhaps the original designer was no longer available and the follow on responsible parties just didn't understand it. But I do know that this particular problem was never really resolved. Almost all of our customers did disk to disk backups to removable disk packs, so we had little exposure from this.

Random Recollections

