[comp.editors] What is buffer gap you ask?

jpayne%flam@Sun.COM (Jonathan Payne) (01/04/90)

Since my posting I have had a lot of people ask what buffer gap is.
So instead of answering each in turn I figured I would just post a
message.  I learned about buffer gap in the Emacs Cookbook that was
written at MIT a million years ago.  I think it was someone's master's
thesis.  I didn't see it implemented until Gosling's emacs.

Buffer gap is just one way to store a bunch of characters you wish to
edit.  (Another way is having a linked list of lines.)  Emacs uses the
buffer gap mechanism, and here is what a buffer looks like in emacs.

|<----- first half -----><----- gap -----><------ second half ------>|

An emacs buffer is just one huge array of characters, with a gap in the
middle.  There are no characters in the gap.  So at any given time the
buffer is described as the characters on the left side of the gap,
followed by the characters on the right side of the gap.

Why is there a gap?  Well if there isn't a gap and you want to insert one
character at the beginning of the buffer, you would have to copy all the
characters over to the right by one, and then stick in the character you
were inserting.  That's ridiculous, right?  Imagine editing a 100kbyte
file and inserting characters at the beginning.  The editor would be
spending the entire time copying characters over to the right one.
Delete would be done by copying characters to the left, just as slow.

With a buffer gap, editing operations are achieved by moving the gap to
the place in the buffer where you want to make the change, and then
either by shrinking the size of the gap for insertions, or increasing the
size of the gap for deletions.  In case it isn't obvious, this makes
insertion and deletion incredible simple to implement.  To insert a
character C at position POS in a buffer, you would do something like this:

	/* Move the gap to position POS.  That is, make the first half
	   be POS characters long. */
	Buffer_MoveGap(b, pos);
	b->data[b->first_half++] = c;
	b->gap_size -= 1;

There, done.  The gap is now one character smaller because we made the
first half bigger while sticking in the character we wished to insert.
Actually, at the beginning of the routine there should have been a check
to make sure that there is room in the gap for more characters.  When the
gapsize is 0, it is necessary to realloc the entire buffer.  Deletion is
even easier.  To delete N characters at POS, all you do is make the gap
bigger!  That is, by making the gap bigger, you take away from characters
that used to be part of the buffer.

	Buffer_MoveGap(b, pos);
	b->gap_size += n;

That is delete, folks.

Moving the gap is pretty trivial.  Just decide if you are moving it
forward or backward, and use bcopy.  bcopy is smart enough to handle
overlapping regions.

So, when emacs reads in a file it allocates enough memory for the entire
file, plus maybe 1024 bytes for the buffer gap.  Initially the buffer gap
is at the end of the file, so when you insert at the beginning of the
file right after reading it in, you will notice a longer delay than
usual, because it first has to move the gap to the beginning of the
file.  The gap only has to be moved when you are doing edit operations.

To examine the contents of a buffer, you can define a macro:

	Buffer_CharAt(b, pos)

All this does it check to see whether pos is in the first half or the
second half, and then index that one HUGE array of characters correctly.
It's surprisingly fast, actually.

Somebody mentioned that GNU search takes two strings to search in.  That
was Stallman's way of optimizing the hell out of the search.  The two
strings passed in represent the first half and the second half of the
buffer.  Now the search code does not have to use the Buffer_CharAt macro
to examine the buffer because it is guarunteed to have two contiguous
strings of valid data.  That's a good idea - I might have to adopt that
approach.

Notice that with this approach to storing the file, you can edit binary
files without thinking about it.  There are no line length limits.
Newline characters are just normal characters, which you can search for
if you want.  Also notice that to read a file, you do the following:

	fstat(fd, &stbuf);

	/* make sure the gap is at least as big as the file */
	Buffer_EnsureGapSize(b, stbuf.st_size);
	read(fd, &b->data[b->first_size], stbuf.st_size);

There, we just inserted a file into the buffer ... with ONE call to the
read system call.  Straight into the buffer, not into a temporary
buffer.  This is FAST.  On my Sun 3/60 it takes about a second to read in
a megabyte.

Redisplay.

One of the hard things with buffer gap is coming up with a good, fast
redisplay algorithm.  With a linked list approach, it's really easy to
write one.  It's important that internally the editor can munge around
and make changes to the buffer, and then just make one call to redisplay
when it's done.  And the redisplay should be smart enough to use insert
and delete line capabilities of the terminal/window system for scrolling
and for when lines are inserted and deleted.  The way any smart redisplay
algorithm should work is in stages:

	1) Layout the window.  This means figure out which lines are going
	   to go where; basically how the screen should look when the
	   redisplay is done, without actually doing any output to the
	   screen.

	2) Compare this layout to the layout of the screen after the last
	   redisplay.  This boils down to comparing lines to see if the
	   same lines are in the same place as before.  When you scroll,
	   all the lines will be off by how ever many lines you scrolled
	   by.  This is easy to detect, and when you do, you just use the
	   terminals insert/delete line capability to make the redisplay
	   faster.

	3) Repaint the lines which are different from before.  Lines are
	   different either because they have been changed, or because
	   they weren't even visible before.

So how do you actually compare lines quickly?  You can't do a string
comparison, because the scrolling algorithms have to do lots of
comparisons.  With a linked list approach to storing a buffer, you can
just compare the line pointers (a simple integer comparison!) in phase 2
(above).  So it's nice and fast.  But what do you do for buffer gap
implementations?  Well there are a couple of solutions, and I can tell
you two of them.  The first is Gosling's approach (which GNU still uses
to a certain extent) which is to calculate a hash value for each line
based on the contents of the line as it would appear on the screen, and
then using those hash values to do phase 2 of the redisplay.  This has
two bad problems in my opinion.  The first is that it takes a long time
to calculate those hash values, and second, since it's based on the
contents of the lines and not the physical position of those lines in the
file, the redisplay can be soooooo smart that it does unintuitive
things.  For instance, if you have duplicate code (shame shame shame) and
you type ^V to move forward a page, emacs will sometimes scroll BACKWARDS
on you!  It's very confusing, even if it is the fastest way to update the
screen.

Since it takes so long to calculate those hash values, Gosling and
Stallman had to kludge up the redisplay algorithm to try and optimize the
90% case of just inserting characters.  That's a shame and isn't
necessary.

I am not sure if I am allowed to describe the fast way to do this, but if
you are interested, I will see whether my managers will complain.

Anyway, so I hope this helps people understand buffer gap.  I can't
imagine writing an editor any other way, even though it does have it's
disadvantages.

walker@hpl-opus.HP.COM (Rick Walker) (01/04/90)

/ hpl-opus:comp.editors / jpayne%flam@Sun.COM (Jonathan Payne) / 11:41 am  Jan  3, 1990 /

> Buffer gap is just one way to store a bunch of characters you wish to
> edit.  (Another way is having a linked list of lines.)  Emacs uses the
> buffer gap mechanism, and here is what a buffer looks like in emacs.
> 
> |<----- first half -----><----- gap -----><------ second half ------>|
> 
  ... (lots of explanation deleted...)

> Anyway, so I hope this helps people understand buffer gap.  I can't
> imagine writing an editor any other way, even though it does have it's
> disadvantages.

I was also convinced that the buffer gap was the best way to write an
editor until I read "RED, a better C screen editor" by Edward K. Ream 
in Dr. Dobbs Journal, Number 81, July 1983.

The buffer routine describe therein uses virtual memory techniques and
caching to speed up buffer access.  Basically the edited file is read
into a sequence of blocks (~1-2k in size).  Each block has a pointer to
the disk address of both the previous and next block of the file (ie, doubly
linked), a dirty bit, and a count of how many characters are stored in
the block, and some number of characters from the file.

When inserting new text into a file, it is put in the current block if
there remains room. If it will not fit, a new block is allocated, linked
into the structure, and filled with the new data.  Similiarly, when
deleting, data blocks are merged or freed as needed and the empty disk
blocks are added to the free list.  Blocks which are modified have their
dirty bit set.

When browsing sequentially through a file, new blocks are read in as needed
and stored in memory.  The block slots in memory are cached in a Least
Recently Used (LRU) fashion.  This means that scrolling up and down in a file
over a small range will result in zero disk accesses because all the file
info is held in the block cache.

When it is necessary to flush a block to have space to read in a new
block, the least recently used block is replaced by the newly requested
block.  It is at this point that there can be a big performance
improvement over the buffer gap method.  In the buffer gap approach,
data pulled from the top of the gap must always be copied to the bottom
of the gap.  This involves alot of slow copying if the buffer gap file
is on disk.  In the RED system, only modified pages must actually be
written back to disk.  If the page is not modified (ie, the dirty bit is
not set on the page), then it is merely discarded from the cache when
new space is needed.  Simple sequential browsing through the file is
therefore twice as fast as the buffer gap method (no disc writes
required).  If you do alot of scanning of your file (regexp searches,
etc), the buffer gap approach will be twice as slow as the RED method.

If initialization of the later tempfile blocks is deferred until they are
actually needed, then the start up time for the editor is negligable.
For example, one could initialize only the first couple of buffer
blocks, start the editor with the cursor on line 1, and only read in later
blocks when the user scrolled up to that point in the file.

You may argue (correctly) that if the buffer gap algorithm is
implemented on a virtual memory machine that all the VM stuff is handled
by the operating system.  The price you pay is that you have a large
core image to be swapped around  (have you ever heard anyone complain about
the size of EMACS?).  Another gotcha is that every sweep through the file
in the buffer gap method modifies every single page of buffer memory. If the
file is bigger than memory, and therefore must be paged, the gap method
therefore forces the operating system to write all the pages. This
makes very poor use of the disk swap system.  RED on the other hand
only modifies pages when necessary and can usually avoid having to write
out stale pages.

The RED method allows a configurable and continuous trade-off of core
size vs cache miss ratio.  If desired, a very small but usable editor
can easily be configured.

Finally, the buffer gap editor is not portable to non-VM systems without
putting restrictions on the size of file that can be edited.  In
particular, files must fit in the available RAM instead of being limited
to available disk size. 

Cheers,
Rick Walker

--
Rick Walker
!hplabs!walker

fin@uh.msc.umn.edu (Craig Finseth) (01/05/90)

In article <62420008@hpl-opus.HP.COM> walker@hpl-opus.HP.COM (Rick Walker) writes:
	...
	I was also convinced that the buffer gap was the best way to write an
	editor until I read "RED, a better C screen editor" by Edward K. Ream 
	in Dr. Dobbs Journal, Number 81, July 1983.
	...

As I mentioned in my previous posting, this technique was described in
my 1980 thesis as "paged buffer gap."  It was first implemented in a
floppy-based CP/M environment, and was well-suited to that
environment.  This technique essentially allows you to create a
virtual memory environment in software.

However, it is a (not universally accepted) software axiom that an
application-writer has better things to do than re-implement the
operating system.  CP/M did not offer any paged virtual memory
support.  Hence we did not re-implement anything.  However, a "typical
workstation environment" (e.g., Sun) offers the virtual memory support
in hardware.

For those of you who do not accept the aforementioned axiom, I contend
that the paged buffer gap scheme performs worse than a standard buffer
gap scheme for the following reasons:

- First, all sequential search operations require that the data being
searched through be in memory, at least momentarily.  (Search strings
are in general much smaller than page or line sizes, so sub-linear
searches do not save a significant number of paging operations.)
Standard buffer gap schemes offer the most efficient use of both
memory and paging hardware.

- Next, extra overhead is incurred during file reads, which are relatively
common.  (The same overhead is incurred during writes, but they are
relatively rare.)

- Additionally, paged buffer gap is more complex to implement than
standard buffer gap (personal experience: I've done both).

- Finally, Modern hardware is fast enough that large blocks of memory
can be moved quickly and efficiently.  In particular, a recent message
contained timing data showing that an 80K byte buffer move was not a
significant problem.  Hence, paged buffer gap is a solution in search
of a problem.

Remember that the gap only moves when both of these conditions are met:

- An insert or delete (but not replace) operation is performed.

- The gap is not at the place where the operation is to be performed.

The gap never moves (and pages are never modified) when data is only
being read.


Craig A. Finseth			fin@msc.umn.edu [CAF13]
Minnesota Supercomputer Center, Inc.	+1 612 624 3375