[comp.emacs] What is buffer gap you ask?

jpayne%flam@Sun.COM (Jonathan Payne) (01/04/90)

Since my posting I have had a lot of people ask what buffer gap is.
So instead of answering each in turn I figured I would just post a
message.  I learned about buffer gap in the Emacs Cookbook that was
written at MIT a million years ago.  I think it was someone's master's
thesis.  I didn't see it implemented until Gosling's emacs.

Buffer gap is just one way to store a bunch of characters you wish to
edit.  (Another way is having a linked list of lines.)  Emacs uses the
buffer gap mechanism, and here is what a buffer looks like in emacs.

|<----- first half -----><----- gap -----><------ second half ------>|

An emacs buffer is just one huge array of characters, with a gap in the
middle.  There are no characters in the gap.  So at any given time the
buffer is described as the characters on the left side of the gap,
followed by the characters on the right side of the gap.

Why is there a gap?  Well if there isn't a gap and you want to insert one
character at the beginning of the buffer, you would have to copy all the
characters over to the right by one, and then stick in the character you
were inserting.  That's ridiculous, right?  Imagine editing a 100kbyte
file and inserting characters at the beginning.  The editor would be
spending the entire time copying characters over to the right one.
Delete would be done by copying characters to the left, just as slow.

With a buffer gap, editing operations are achieved by moving the gap to
the place in the buffer where you want to make the change, and then
either by shrinking the size of the gap for insertions, or increasing the
size of the gap for deletions.  In case it isn't obvious, this makes
insertion and deletion incredible simple to implement.  To insert a
character C at position POS in a buffer, you would do something like this:

	/* Move the gap to position POS.  That is, make the first half
	   be POS characters long. */
	Buffer_MoveGap(b, pos);
	b->data[b->first_half++] = c;
	b->gap_size -= 1;

There, done.  The gap is now one character smaller because we made the
first half bigger while sticking in the character we wished to insert.
Actually, at the beginning of the routine there should have been a check
to make sure that there is room in the gap for more characters.  When the
gapsize is 0, it is necessary to realloc the entire buffer.  Deletion is
even easier.  To delete N characters at POS, all you do is make the gap
bigger!  That is, by making the gap bigger, you take away from characters
that used to be part of the buffer.

	Buffer_MoveGap(b, pos);
	b->gap_size += n;

That is delete, folks.

Moving the gap is pretty trivial.  Just decide if you are moving it
forward or backward, and use bcopy.  bcopy is smart enough to handle
overlapping regions.

So, when emacs reads in a file it allocates enough memory for the entire
file, plus maybe 1024 bytes for the buffer gap.  Initially the buffer gap
is at the end of the file, so when you insert at the beginning of the
file right after reading it in, you will notice a longer delay than
usual, because it first has to move the gap to the beginning of the
file.  The gap only has to be moved when you are doing edit operations.

To examine the contents of a buffer, you can define a macro:

	Buffer_CharAt(b, pos)

All this does it check to see whether pos is in the first half or the
second half, and then index that one HUGE array of characters correctly.
It's surprisingly fast, actually.

Somebody mentioned that GNU search takes two strings to search in.  That
was Stallman's way of optimizing the hell out of the search.  The two
strings passed in represent the first half and the second half of the
buffer.  Now the search code does not have to use the Buffer_CharAt macro
to examine the buffer because it is guarunteed to have two contiguous
strings of valid data.  That's a good idea - I might have to adopt that
approach.

Notice that with this approach to storing the file, you can edit binary
files without thinking about it.  There are no line length limits.
Newline characters are just normal characters, which you can search for
if you want.  Also notice that to read a file, you do the following:

	fstat(fd, &stbuf);

	/* make sure the gap is at least as big as the file */
	Buffer_EnsureGapSize(b, stbuf.st_size);
	read(fd, &b->data[b->first_size], stbuf.st_size);

There, we just inserted a file into the buffer ... with ONE call to the
read system call.  Straight into the buffer, not into a temporary
buffer.  This is FAST.  On my Sun 3/60 it takes about a second to read in
a megabyte.

Redisplay.

One of the hard things with buffer gap is coming up with a good, fast
redisplay algorithm.  With a linked list approach, it's really easy to
write one.  It's important that internally the editor can munge around
and make changes to the buffer, and then just make one call to redisplay
when it's done.  And the redisplay should be smart enough to use insert
and delete line capabilities of the terminal/window system for scrolling
and for when lines are inserted and deleted.  The way any smart redisplay
algorithm should work is in stages:

	1) Layout the window.  This means figure out which lines are going
	   to go where; basically how the screen should look when the
	   redisplay is done, without actually doing any output to the
	   screen.

	2) Compare this layout to the layout of the screen after the last
	   redisplay.  This boils down to comparing lines to see if the
	   same lines are in the same place as before.  When you scroll,
	   all the lines will be off by how ever many lines you scrolled
	   by.  This is easy to detect, and when you do, you just use the
	   terminals insert/delete line capability to make the redisplay
	   faster.

	3) Repaint the lines which are different from before.  Lines are
	   different either because they have been changed, or because
	   they weren't even visible before.

So how do you actually compare lines quickly?  You can't do a string
comparison, because the scrolling algorithms have to do lots of
comparisons.  With a linked list approach to storing a buffer, you can
just compare the line pointers (a simple integer comparison!) in phase 2
(above).  So it's nice and fast.  But what do you do for buffer gap
implementations?  Well there are a couple of solutions, and I can tell
you two of them.  The first is Gosling's approach (which GNU still uses
to a certain extent) which is to calculate a hash value for each line
based on the contents of the line as it would appear on the screen, and
then using those hash values to do phase 2 of the redisplay.  This has
two bad problems in my opinion.  The first is that it takes a long time
to calculate those hash values, and second, since it's based on the
contents of the lines and not the physical position of those lines in the
file, the redisplay can be soooooo smart that it does unintuitive
things.  For instance, if you have duplicate code (shame shame shame) and
you type ^V to move forward a page, emacs will sometimes scroll BACKWARDS
on you!  It's very confusing, even if it is the fastest way to update the
screen.

Since it takes so long to calculate those hash values, Gosling and
Stallman had to kludge up the redisplay algorithm to try and optimize the
90% case of just inserting characters.  That's a shame and isn't
necessary.

I am not sure if I am allowed to describe the fast way to do this, but if
you are interested, I will see whether my managers will complain.

Anyway, so I hope this helps people understand buffer gap.  I can't
imagine writing an editor any other way, even though it does have it's
disadvantages.