[comp.editors] REPOST: editor.101

ant@mks.com (Anthony Howe) (06/04/91)

From: jhallen@wpi.wpi.edu (Joseph H Allen)
Subject: Editor 101

>I like many others have been writing my own editor based what I like.
>The code is not as well laid out as I would like but then that is 
>careless designing on my part. It would be nice if there was some
>sort of reference that suggests how to implement all the basic
>structures and operations necessary for an editor.

As far as I know there isn't any text-book for editors.  I have seen various
articles about text editors and also a few books on related issues:  There's
lots of things about string searches and a few theory books about the snobol
language (which encounters similer issues). 

I've made several text editors and I'm in the process of making another one. 
This most recent one is my emacs replacement so I've been doing quite a bit of
thinking and researching (if its going to replace emacs, it had better be
pretty good :).  I'm in the mood today so I'll try to show some of the thought
processes and research I've made.  (** Warning, I'm not a licensed CS major
so these may not be 'official' theories **)  Perhaps if I start, others will
expand/correct what I say.

>	editor buffer data structures - multiple file buffers

GAP BUFFER

I've seen two ways people have done this and know of at least 2 other
approaches.  One way is the gap buffer of my previous article.  In the editor
I used that in, I didn't start out using a gap buffer.  Originally, I had a
large array to hold the file in and a small array which held a small piece of
the large array in which the inserting and deleting occured.  Moving the point
involved saveing (and possibly making space in the large array for) the small
array and loading a new one.  When I found out about the gap method I rewrote
the buffer manager.  Both my method and the gap method work on a similer
method but the gap method is much more refined.  If anyone knows, I'd be
interested in hereing who invented the gap buffer.  Gnu-emacs uses a gap
buffer. 

The gap buffer has the best performance if you can insure that the gap is
always at the cursor.  If this is guarenteed, then the only time that there is
ever a delay is when the gap gets completely filled- on unix systems, you have
to realloc the block the buffer is in and make a new gap.  On systems where
you can access the far ends of the memory, you can prolong this as much as
possible by starting out with the largest possible gap.  This is what I've
seen done on machines without memory management (micros) and on machines with
really good memory management (the DEC 20, where to allocate memory you just
access it).

If the editor has multiple windows onto the same file it may not be possible
to insure that the gap is always at the cursor.  There is a tradeoff to think
about here.  You can do two things:  move the gap with the cursor into the
other window.  Then there is a delay here, but the gap is always with the
cursor.  The second way is to move the gap only when you are about to insert
or delete something.  This is not a bad idea since you usually only edit in
one window and only look at the others.  You CAN have more than one pointer to
the buffer but you can't have more than one gap (well, you could have more
than one gap but it would be very, very messy to manage).

I don't like the gap buffer for editors with multiple windows because of the
gap motion delays.  In particular, if you're on a virtual memory system, much
of the buffer might be swapped out to the disk.  Then if you have to move the
gap there has to be a lot of disk accesses.  Still, the gap editor does use
the least amount of cycles (remember the things you do most often in an editor
is type single characters and move the cursor short distances), so if you're
most interested in not loading down the system...

IBM METHOD

One method which I suspect many old IBM machines use is to store the text in
an array of fixed length arrays.  This is a leftover from 80 column puch
cards.

LINKED LIST OF LINES BUFFER

The next most common buffer data structure I've seen is the doubly linked list
of lines buffer.  In this case you store each line in a structure:

struct line
 {
 struct line *next; 	/* Link to other lines */
 struct line *prev;
 unsigned char size;	/* Size of 'text' */
 unsigned char bksize;	/* Size of malloc block this is in */
 char text[];
 };

Each line contains only a small amount of text so you can insert and delete
using brute force without any noticable delays (although this will use 20
times as many cycles as the gap buffer).

This buffer scheme eliminates the large delays of the gap method and also
eliminates some of the delays at higher levels in the editor (you don't have
to search for ends of lines any more), but it also has some problems.  Most
significantly, the line length is limited and the file must be processed when
it's loaded and stored.  Also, splitting lines and doing cut & paste become
slightly more complicated and the memory overhead is quite a bit higher (50%
higher).  However, even with these problems, many people choose it for its
simplicity and speed- brief, jove, vedit, vi, and many more use this method.  

A slight variation of this scheme can ease some of its problems.  Instead of
attaching the headers to the text, you can have them seperate:

struct line
 {
 struct line *next;
 struct line *prev;
 unsigned char size;		/* Size of text */
 unsigned char bksize;		/* Size of block text is in */
 char *text;
 };

If the headers are seperate like this then you can do things a little
differently.  Instead of loading files in small pieces (or in seperate
buffers) you can load them in contiguously, the way you can with the gap
buffer.  You still have to go through this loaded text and find where all the
lines are and make the headers, but this isn't so bad.  An added benefit of
this is that there will be bit less virtual memory swapping when you move
through a lot of lines without looking at the text (this requires that the
headers be packet together away from the text).

If you are a truely masachistic programmer and don't have any memory problems,
you can solve the line length limit by having a linked list of line segments
within each line.  The code for handling this is quite complex.

I don't particularly like this buffer scheme either because of the line length
limits and the huge memory overhead.  A header will contain 14 bytes and a
typical line has 20 characters in it- almost 100% more memory is needed
for this method than the gap buffer.

LINKED LIST BUFFER

Another buffer scheme similer to the linked list of lines buffer is a linked
list of blocks- I.E., ignore where the lines begin:

struct block
 {
 struct block *next;
 struct block *prev;
 unsigned char size;
 unsigned char bksize;
 char *text;
 };

This method will also need at least 14 bytes in each header, but you can start
out editing with each block containing a large amount of text (a full 256
characters, say).  This solves the memory problem because it will only use
about 5% more memory (14/256) than the gap buffer.  But you will have to
search for line-ends as with the gap buffer.

BUFFER HINTS

With the linked list of lines buffer and the linked list buffer, you can store
hints (extra information) in the headers.  For the linked-list of lines
buffer these might include the column width of the line and which attributes
(if the editor is a word processor) became active in this line (as an other
way of not going through all the text, only the headers).  For the linked list
buffer, these might include the number of lines contained in the text block
and the column width of this block.  Again, so that you don't have to look at
as much text.

VIRTUAL MEMORY TECHNIQUES

A good editor will have a software virtual memory manager.  Even if the
machine has hardware virtual memory, the process size limit is usually too
small (anything smaller than the largest file allowed on the file system is
too small- and then you want to be able to edit more than one file at once...).

There are two general techniques for handling virtual memory.  You could
either have a very seperate virtual memory handler in which you have to
translate virtual addresses to physical ones (or machine virtual ones) or have
the virtual memory as an integral part of the buffer scheme. 

When the gap buffer is employed, the virtual memory handler is usually part of
the buffer scheme.  One way of doing this is to break the file being edited
into many small files.  You will only have one of these loaded at a time in
the gap buffer arrangement.  Usually there will be a set of interface
functions between the buffer and the rest of the editor (fmgetc, fminsc, and
fmdel of my gap buffer for example).  Each of these has to be made to test for
when the end of or the beginning of the buffer is reached and then swap to a
new buffer.  The more complex buffer functions get a little messy.  My fmsave
function which writes a number of characters out to a file beginning with the
point would have to go through all the virtual memory files which contain data
to be saved.  A doubly linked list structure could be used to manage all the
files and would allow new files to be inserted relatively easily.

An important concept is to have two files for one whole buffer in memory. 
This way, when the end or beginning of the buffer is reached you only outload
half of it.  If this isn't done, there will be one position in the file where
the disk will be accessed whenever you move around it.  Also, if the screen
update algorithm reads the screen from the buffer for each key...

I think (I'm not sure) that this form of virtual memory was used often in the
past for editors such as edt, teco and WordStar.  It's not terribly
appropriate when there are multiple windows since going between windows would
require swapping (I theorize that this is why WordStar never had multiple
windows).  This method is very easy to implement and completely hides the
virtual memory from higher levels.

The second general method uses a virtual memory system which sits below the
buffer scheme.  It's usually used with the linked list methods since these
don't require large contiguous blocks.  Probably the best way of doing this is
to use a linked list method with the headers seperate from the text.  Then you
store all the headers in one virtual memory file and the text in another. 
This is nice bacause to load a file all you have to do is copy it into ram or
if it doesn't fit, copy it into a virtual memory file.  Then there would be a
set of functions to translate virtual addresses to physical ones:

	void *lock(struct vfile *file,unsigned long addr);
	void unlock(struct vfile *file,unsigned long addr);
	void change(struct vfile *file,unsigned long addr);

'lock' would translate the virtual address into a physical one and lock that
bit of virtual memory into physical memory so that it could can be accessed. 
Then when accessing it is done, 'unlock' is called.  If the memory was changed
while it was locked, 'change' is called to indicate that the block is dirty
and should be written out the virtual memory file.

The virtual memory file should be broken into pages.  Each page of the virtual
memory file will have header like this when it's loaded into physical memory:

struct vpage
 {
 struct vpage *next;	/* linked list of pages */
 VFILE *owner;		/* points to owner VFILE structure */
 unsigned long addr;	/* Offset from beginning of file where this block is*/
 int locks;		/* number of locks on this page */
 int changed;		/* set if this block must be written out */
 char data[16384];	/* data loaded from virtual file */
 };

the vfile structure would be something like this:

struct vfile
 {
 unsigned long size;	/* Size of virtual memory file */
 char *name;		/* File name (or 0 if doesn't exist) */
 int fd;		/* Descriptor (-1 if file not open) */
 };

The pages which are loaded into physical memory are stored in singly linked
list.  When 'lock' is called, it first searches this list to see if the page
is already loaded.  If it is, 'lock' will return a pointer into the data part
of the vpage (plus whatever offset within that page is requested) and move the
vpage to the head of the list.  This is done since 'lock' should be called for
the same page often.  Lock will also increment the 'locks' count in vpage (and
unlock will decrement it) to prevent it from beBing swapped out.  When there is
no more memory to load new pages, 'lock' will have to reuse old unused pages
(ones which don't have 'change' or 'locks' set).  If there are no reuseable
pages, then 'lock' must swap out 'changed' pages which arn't locked.  It's
probably a good idea to swap out all the 'changed' pages (which arn't locked)
when this condition occures because usually more pages are about to be loaded
in when one page is loaded in.  If all the pages arn't swapped out, the disk
head will move between the file being loaded and the swap file for each block
which is slow (and the disk drive makes a funny sound). 

The swap file will not be needed if there is enough room for the file be
loaded completely into memory.  Therefore, the comments in the vfile
structure.  The temporary swap file will be created when there's no more
memory.  [Also, it's a good idea to have linked list of the vfiles.  This way
if you run out of file descriptors (you can only have 32 open files in unix)
you can close all the swap files (which should automatically reopen when
needed)]

Unless you want to implement a complete malloc system in the virtual files,
it's a good idea to make the virtual memory block some multiple of the size of
both the headers and the text.  (make the headers 16 bytes, the text 256 and
the virtual block some large power of 2).  You also don't want to have text
overlap between blocks.

If you use linked list of lines, you will have to implement a malloc for the
virtual memory files because it's too wastefull to have fixed sized blocks in
the virtual memory when lines are typically only 10% of their maximum length.

I'm not completely sure, but I think vedit and me use an malloc system in a
paged virtual memory.

POINTERS

Usually there will be several pointers into a buffer.  One for each window,
one or two for block beginning and ends and the cursor.  These pointers should
be placed in a linked list so that the low level insert/delete functions can
update them.  If this isn't done then all pointers after inserts and deletes
will move relative to the text they point to.  I personally think you should
keep a lot of data in the pointers: 

struct point
 {
 struct point *next;
 unsigned row;
 unsigned column;	/* column number from beginning of line */
 unsigned byte;
 unsigned byte_line;	/* byte offset from beginning of line */
 
 /* If linked list method is used */
 unsigned long header;	/* virtual address of header */
 unsigned offset;	/* offset into text of header */

 /* If this is a wordprocessor */
 unsigned attributes;
 unsigned right_margin;
 etc...
 };

All of this data should be updated for each pointer whenever pointers move and
whenever inserts/deletes occure.  Updating this data when doing
inserts/deletes also saves you from having to check for bounds problems (I.E.,
if a pointer moved past the end of a file, or if one of the text blocks got
deleted and so 'header' is no longer valid).

Pointers should be the standard positioning objects within the editor.  The
basic editing functions should use pointers as their arguments:

	cursor_down(struct point *pointer);
	cursor_right(struct point *pointer);
	struct point *dup_point(struct point *pointer);

	save(file,struct point *from,struct point *to);
	insert_file(file,struct point *position);

	etc...

>	cut and paste problems

My favorite way of doing cut/paste is to do cut & paste with the file saving
and inserting functions.  This way you can cut & paste between edit sessions. 
Although, on smaller systems you'd certainly want to have a memory buffer for
cutting and pasting so that you don't always have to access slow disk drives. 
Cutting & pasting also require you to update all the pointers which follow the
cur/paste position.

>	undo issues

The key to undo is to have a standard set of buffer primatives (insert, delete
and pointer motion).  When any of these is executed, the converse is stored in
an undo buffer of some type.  If the linked list methods are used it should
be possible to keep the undo data right there (I.E., a deleted block is already
in the form of a linked list, so all you have to do is clean up the ends).

Undo should also always have a Redo... incase you undo too much.  Also I like
the idea of having an undo/redo system that works with just the positions.  So
you can undo to the previous position and redo back to the original one.  The
points could be stored whenever text is changed at a position.

>	displaying and editing around tabs
>	screen display issues
>	word wrap vs horizontal scrolling

There are two approaches to screen updates.  The first is to have all the edit
functions update the screen when they are executed.  Editors that do this
usually use few CPU cycles but are very complicated and arn't too
customizable.  vi is such an editor.  The second approach is to have the
screen update completely seperate from the edit functions.  This is the nice
modular way of making an editor which can be customized by people who don't
know the complete screen update processes.  The general algorithm for updating
the screen is this:  after each edit function is executed (each key is
pressed) compare the data on the screen (or data in an array which is a copy
of the screen) with the buffer and write any characters which don't match to
the screen (if there are multiples windows, compare them all in case there is
more than one on the same buffer).  For micros, this is usually all that is
needed since the screen is accessable as normal ram and this process is
reasonably fast.  The screen update function might be made suspendable if the
screen update takes longer then you can type.  If any keys are hit will the
screen is being updated, then you execute that edit function and start the
screen update over again afterwards.  

To handle tabs, control characters and other multi-character symbols two
approaches can be taken.  The first is to just not handler them.  Convert the
tabs into spaces and display the control characters in inverse video instead
of as ^A.  If you want to handle them then you need a translation table:

struct
 {
 int width;		/* Column width of sequence to display */
 char *sequence;	/* Sequence to display */
 }
 table[256];

Then when you compare the buffer you go through this table first.  'width'
could be set to '-1' to indicate special processing and 'sequence' could
contain a function to do it.  For exmaple, if width is -1 then sequence points
to a tab handler which takes the current position being updated (a pointer)
and returns a sequence of characters which should be displayed back to the
screen update algorithm (I.E., 1 to 8 spaces, the number depending on the
column position).

TERMINALS

A whole book could probably be written about how to update terminal screens
going through slow serial ports.  In general there are two parts to this. 
First, there should be a way to optomize setting the cursor position since
there are usually many ways to do this (CR, BS output a character, etc...). 
The second is to try to find instances of terminal control sequences which
make the terminal screen look more like the buffer.  These include scrolling
regions and line inserts/deletes and the methods of detecting when they can be
used (Gosling's famous algorithm).  Also included are easier sequences to
handle:  clear to the end of lines, clear to end of screen etc.  Once these
sequences are used the brute force compare method previously described is
called to get anything missed. 

NETWORKS

I've found that to get the screen update suspendable over terminals on
networks requires that the editor know the baud rate of the terminal.  Then
whenever anything is sent to the terminal you must delay by the number of
characters sent times the baud time before sending anything else.  If this
isn't done, data builds up in network server buffers and the screen becomes
uninterruptable (hitting page down 5 times make the screen output 5 pages
insteade of only one).

INSIDE MULTI-CHARATCER SEQUENCES AND PAST THE ENDS OF LINES

I think most people prefer not to have the cursor change column position when
the row is changed.  A way to do this is to have a flag in the pointer
structure which indicates that the cursor is not at a legal column position. 
Then when the user types, space can be added to the ends of lines or the
column position could be changed to a legal one.  This is greatly superior to
the other things I've seen:  wordstar:  immediatly go to legal column
positions when rows change, edt: keep the cursor at the same byte offset from
beginnings of lines (the cursor will jump to random column positions when the
row changes if there are tabs or control characters), emacs & vi:  remember
the column position but move to a legal position anyway (and move back to the
old column position whenever possible), turbo pascal: ignore tabs.

ATTRIBUTES

Handling attributes in wordprocessors requires that attribute change
information be stored somewhere.  A nice place to put this information is in
the line/block headers.  Unless the attributes are inverting, you need to
store the new attribute along with what it was previously so that you can move
'up'. 

>	regular expression search/replace
>	folds

I'll leave these for others.  I have to research regular expression compiling
to say anything intelligent and I havn't encountered folds.  [plus this
article is getting just a bit long... :]  Also, I'd add:

	macro languages and customizability issues.

-- 
ant@mks.com                                                   Anthony C Howe 
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
"Fate favors fools, small children, and ships named Enterprise" - Riker

rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) (06/12/91)

The ghost of ant@mks.com (Anthony Howe) writes:
>From: jhallen@wpi.wpi.edu (Joseph H Allen)
>Subject: Editor 101

...

Ooof!  The spelling errors I made when I wrote that...   :-)

Another buffer technique I recently found is the "pieces" method from the
pointing editor (I can't remember the author's name).  As you edit, you break
a contiguous buffer up into pieces and add or delete characters from the edges
of the pieces.  Each piece is stored in a malloc-like structure in a virtual
memory file.  You have to maintain a list of all the pieces and search through
this list when you move out of a piece.  You also glue pieces together when
possible.  This is a neat technique, but it's complicated.

An important technique for this and other buffer methods is to have routines
which adjust pointer/size pairs as you move between pieces (or blocks or over
the gap).  This way, high level routines (like screen update) only have to
keep track of the pointer and size and can simply call a function to update
these when the beginning or end of a block is reached.  

Another usefull technique is to try to keep buffer pointers in the form of a
byte offset from the beginning of the file in the buffer.  There then needs to
be logical to physical translation functions but keeping the pointer this
simple will make implementing higher level functions much easier.
-- 
/*  rcarter@wpi.wpi.edu */      /* Amazing */             /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

ant@mks.com (Anthony Howe) (06/12/91)

rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes:
>Another buffer technique I recently found is the "pieces" method from the
>pointing editor (I can't remember the author's name).  As you edit, you break
 ^^^^^^^^^^^^^^^
What and where?  Could you find a reference.  This might make for another
article -- Editor.202 maybe.

>a contiguous buffer up into pieces and add or delete characters from the edges
>of the pieces.  Each piece is stored in a malloc-like structure in a virtual
>memory file.  You have to maintain a list of all the pieces and search through
>this list when you move out of a piece.  You also glue pieces together when
>possible.  This is a neat technique, but it's complicated.

This sounds like a variation of the scratch file method described in
"Software Tools" chapter 6.  (I'm hoping to write an Editor.201 article
based on this chapter later this summer.)

>An important technique for this and other buffer methods is to have routines
>which adjust pointer/size pairs as you move between pieces (or blocks or over
>the gap).  This way, high level routines (like screen update) only have to
>keep track of the pointer and size and can simply call a function to update
>these when the beginning or end of a block is reached.  

Why sizes?  (I'm a bit slow this morning)

>Another usefull technique is to try to keep buffer pointers in the form of a
>byte offset from the beginning of the file in the buffer.  There then needs to
>be logical to physical translation functions but keeping the pointer this
>simple will make implementing higher level functions much easier.

This technique is fine for the Buffer Gap Scheme, but I don't see how this
would apply in a Link List Scheme.  Link List and Pieces would require that 
pointers become structures suitable for the scheme n'est pas?  

- ant
-- 
ant@mks.com                                                   Anthony C Howe 
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
"Fate favors fools, small children, and ships named Enterprise" - Riker

ant@mks.com (Anthony Howe) (06/12/91)

>/*  rcarter@wpi.wpi.edu */      /* Amazing */             /* Joseph H. Allen */
>int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
>+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
>]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

Wow.  A maze generator.  Is it yours?


-- 
ant@mks.com                                                   Anthony C Howe 
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
"Fate favors fools, small children, and ships named Enterprise" - Riker

rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) (06/13/91)

The ghost of ant@mks.com (Anthony Howe) writes:
>rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes:
>>Another buffer technique I recently found is the "pieces" method from the
>>pointing editor (I can't remember the author's name).  As you edit, you break
> ^^^^^^^^^^^^^^^
>What and where?  Could you find a reference.  This might make for another
>article -- Editor.202 maybe.

<msdos.editor>pt20pc.zip on simtel20 and its mirrors.  It's from Charlie
Crowly.  I don't think there's a unix version. (Asside from the buffer
technique, I don't like it since it requires that you use the mouse for
everything).

>>a contiguous buffer up into pieces and add or delete characters from the edges
>>of the pieces.  Each piece is stored in a malloc-like structure in a virtual
>>memory file.  You have to maintain a list of all the pieces and search through
>>this list when you move out of a piece.  You also glue pieces together when
>>possible.  This is a neat technique, but it's complicated.

>This sounds like a variation of the scratch file method described in
>"Software Tools" chapter 6.  (I'm hoping to write an Editor.201 article
>based on this chapter later this summer.)

One problem with this is that you have to seek to the correct piece.  If, as
was done in the pointing editor, you keep a linked list of headers for the
pieces, then editing will slow down as more pieces are made.  Another is that
the virtual memory system needs an malloc.

With the scratch file technique, do you combine the files together when you
reach some maximum?  Or am I misunderstanding it...

These buffering techniques remind me of "transaction files" for sorted
databases.  You add new records to the transaction file and update the entire
database only periodically.  But when reading the database, you always have to
search (usually sequentially) through the transaction file before doing a
binary search on the main database. 

>>An important technique for this and other buffer methods is to have routines
>>which adjust pointer/size pairs as you move between pieces (or blocks or over
>>the gap).  This way, high level routines (like screen update) only have to
>>keep track of the pointer and size and can simply call a function to update
>>these when the beginning or end of a block is reached.  

>Why sizes?  (I'm a bit slow this morning)

Oh sorry, I mean the number of characters left at the pointer.  If its a
linked-list editor, then the pointer will be in a small block and the size has
the amount remaining in it.

>>Another usefull technique is to try to keep buffer pointers in the form of a
>>byte offset from the beginning of the file in the buffer.  There then needs to
>>be logical to physical translation functions but keeping the pointer this
>>simple will make implementing higher level functions much easier.

>This technique is fine for the Buffer Gap Scheme, but I don't see how this
>would apply in a Link List Scheme.  Link List and Pieces would require that 
>pointers become structures suitable for the scheme n'est pas?  

You have to have some other structure that lets you quickly seek to the proper
entry.  For example, have blocks with pointer/size pairs in them in the
virtual memory system which refer to blocks containing text (you can have
multiple levels of these). Then you can standard tree techniques (btree
techniques?) to adjust and balance the trees as you insert/delete.  

To make it practicle (you don't want to have to seek through all this for
every byte you access (except in the gap editor)), have a tiny cache
containing the last few seeks.
-- 
/*  rcarter@wpi.wpi.edu */      /* Amazing */             /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (06/13/91)

In article <1991Jun12.202509.14825@wpi.WPI.EDU> rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes:
>The ghost of ant@mks.com (Anthony Howe) writes:
>>rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes:
>>>a contiguous buffer up into pieces and add or delete characters from the edges
>>>of the pieces.  Each piece is stored in a malloc-like structure in a virtual
>>>memory file.  You have to maintain a list of all the pieces and search through
>>>this list when you move out of a piece.  You also glue pieces together when
>>>possible.  This is a neat technique, but it's complicated.
>
>One problem with this is that you have to seek to the correct piece.  If, as
>was done in the pointing editor, you keep a linked list of headers for the
>pieces, then editing will slow down as more pieces are made.  Another is that
>the virtual memory system needs an malloc.
>
>With the scratch file technique, do you combine the files together when you
>reach some maximum?  Or am I misunderstanding it...

This sounds a lot like the technique I used in Elvis.  Of course, my
perceptions are biased so I may be seeing similarities that aren't there,
but...

Elvis uses a temporary file for storing the edit buffer.  This file is
divided into blocks of 1k bytes apiece.  Each block contains one or
more whole lines; lines are not allowed to straddle a block boundary.
Each line is terminated with newline character.  Empty space (at the
end of each block) is filled with NUL characters.

The physical order of the blocks is irrelevent; Elvis maintains a table
that maps logical block numbers into physical block numbers.  When new text
is inserted into the buffer, there is a chance that one of the old blocks
will need to be split in two; if this happens, a new block will be allocated
at the end of the scratch file, and the map table will be modified to
logically map that block into the middle of the edit buffer.

>You have to have some other structure that lets you quickly seek to the proper
>entry.  For example, have blocks with pointer/size pairs in them in the
>virtual memory system which refer to blocks containing text (you can have
>multiple levels of these). Then you can standard tree techniques (btree
>techniques?) to adjust and balance the trees as you insert/delete.  

Elvis uses its map table, plus another table that stores the line
number of the last line in each block.  To locate a given line, you
would scan the line-number table to determine which logical block
contained the line, then use the block map table to convert the logical
block number to a physical block number, read the block, and then scan
through the block for the desired line.

It's a lot faster than it sounds!

>To make it practicle (you don't want to have to seek through all this for
>every byte you access (except in the gap editor)), have a tiny cache
>containing the last few seeks.

The two tables are stored in RAM, so they can be scanned very quickly.
(The block map table is periodically written to the disk, though, to
allow for crash recovery.)  I have several levels of caching, which
allows elvis to skip most of the above steps for sequential accesses.

The tables are very small.  The block map table is an array of shorts,
and its total size is limited to 1 block -- 1K bytes, normally, which
translates into 512 entries.  This limits elvis to editing files that
are no larger than about half a megabyte.  With a 2K byte block, the
limit rises to two megabytes (twice as many entries, and each block
contains twice as much text).

By the way, elvis' most interesting source code is located in "modify.c"
(the functions for inserting, deleting, and replacing text), with other
fun stuff in "blk.c" (the block cache, and 'undo' support) and "tmp.c"
(loading and saving text).

Now, does anybody want to hear *why* I did it this way?
-------------------------------------------------------------------------------
Steve Kirkendall     kirkenda@cs.pdx.edu      Grad student at Portland State U.

ant@mks.com (Anthony Howe) (06/13/91)

rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes:
>With the scratch file technique, do you combine the files together when you
>reach some maximum?  Or am I misunderstanding it...

Roughly the scratch file technique is somewhat similar to the
description of Elvis' buffering technique.

---longish summary----

I have a file, a paging area, and a scratch/page file.  I read the
file in line by line, building a line table of seek pointers for each
paged line.  So there are two layers the line table (liner) which sits
on top of a pager.

Every time I wish to view a line (redraw the screen) I would call the
line table primitives to lookup the seek pointer, which in turn would
see if that seek pointer is in memory or fetch it from the page file,
passing up the memory pointer of the null terminated string/line.

To delete a line, shuffle down the line table of seek pointers to
delete it.  To insert a line, shuffle up the table to make room for
the seek pointer entry returned from the pager.  To change a line just
replace the old seek pointer with the new seek pointer.

Every request to page out a portion of memory will just append the
block to the end of the page file/memory structure.  So the current
edit line would be read into page memory, copied to an editor buffer,
modified, then  paged out.  All this would be layered: editor->
liner-> pager.

The scratch file technique (described in "Software Tools" by K&P,
chapter 6) is implemented via the pager.  In my implementation the
pager has been embellished with memory caches for the pages, which I
picked up from a CS course a couple of years ago.

(See I told you it was similar to the Elvis buffer description).

Now, the advantages of this.

1) Undo is implemented by remembering the current line table of seek
pointers before updating/inserting/deleting a line.  Also multiple
levels of undo could be maintained since the page file should have
every change since the start of the session.

2) It should be possible to edit files larger than memory (which on
computers with less than a mega of RAM is an advantage) giving limited
virtual memory.  Also if the pager and liner modules are careful, the
editor could edit either binary or text.

3) And the advantage I really like is I can use the seek pointers as
unique line identifiers for use in a screen update algorithum similar
to the one described in Miller's book "A Software Tools Sampler".

4) With a bit more smarts, the line table of seek pointers could also
be paged providing the final element of virtual memory.

5) If your page memory is large enough then most edit sessions can be
kept entirely in memory for fast access (since most source modules should
be reasonable chunks (<30k is reasonable, 50k is gross, 100k is crazy)).  
Memory shuffling would be limited in most cases (I would hope).

Of course there may be a physical limitation on the seek pointers,
which would mean the line table doing garbadge collection on the page
file (not at all for most reasonable edit tasks; unreasonable might be
editing 900k a dictionary for a 5 hour session).  If the line table
and pager are gerneral enough, you create another page file, read in
the current lines from the current page file, page then out to the new
file and update the line tabel with the new seek pointer into the new
file, lastly dispose of the now obsoleted page file.

The EDITOR.201 (Virtual Memory Technique) article I plan for this summer 
should provide a bit more detail and code fragments.  

- ant

-- 
ant@mks.com                                                   Anthony C Howe 
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
"Fate favors fools, small children, and ships named Enterprise" - Riker

cortesi@informix.com (David Cortesi) (06/13/91)

In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>
>Elvis uses a temporary file for storing the edit buffer.  This file is
>divided into blocks of 1k bytes apiece.  Each block contains one or
>more whole lines; lines are not allowed to straddle a block boundary.

Which means that no line may exceed the block size, or 1K.  Oh,
yuck! I think poorly of editors that put arbitrary limits on line
size.  Let me give you a ferinstance: Frame on the NeXT has this
little bug in the PostScript it generates sometimes, and you can
easily work around it by editing its PostScript output file to
delete a line. But Frame also, sometimes, generates perfectly
legitimate PostScript lines that are 3, 4, 5 kilobytes without a
newline. (PostScript printers do not put arbitrary limits on line
lengths.)  So when I edit one of these doobies with vi, vi will
quit about halfway through the file, croaking "line too long."

The Edit application distributed with the NeXT, however, reads the same
file just fine; it doesn't care how long lines may be.
>
>Now, does anybody want to hear *why* I did it this way?
>
Sure, what's your feeble excuse? (;-)

ant@mks.com (Anthony Howe) (06/14/91)

cortesi@informix.com (David Cortesi) writes:
>In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>
>>Elvis uses a temporary file for storing the edit buffer.  This file is
>>divided into blocks of 1k bytes apiece.  Each block contains one or
>>more whole lines; lines are not allowed to straddle a block boundary.
>
>Which means that no line may exceed the block size, or 1K.  Oh,
>yuck! I think poorly of editors that put arbitrary limits on line
>size.  Let me give you a ferinstance: Frame on the NeXT has this
[...]

A good liner/pager manager should be able to force line breaks at reasonable
locations like after 1K provided that when the file is saved again that
<newline> characters are NOT added.  I remember mention that Elvis stored
the <newline> in the 1K block.  Does that mean if Elvis splits a long
line it appends a <newline> or it just treats it as a logical <newline>?

If it appends a <newline>, then double yuck!

>The Edit application distributed with the NeXT, however, reads the same
>file just fine; it doesn't care how long lines may be.


-- 
ant@mks.com                                                   Anthony C Howe 
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
"Fate favors fools, small children, and ships named Enterprise" - Riker

kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (06/15/91)

In article <1991Jun13.154125.6307@informix.com> cortesi@informix.com (David Cortesi) writes:
>In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>
>>lines are not allowed to straddle a block boundary.
>
>Which means that no line may exceed the block size, or 1K. 
>>
>>Now, does anybody want to hear *why* I did it this way?
>>
>Sure, what's your feeble excuse? (;-)

I wanted elvis to be compatible with vi, of course!



Well, now I need some filler lines to keep Pnews happy.
Umm...
Hey, how 'bout the Bulls winning the Championship?
Is anybody outside La-la land disappointed?
Feels good to see Jordan get his ring.
Its about time.
And Pippen is getting a little glory, for a change.
I'm also kinda relieved that Wild Bill Cartwright didn't
bite off anybody's ear, or anything like that.
-------------------------------------------------------------------------------
Steve Kirkendall     kirkenda@cs.pdx.edu      Grad student at Portland State U.