ant@mks.com (Anthony Howe) (06/04/91)
From: jhallen@wpi.wpi.edu (Joseph H Allen) Subject: Editor 101 >I like many others have been writing my own editor based what I like. >The code is not as well laid out as I would like but then that is >careless designing on my part. It would be nice if there was some >sort of reference that suggests how to implement all the basic >structures and operations necessary for an editor. As far as I know there isn't any text-book for editors. I have seen various articles about text editors and also a few books on related issues: There's lots of things about string searches and a few theory books about the snobol language (which encounters similer issues). I've made several text editors and I'm in the process of making another one. This most recent one is my emacs replacement so I've been doing quite a bit of thinking and researching (if its going to replace emacs, it had better be pretty good :). I'm in the mood today so I'll try to show some of the thought processes and research I've made. (** Warning, I'm not a licensed CS major so these may not be 'official' theories **) Perhaps if I start, others will expand/correct what I say. > editor buffer data structures - multiple file buffers GAP BUFFER I've seen two ways people have done this and know of at least 2 other approaches. One way is the gap buffer of my previous article. In the editor I used that in, I didn't start out using a gap buffer. Originally, I had a large array to hold the file in and a small array which held a small piece of the large array in which the inserting and deleting occured. Moving the point involved saveing (and possibly making space in the large array for) the small array and loading a new one. When I found out about the gap method I rewrote the buffer manager. Both my method and the gap method work on a similer method but the gap method is much more refined. If anyone knows, I'd be interested in hereing who invented the gap buffer. Gnu-emacs uses a gap buffer. The gap buffer has the best performance if you can insure that the gap is always at the cursor. If this is guarenteed, then the only time that there is ever a delay is when the gap gets completely filled- on unix systems, you have to realloc the block the buffer is in and make a new gap. On systems where you can access the far ends of the memory, you can prolong this as much as possible by starting out with the largest possible gap. This is what I've seen done on machines without memory management (micros) and on machines with really good memory management (the DEC 20, where to allocate memory you just access it). If the editor has multiple windows onto the same file it may not be possible to insure that the gap is always at the cursor. There is a tradeoff to think about here. You can do two things: move the gap with the cursor into the other window. Then there is a delay here, but the gap is always with the cursor. The second way is to move the gap only when you are about to insert or delete something. This is not a bad idea since you usually only edit in one window and only look at the others. You CAN have more than one pointer to the buffer but you can't have more than one gap (well, you could have more than one gap but it would be very, very messy to manage). I don't like the gap buffer for editors with multiple windows because of the gap motion delays. In particular, if you're on a virtual memory system, much of the buffer might be swapped out to the disk. Then if you have to move the gap there has to be a lot of disk accesses. Still, the gap editor does use the least amount of cycles (remember the things you do most often in an editor is type single characters and move the cursor short distances), so if you're most interested in not loading down the system... IBM METHOD One method which I suspect many old IBM machines use is to store the text in an array of fixed length arrays. This is a leftover from 80 column puch cards. LINKED LIST OF LINES BUFFER The next most common buffer data structure I've seen is the doubly linked list of lines buffer. In this case you store each line in a structure: struct line { struct line *next; /* Link to other lines */ struct line *prev; unsigned char size; /* Size of 'text' */ unsigned char bksize; /* Size of malloc block this is in */ char text[]; }; Each line contains only a small amount of text so you can insert and delete using brute force without any noticable delays (although this will use 20 times as many cycles as the gap buffer). This buffer scheme eliminates the large delays of the gap method and also eliminates some of the delays at higher levels in the editor (you don't have to search for ends of lines any more), but it also has some problems. Most significantly, the line length is limited and the file must be processed when it's loaded and stored. Also, splitting lines and doing cut & paste become slightly more complicated and the memory overhead is quite a bit higher (50% higher). However, even with these problems, many people choose it for its simplicity and speed- brief, jove, vedit, vi, and many more use this method. A slight variation of this scheme can ease some of its problems. Instead of attaching the headers to the text, you can have them seperate: struct line { struct line *next; struct line *prev; unsigned char size; /* Size of text */ unsigned char bksize; /* Size of block text is in */ char *text; }; If the headers are seperate like this then you can do things a little differently. Instead of loading files in small pieces (or in seperate buffers) you can load them in contiguously, the way you can with the gap buffer. You still have to go through this loaded text and find where all the lines are and make the headers, but this isn't so bad. An added benefit of this is that there will be bit less virtual memory swapping when you move through a lot of lines without looking at the text (this requires that the headers be packet together away from the text). If you are a truely masachistic programmer and don't have any memory problems, you can solve the line length limit by having a linked list of line segments within each line. The code for handling this is quite complex. I don't particularly like this buffer scheme either because of the line length limits and the huge memory overhead. A header will contain 14 bytes and a typical line has 20 characters in it- almost 100% more memory is needed for this method than the gap buffer. LINKED LIST BUFFER Another buffer scheme similer to the linked list of lines buffer is a linked list of blocks- I.E., ignore where the lines begin: struct block { struct block *next; struct block *prev; unsigned char size; unsigned char bksize; char *text; }; This method will also need at least 14 bytes in each header, but you can start out editing with each block containing a large amount of text (a full 256 characters, say). This solves the memory problem because it will only use about 5% more memory (14/256) than the gap buffer. But you will have to search for line-ends as with the gap buffer. BUFFER HINTS With the linked list of lines buffer and the linked list buffer, you can store hints (extra information) in the headers. For the linked-list of lines buffer these might include the column width of the line and which attributes (if the editor is a word processor) became active in this line (as an other way of not going through all the text, only the headers). For the linked list buffer, these might include the number of lines contained in the text block and the column width of this block. Again, so that you don't have to look at as much text. VIRTUAL MEMORY TECHNIQUES A good editor will have a software virtual memory manager. Even if the machine has hardware virtual memory, the process size limit is usually too small (anything smaller than the largest file allowed on the file system is too small- and then you want to be able to edit more than one file at once...). There are two general techniques for handling virtual memory. You could either have a very seperate virtual memory handler in which you have to translate virtual addresses to physical ones (or machine virtual ones) or have the virtual memory as an integral part of the buffer scheme. When the gap buffer is employed, the virtual memory handler is usually part of the buffer scheme. One way of doing this is to break the file being edited into many small files. You will only have one of these loaded at a time in the gap buffer arrangement. Usually there will be a set of interface functions between the buffer and the rest of the editor (fmgetc, fminsc, and fmdel of my gap buffer for example). Each of these has to be made to test for when the end of or the beginning of the buffer is reached and then swap to a new buffer. The more complex buffer functions get a little messy. My fmsave function which writes a number of characters out to a file beginning with the point would have to go through all the virtual memory files which contain data to be saved. A doubly linked list structure could be used to manage all the files and would allow new files to be inserted relatively easily. An important concept is to have two files for one whole buffer in memory. This way, when the end or beginning of the buffer is reached you only outload half of it. If this isn't done, there will be one position in the file where the disk will be accessed whenever you move around it. Also, if the screen update algorithm reads the screen from the buffer for each key... I think (I'm not sure) that this form of virtual memory was used often in the past for editors such as edt, teco and WordStar. It's not terribly appropriate when there are multiple windows since going between windows would require swapping (I theorize that this is why WordStar never had multiple windows). This method is very easy to implement and completely hides the virtual memory from higher levels. The second general method uses a virtual memory system which sits below the buffer scheme. It's usually used with the linked list methods since these don't require large contiguous blocks. Probably the best way of doing this is to use a linked list method with the headers seperate from the text. Then you store all the headers in one virtual memory file and the text in another. This is nice bacause to load a file all you have to do is copy it into ram or if it doesn't fit, copy it into a virtual memory file. Then there would be a set of functions to translate virtual addresses to physical ones: void *lock(struct vfile *file,unsigned long addr); void unlock(struct vfile *file,unsigned long addr); void change(struct vfile *file,unsigned long addr); 'lock' would translate the virtual address into a physical one and lock that bit of virtual memory into physical memory so that it could can be accessed. Then when accessing it is done, 'unlock' is called. If the memory was changed while it was locked, 'change' is called to indicate that the block is dirty and should be written out the virtual memory file. The virtual memory file should be broken into pages. Each page of the virtual memory file will have header like this when it's loaded into physical memory: struct vpage { struct vpage *next; /* linked list of pages */ VFILE *owner; /* points to owner VFILE structure */ unsigned long addr; /* Offset from beginning of file where this block is*/ int locks; /* number of locks on this page */ int changed; /* set if this block must be written out */ char data[16384]; /* data loaded from virtual file */ }; the vfile structure would be something like this: struct vfile { unsigned long size; /* Size of virtual memory file */ char *name; /* File name (or 0 if doesn't exist) */ int fd; /* Descriptor (-1 if file not open) */ }; The pages which are loaded into physical memory are stored in singly linked list. When 'lock' is called, it first searches this list to see if the page is already loaded. If it is, 'lock' will return a pointer into the data part of the vpage (plus whatever offset within that page is requested) and move the vpage to the head of the list. This is done since 'lock' should be called for the same page often. Lock will also increment the 'locks' count in vpage (and unlock will decrement it) to prevent it from beBing swapped out. When there is no more memory to load new pages, 'lock' will have to reuse old unused pages (ones which don't have 'change' or 'locks' set). If there are no reuseable pages, then 'lock' must swap out 'changed' pages which arn't locked. It's probably a good idea to swap out all the 'changed' pages (which arn't locked) when this condition occures because usually more pages are about to be loaded in when one page is loaded in. If all the pages arn't swapped out, the disk head will move between the file being loaded and the swap file for each block which is slow (and the disk drive makes a funny sound). The swap file will not be needed if there is enough room for the file be loaded completely into memory. Therefore, the comments in the vfile structure. The temporary swap file will be created when there's no more memory. [Also, it's a good idea to have linked list of the vfiles. This way if you run out of file descriptors (you can only have 32 open files in unix) you can close all the swap files (which should automatically reopen when needed)] Unless you want to implement a complete malloc system in the virtual files, it's a good idea to make the virtual memory block some multiple of the size of both the headers and the text. (make the headers 16 bytes, the text 256 and the virtual block some large power of 2). You also don't want to have text overlap between blocks. If you use linked list of lines, you will have to implement a malloc for the virtual memory files because it's too wastefull to have fixed sized blocks in the virtual memory when lines are typically only 10% of their maximum length. I'm not completely sure, but I think vedit and me use an malloc system in a paged virtual memory. POINTERS Usually there will be several pointers into a buffer. One for each window, one or two for block beginning and ends and the cursor. These pointers should be placed in a linked list so that the low level insert/delete functions can update them. If this isn't done then all pointers after inserts and deletes will move relative to the text they point to. I personally think you should keep a lot of data in the pointers: struct point { struct point *next; unsigned row; unsigned column; /* column number from beginning of line */ unsigned byte; unsigned byte_line; /* byte offset from beginning of line */ /* If linked list method is used */ unsigned long header; /* virtual address of header */ unsigned offset; /* offset into text of header */ /* If this is a wordprocessor */ unsigned attributes; unsigned right_margin; etc... }; All of this data should be updated for each pointer whenever pointers move and whenever inserts/deletes occure. Updating this data when doing inserts/deletes also saves you from having to check for bounds problems (I.E., if a pointer moved past the end of a file, or if one of the text blocks got deleted and so 'header' is no longer valid). Pointers should be the standard positioning objects within the editor. The basic editing functions should use pointers as their arguments: cursor_down(struct point *pointer); cursor_right(struct point *pointer); struct point *dup_point(struct point *pointer); save(file,struct point *from,struct point *to); insert_file(file,struct point *position); etc... > cut and paste problems My favorite way of doing cut/paste is to do cut & paste with the file saving and inserting functions. This way you can cut & paste between edit sessions. Although, on smaller systems you'd certainly want to have a memory buffer for cutting and pasting so that you don't always have to access slow disk drives. Cutting & pasting also require you to update all the pointers which follow the cur/paste position. > undo issues The key to undo is to have a standard set of buffer primatives (insert, delete and pointer motion). When any of these is executed, the converse is stored in an undo buffer of some type. If the linked list methods are used it should be possible to keep the undo data right there (I.E., a deleted block is already in the form of a linked list, so all you have to do is clean up the ends). Undo should also always have a Redo... incase you undo too much. Also I like the idea of having an undo/redo system that works with just the positions. So you can undo to the previous position and redo back to the original one. The points could be stored whenever text is changed at a position. > displaying and editing around tabs > screen display issues > word wrap vs horizontal scrolling There are two approaches to screen updates. The first is to have all the edit functions update the screen when they are executed. Editors that do this usually use few CPU cycles but are very complicated and arn't too customizable. vi is such an editor. The second approach is to have the screen update completely seperate from the edit functions. This is the nice modular way of making an editor which can be customized by people who don't know the complete screen update processes. The general algorithm for updating the screen is this: after each edit function is executed (each key is pressed) compare the data on the screen (or data in an array which is a copy of the screen) with the buffer and write any characters which don't match to the screen (if there are multiples windows, compare them all in case there is more than one on the same buffer). For micros, this is usually all that is needed since the screen is accessable as normal ram and this process is reasonably fast. The screen update function might be made suspendable if the screen update takes longer then you can type. If any keys are hit will the screen is being updated, then you execute that edit function and start the screen update over again afterwards. To handle tabs, control characters and other multi-character symbols two approaches can be taken. The first is to just not handler them. Convert the tabs into spaces and display the control characters in inverse video instead of as ^A. If you want to handle them then you need a translation table: struct { int width; /* Column width of sequence to display */ char *sequence; /* Sequence to display */ } table[256]; Then when you compare the buffer you go through this table first. 'width' could be set to '-1' to indicate special processing and 'sequence' could contain a function to do it. For exmaple, if width is -1 then sequence points to a tab handler which takes the current position being updated (a pointer) and returns a sequence of characters which should be displayed back to the screen update algorithm (I.E., 1 to 8 spaces, the number depending on the column position). TERMINALS A whole book could probably be written about how to update terminal screens going through slow serial ports. In general there are two parts to this. First, there should be a way to optomize setting the cursor position since there are usually many ways to do this (CR, BS output a character, etc...). The second is to try to find instances of terminal control sequences which make the terminal screen look more like the buffer. These include scrolling regions and line inserts/deletes and the methods of detecting when they can be used (Gosling's famous algorithm). Also included are easier sequences to handle: clear to the end of lines, clear to end of screen etc. Once these sequences are used the brute force compare method previously described is called to get anything missed. NETWORKS I've found that to get the screen update suspendable over terminals on networks requires that the editor know the baud rate of the terminal. Then whenever anything is sent to the terminal you must delay by the number of characters sent times the baud time before sending anything else. If this isn't done, data builds up in network server buffers and the screen becomes uninterruptable (hitting page down 5 times make the screen output 5 pages insteade of only one). INSIDE MULTI-CHARATCER SEQUENCES AND PAST THE ENDS OF LINES I think most people prefer not to have the cursor change column position when the row is changed. A way to do this is to have a flag in the pointer structure which indicates that the cursor is not at a legal column position. Then when the user types, space can be added to the ends of lines or the column position could be changed to a legal one. This is greatly superior to the other things I've seen: wordstar: immediatly go to legal column positions when rows change, edt: keep the cursor at the same byte offset from beginnings of lines (the cursor will jump to random column positions when the row changes if there are tabs or control characters), emacs & vi: remember the column position but move to a legal position anyway (and move back to the old column position whenever possible), turbo pascal: ignore tabs. ATTRIBUTES Handling attributes in wordprocessors requires that attribute change information be stored somewhere. A nice place to put this information is in the line/block headers. Unless the attributes are inverting, you need to store the new attribute along with what it was previously so that you can move 'up'. > regular expression search/replace > folds I'll leave these for others. I have to research regular expression compiling to say anything intelligent and I havn't encountered folds. [plus this article is getting just a bit long... :] Also, I'd add: macro languages and customizability issues. -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Fate favors fools, small children, and ships named Enterprise" - Riker
rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) (06/12/91)
The ghost of ant@mks.com (Anthony Howe) writes: >From: jhallen@wpi.wpi.edu (Joseph H Allen) >Subject: Editor 101 ... Ooof! The spelling errors I made when I wrote that... :-) Another buffer technique I recently found is the "pieces" method from the pointing editor (I can't remember the author's name). As you edit, you break a contiguous buffer up into pieces and add or delete characters from the edges of the pieces. Each piece is stored in a malloc-like structure in a virtual memory file. You have to maintain a list of all the pieces and search through this list when you move out of a piece. You also glue pieces together when possible. This is a neat technique, but it's complicated. An important technique for this and other buffer methods is to have routines which adjust pointer/size pairs as you move between pieces (or blocks or over the gap). This way, high level routines (like screen update) only have to keep track of the pointer and size and can simply call a function to update these when the beginning or end of a block is reached. Another usefull technique is to try to keep buffer pointers in the form of a byte offset from the beginning of the file in the buffer. There then needs to be logical to physical translation functions but keeping the pointer this simple will make implementing higher level functions much easier. -- /* rcarter@wpi.wpi.edu */ /* Amazing */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
ant@mks.com (Anthony Howe) (06/12/91)
rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes: >Another buffer technique I recently found is the "pieces" method from the >pointing editor (I can't remember the author's name). As you edit, you break ^^^^^^^^^^^^^^^ What and where? Could you find a reference. This might make for another article -- Editor.202 maybe. >a contiguous buffer up into pieces and add or delete characters from the edges >of the pieces. Each piece is stored in a malloc-like structure in a virtual >memory file. You have to maintain a list of all the pieces and search through >this list when you move out of a piece. You also glue pieces together when >possible. This is a neat technique, but it's complicated. This sounds like a variation of the scratch file method described in "Software Tools" chapter 6. (I'm hoping to write an Editor.201 article based on this chapter later this summer.) >An important technique for this and other buffer methods is to have routines >which adjust pointer/size pairs as you move between pieces (or blocks or over >the gap). This way, high level routines (like screen update) only have to >keep track of the pointer and size and can simply call a function to update >these when the beginning or end of a block is reached. Why sizes? (I'm a bit slow this morning) >Another usefull technique is to try to keep buffer pointers in the form of a >byte offset from the beginning of the file in the buffer. There then needs to >be logical to physical translation functions but keeping the pointer this >simple will make implementing higher level functions much easier. This technique is fine for the Buffer Gap Scheme, but I don't see how this would apply in a Link List Scheme. Link List and Pieces would require that pointers become structures suitable for the scheme n'est pas? - ant -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Fate favors fools, small children, and ships named Enterprise" - Riker
ant@mks.com (Anthony Howe) (06/12/91)
>/* rcarter@wpi.wpi.edu */ /* Amazing */ /* Joseph H. Allen */ >int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) >+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 >]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);} Wow. A maze generator. Is it yours? -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Fate favors fools, small children, and ships named Enterprise" - Riker
rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) (06/13/91)
The ghost of ant@mks.com (Anthony Howe) writes: >rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes: >>Another buffer technique I recently found is the "pieces" method from the >>pointing editor (I can't remember the author's name). As you edit, you break > ^^^^^^^^^^^^^^^ >What and where? Could you find a reference. This might make for another >article -- Editor.202 maybe. <msdos.editor>pt20pc.zip on simtel20 and its mirrors. It's from Charlie Crowly. I don't think there's a unix version. (Asside from the buffer technique, I don't like it since it requires that you use the mouse for everything). >>a contiguous buffer up into pieces and add or delete characters from the edges >>of the pieces. Each piece is stored in a malloc-like structure in a virtual >>memory file. You have to maintain a list of all the pieces and search through >>this list when you move out of a piece. You also glue pieces together when >>possible. This is a neat technique, but it's complicated. >This sounds like a variation of the scratch file method described in >"Software Tools" chapter 6. (I'm hoping to write an Editor.201 article >based on this chapter later this summer.) One problem with this is that you have to seek to the correct piece. If, as was done in the pointing editor, you keep a linked list of headers for the pieces, then editing will slow down as more pieces are made. Another is that the virtual memory system needs an malloc. With the scratch file technique, do you combine the files together when you reach some maximum? Or am I misunderstanding it... These buffering techniques remind me of "transaction files" for sorted databases. You add new records to the transaction file and update the entire database only periodically. But when reading the database, you always have to search (usually sequentially) through the transaction file before doing a binary search on the main database. >>An important technique for this and other buffer methods is to have routines >>which adjust pointer/size pairs as you move between pieces (or blocks or over >>the gap). This way, high level routines (like screen update) only have to >>keep track of the pointer and size and can simply call a function to update >>these when the beginning or end of a block is reached. >Why sizes? (I'm a bit slow this morning) Oh sorry, I mean the number of characters left at the pointer. If its a linked-list editor, then the pointer will be in a small block and the size has the amount remaining in it. >>Another usefull technique is to try to keep buffer pointers in the form of a >>byte offset from the beginning of the file in the buffer. There then needs to >>be logical to physical translation functions but keeping the pointer this >>simple will make implementing higher level functions much easier. >This technique is fine for the Buffer Gap Scheme, but I don't see how this >would apply in a Link List Scheme. Link List and Pieces would require that >pointers become structures suitable for the scheme n'est pas? You have to have some other structure that lets you quickly seek to the proper entry. For example, have blocks with pointer/size pairs in them in the virtual memory system which refer to blocks containing text (you can have multiple levels of these). Then you can standard tree techniques (btree techniques?) to adjust and balance the trees as you insert/delete. To make it practicle (you don't want to have to seek through all this for every byte you access (except in the gap editor)), have a tiny cache containing the last few seeks. -- /* rcarter@wpi.wpi.edu */ /* Amazing */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (06/13/91)
In article <1991Jun12.202509.14825@wpi.WPI.EDU> rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes: >The ghost of ant@mks.com (Anthony Howe) writes: >>rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes: >>>a contiguous buffer up into pieces and add or delete characters from the edges >>>of the pieces. Each piece is stored in a malloc-like structure in a virtual >>>memory file. You have to maintain a list of all the pieces and search through >>>this list when you move out of a piece. You also glue pieces together when >>>possible. This is a neat technique, but it's complicated. > >One problem with this is that you have to seek to the correct piece. If, as >was done in the pointing editor, you keep a linked list of headers for the >pieces, then editing will slow down as more pieces are made. Another is that >the virtual memory system needs an malloc. > >With the scratch file technique, do you combine the files together when you >reach some maximum? Or am I misunderstanding it... This sounds a lot like the technique I used in Elvis. Of course, my perceptions are biased so I may be seeing similarities that aren't there, but... Elvis uses a temporary file for storing the edit buffer. This file is divided into blocks of 1k bytes apiece. Each block contains one or more whole lines; lines are not allowed to straddle a block boundary. Each line is terminated with newline character. Empty space (at the end of each block) is filled with NUL characters. The physical order of the blocks is irrelevent; Elvis maintains a table that maps logical block numbers into physical block numbers. When new text is inserted into the buffer, there is a chance that one of the old blocks will need to be split in two; if this happens, a new block will be allocated at the end of the scratch file, and the map table will be modified to logically map that block into the middle of the edit buffer. >You have to have some other structure that lets you quickly seek to the proper >entry. For example, have blocks with pointer/size pairs in them in the >virtual memory system which refer to blocks containing text (you can have >multiple levels of these). Then you can standard tree techniques (btree >techniques?) to adjust and balance the trees as you insert/delete. Elvis uses its map table, plus another table that stores the line number of the last line in each block. To locate a given line, you would scan the line-number table to determine which logical block contained the line, then use the block map table to convert the logical block number to a physical block number, read the block, and then scan through the block for the desired line. It's a lot faster than it sounds! >To make it practicle (you don't want to have to seek through all this for >every byte you access (except in the gap editor)), have a tiny cache >containing the last few seeks. The two tables are stored in RAM, so they can be scanned very quickly. (The block map table is periodically written to the disk, though, to allow for crash recovery.) I have several levels of caching, which allows elvis to skip most of the above steps for sequential accesses. The tables are very small. The block map table is an array of shorts, and its total size is limited to 1 block -- 1K bytes, normally, which translates into 512 entries. This limits elvis to editing files that are no larger than about half a megabyte. With a 2K byte block, the limit rises to two megabytes (twice as many entries, and each block contains twice as much text). By the way, elvis' most interesting source code is located in "modify.c" (the functions for inserting, deleting, and replacing text), with other fun stuff in "blk.c" (the block cache, and 'undo' support) and "tmp.c" (loading and saving text). Now, does anybody want to hear *why* I did it this way? ------------------------------------------------------------------------------- Steve Kirkendall kirkenda@cs.pdx.edu Grad student at Portland State U.
ant@mks.com (Anthony Howe) (06/13/91)
rcarter@wpi.WPI.EDU (Randolph Carter (nee. Joseph H. Allen)) writes: >With the scratch file technique, do you combine the files together when you >reach some maximum? Or am I misunderstanding it... Roughly the scratch file technique is somewhat similar to the description of Elvis' buffering technique. ---longish summary---- I have a file, a paging area, and a scratch/page file. I read the file in line by line, building a line table of seek pointers for each paged line. So there are two layers the line table (liner) which sits on top of a pager. Every time I wish to view a line (redraw the screen) I would call the line table primitives to lookup the seek pointer, which in turn would see if that seek pointer is in memory or fetch it from the page file, passing up the memory pointer of the null terminated string/line. To delete a line, shuffle down the line table of seek pointers to delete it. To insert a line, shuffle up the table to make room for the seek pointer entry returned from the pager. To change a line just replace the old seek pointer with the new seek pointer. Every request to page out a portion of memory will just append the block to the end of the page file/memory structure. So the current edit line would be read into page memory, copied to an editor buffer, modified, then paged out. All this would be layered: editor-> liner-> pager. The scratch file technique (described in "Software Tools" by K&P, chapter 6) is implemented via the pager. In my implementation the pager has been embellished with memory caches for the pages, which I picked up from a CS course a couple of years ago. (See I told you it was similar to the Elvis buffer description). Now, the advantages of this. 1) Undo is implemented by remembering the current line table of seek pointers before updating/inserting/deleting a line. Also multiple levels of undo could be maintained since the page file should have every change since the start of the session. 2) It should be possible to edit files larger than memory (which on computers with less than a mega of RAM is an advantage) giving limited virtual memory. Also if the pager and liner modules are careful, the editor could edit either binary or text. 3) And the advantage I really like is I can use the seek pointers as unique line identifiers for use in a screen update algorithum similar to the one described in Miller's book "A Software Tools Sampler". 4) With a bit more smarts, the line table of seek pointers could also be paged providing the final element of virtual memory. 5) If your page memory is large enough then most edit sessions can be kept entirely in memory for fast access (since most source modules should be reasonable chunks (<30k is reasonable, 50k is gross, 100k is crazy)). Memory shuffling would be limited in most cases (I would hope). Of course there may be a physical limitation on the seek pointers, which would mean the line table doing garbadge collection on the page file (not at all for most reasonable edit tasks; unreasonable might be editing 900k a dictionary for a 5 hour session). If the line table and pager are gerneral enough, you create another page file, read in the current lines from the current page file, page then out to the new file and update the line tabel with the new seek pointer into the new file, lastly dispose of the now obsoleted page file. The EDITOR.201 (Virtual Memory Technique) article I plan for this summer should provide a bit more detail and code fragments. - ant -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Fate favors fools, small children, and ships named Enterprise" - Riker
cortesi@informix.com (David Cortesi) (06/13/91)
In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes: > >Elvis uses a temporary file for storing the edit buffer. This file is >divided into blocks of 1k bytes apiece. Each block contains one or >more whole lines; lines are not allowed to straddle a block boundary. Which means that no line may exceed the block size, or 1K. Oh, yuck! I think poorly of editors that put arbitrary limits on line size. Let me give you a ferinstance: Frame on the NeXT has this little bug in the PostScript it generates sometimes, and you can easily work around it by editing its PostScript output file to delete a line. But Frame also, sometimes, generates perfectly legitimate PostScript lines that are 3, 4, 5 kilobytes without a newline. (PostScript printers do not put arbitrary limits on line lengths.) So when I edit one of these doobies with vi, vi will quit about halfway through the file, croaking "line too long." The Edit application distributed with the NeXT, however, reads the same file just fine; it doesn't care how long lines may be. > >Now, does anybody want to hear *why* I did it this way? > Sure, what's your feeble excuse? (;-)
ant@mks.com (Anthony Howe) (06/14/91)
cortesi@informix.com (David Cortesi) writes: >In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes: >> >>Elvis uses a temporary file for storing the edit buffer. This file is >>divided into blocks of 1k bytes apiece. Each block contains one or >>more whole lines; lines are not allowed to straddle a block boundary. > >Which means that no line may exceed the block size, or 1K. Oh, >yuck! I think poorly of editors that put arbitrary limits on line >size. Let me give you a ferinstance: Frame on the NeXT has this [...] A good liner/pager manager should be able to force line breaks at reasonable locations like after 1K provided that when the file is saved again that <newline> characters are NOT added. I remember mention that Elvis stored the <newline> in the 1K block. Does that mean if Elvis splits a long line it appends a <newline> or it just treats it as a logical <newline>? If it appends a <newline>, then double yuck! >The Edit application distributed with the NeXT, however, reads the same >file just fine; it doesn't care how long lines may be. -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Fate favors fools, small children, and ships named Enterprise" - Riker
kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (06/15/91)
In article <1991Jun13.154125.6307@informix.com> cortesi@informix.com (David Cortesi) writes: >In article <2857@pdxgate.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes: >> >>lines are not allowed to straddle a block boundary. > >Which means that no line may exceed the block size, or 1K. >> >>Now, does anybody want to hear *why* I did it this way? >> >Sure, what's your feeble excuse? (;-) I wanted elvis to be compatible with vi, of course! Well, now I need some filler lines to keep Pnews happy. Umm... Hey, how 'bout the Bulls winning the Championship? Is anybody outside La-la land disappointed? Feels good to see Jordan get his ring. Its about time. And Pippen is getting a little glory, for a change. I'm also kinda relieved that Wild Bill Cartwright didn't bite off anybody's ear, or anything like that. ------------------------------------------------------------------------------- Steve Kirkendall kirkenda@cs.pdx.edu Grad student at Portland State U.