achowe@tiger.waterloo.edu (anthony howe) (02/04/90)
In article <7634@wpi.wpi.edu> jhallen@wpi.wpi.edu (Joseph H Allen) writes: >In article <13952@s.ms.uky.edu> pjl@ms.uky.edu (Paul Linton) writes: >>How would/do you 'store' the file in memory? I guess what i am asking is >>what is the data structure that would be used for a _very_ basic screen >>oriented editor. > >One of the simplest and most effecient ways of storing a file in memory is >with a gap buffer. The file is broken into three parts in one big array: The >beginning, the gap and the end. The beginning and end parts are just normal Is there a bible or standard text that has the of the theory behind the issue of roling your own editor worked out? I like many others have been writing my own editor based what I like. The code is not as well laid out as I would like but then that is careless designing on my part. It would be nice if there was some sort of reference that suggests how to implement all the basic structures and operations necessary for an editor. The reply to 'Simple Editor Question' was most interesting and showed me many of the primitives I could have used much like a curses package. I just wish that more of this information was seen in this group. If there is not a text anyone can refer me to, would readers of this group consider developing a serious of articles to discuss issues like: editor buffer data structures - multiple file buffers cut and paste problems displaying and editing around tabs emac-ish vs vi-ish screen display issues word wrap vs horizontal scrolling undo issues regular expression search/replace folds These are just some of the things I've faced in my learning process. It would be nice if we could put together something like EDITOR 101. - ant achowe@tiger.waterloo.edu | "It is hard to make the world go away _ -|-|_ _ | when it has decided to notice you." (_\ |\| | | | (_) |\| \/ | - Spock's World ___/ | disclaimer...
jhallen@wpi.wpi.edu (Joseph H Allen) (02/05/90)
In article <20518@watdragon.waterloo.edu> achowe@tiger.waterloo.edu (anthony howe) writes: >I like many others have been writing my own editor based what I like. >The code is not as well laid out as I would like but then that is >careless designing on my part. It would be nice if there was some >sort of reference that suggests how to implement all the basic >structures and operations necessary for an editor. As far as I know there isn't any text-book for editors. I have seen various articles about text editors and also a few books on related issues: There's lots of things about string searches and a few theory books about the snobol language (which encounters similer issues). I've made several text editors and I'm in the process of making another one. This most recent one is my emacs replacement so I've been doing quite a bit of thinking and researching (if its going to replace emacs, it had better be pretty good :). I'm in the mood today so I'll try to show some of the thought processes and research I've made. (** Warning, I'm not a licensed CS major so these may not be 'official' theories **) Perhaps if I start, others will expand/correct what I say. > editor buffer data structures - multiple file buffers GAP BUFFER I've seen two ways people have done this and know of at least 2 other approaches. One way is the gap buffer of my previous article. In the editor I used that in, I didn't start out using a gap buffer. Originally, I had a large array to hold the file in and a small array which held a small piece of the large array in which the inserting and deleting occured. Moving the point involved saveing (and possibly making space in the large array for) the small array and loading a new one. When I found out about the gap method I rewrote the buffer manager. Both my method and the gap method work on a similer method but the gap method is much more refined. If anyone knows, I'd be interested in hereing who invented the gap buffer. Gnu-emacs uses a gap buffer. The gap buffer has the best performance if you can insure that the gap is always at the cursor. If this is guarenteed, then the only time that there is ever a delay is when the gap gets completely filled- on unix systems, you have to realloc the block the buffer is in and make a new gap. On systems where you can access the far ends of the memory, you can prolong this as much as possible by starting out with the largest possible gap. This is what I've seen done on machines without memory management (micros) and on machines with really good memory management (the DEC 20, where to allocate memory you just access it). If the editor has multiple windows onto the same file it may not be possible to insure that the gap is always at the cursor. There is a tradeoff to think about here. You can do two things: move the gap with the cursor into the other window. Then there is a delay here, but the gap is always with the cursor. The second way is to move the gap only when you are about to insert or delete something. This is not a bad idea since you usually only edit in one window and only look at the others. You CAN have more than one pointer to the buffer but you can't have more than one gap (well, you could have more than one gap but it would be very, very messy to manage). I don't like the gap buffer for editors with multiple windows because of the gap motion delays. In particular, if you're on a virtual memory system, much of the buffer might be swapped out to the disk. Then if you have to move the gap there has to be a lot of disk accesses. Still, the gap editor does use the least amount of cycles (remember the things you do most often in an editor is type single characters and move the cursor short distances), so if you're most interested in not loading down the system... IBM METHOD One method which I suspect many old IBM machines use is to store the text in an array of fixed length arrays. This is a leftover from 80 column puch cards. LINKED LIST OF LINES BUFFER The next most common buffer data structure I've seen is the doubly linked list of lines buffer. In this case you store each line in a structure: struct line { struct line *next; /* Link to other lines */ struct line *prev; unsigned char size; /* Size of 'text' */ unsigned char bksize; /* Size of malloc block this is in */ char text[]; }; Each line contains only a small amount of text so you can insert and delete using brute force without any noticable delays (although this will use 20 times as many cycles as the gap buffer). This buffer scheme eliminates the large delays of the gap method and also eliminates some of the delays at higher levels in the editor (you don't have to search for ends of lines any more), but it also has some problems. Most significantly, the line length is limited and the file must be processed when it's loaded and stored. Also, splitting lines and doing cut & paste become slightly more complicated and the memory overhead is quite a bit higher (50% higher). However, even with these problems, many people choose it for its simplicity and speed- brief, jove, vedit, vi, and many more use this method. A slight variation of this scheme can ease some of its problems. Instead of attaching the headers to the text, you can have them seperate: struct line { struct line *next; struct line *prev; unsigned char size; /* Size of text */ unsigned char bksize; /* Size of block text is in */ char *text; }; If the headers are seperate like this then you can do things a little differently. Instead of loading files in small pieces (or in seperate buffers) you can load them in contiguously, the way you can with the gap buffer. You still have to go through this loaded text and find where all the lines are and make the headers, but this isn't so bad. An added benefit of this is that there will be bit less virtual memory swapping when you move through a lot of lines without looking at the text (this requires that the headers be packet together away from the text). If you are a truely masachistic programmer and don't have any memory problems, you can solve the line length limit by having a linked list of line segments within each line. The code for handling this is quite complex. I don't particularly like this buffer scheme either because of the line length limits and the huge memory overhead. A header will contain 14 bytes and a typical line has 20 characters in it- almost 100% more memory is needed for this method than the gap buffer. LINKED LIST BUFFER Another buffer scheme similer to the linked list of lines buffer is a linked list of blocks- I.E., ignore where the lines begin: struct block { struct block *next; struct block *prev; unsigned char size; unsigned char bksize; char *text; }; This method will also need at least 14 bytes in each header, but you can start out editing with each block containing a large amount of text (a full 256 characters, say). This solves the memory problem because it will only use about 5% more memory (14/256) than the gap buffer. But you will have to search for line-ends as with the gap buffer. BUFFER HINTS With the linked list of lines buffer and the linked list buffer, you can store hints (extra information) in the headers. For the linked-list of lines buffer these might include the column width of the line and which attributes (if the editor is a word processor) became active in this line (as an other way of not going through all the text, only the headers). For the linked list buffer, these might include the number of lines contained in the text block and the column width of this block. Again, so that you don't have to look at as much text. VIRTUAL MEMORY TECHNIQUES A good editor will have a software virtual memory manager. Even if the machine has hardware virtual memory, the process size limit is usually too small (anything smaller than the largest file allowed on the file system is too small- and then you want to be able to edit more than one file at once...). There are two general techniques for handling virtual memory. You could either have a very seperate virtual memory handler in which you have to translate virtual addresses to physical ones (or machine virtual ones) or have the virtual memory as an integral part of the buffer scheme. When the gap buffer is employed, the virtual memory handler is usually part of the buffer scheme. One way of doing this is to break the file being edited into many small files. You will only have one of these loaded at a time in the gap buffer arrangement. Usually there will be a set of interface functions between the buffer and the rest of the editor (fmgetc, fminsc, and fmdel of my gap buffer for example). Each of these has to be made to test for when the end of or the beginning of the buffer is reached and then swap to a new buffer. The more complex buffer functions get a little messy. My fmsave function which writes a number of characters out to a file beginning with the point would have to go through all the virtual memory files which contain data to be saved. A doubly linked list structure could be used to manage all the files and would allow new files to be inserted relatively easily. An important concept is to have two files for one whole buffer in memory. This way, when the end or beginning of the buffer is reached you only outload half of it. If this isn't done, there will be one position in the file where the disk will be accessed whenever you move around it. Also, if the screen update algorithm reads the screen from the buffer for each key... I think (I'm not sure) that this form of virtual memory was used often in the past for editors such as edt, teco and WordStar. It's not terribly appropriate when there are multiple windows since going between windows would require swapping (I theorize that this is why WordStar never had multiple windows). This method is very easy to implement and completely hides the virtual memory from higher levels. The second general method uses a virtual memory system which sits below the buffer scheme. It's usually used with the linked list methods since these don't require large contiguous blocks. Probably the best way of doing this is to use a linked list method with the headers seperate from the text. Then you store all the headers in one virtual memory file and the text in another. This is nice bacause to load a file all you have to do is copy it into ram or if it doesn't fit, copy it into a virtual memory file. Then there would be a set of functions to translate virtual addresses to physical ones: void *lock(struct vfile *file,unsigned long addr); void unlock(struct vfile *file,unsigned long addr); void change(struct vfile *file,unsigned long addr); 'lock' would translate the virtual address into a physical one and lock that bit of virtual memory into physical memory so that it could can be accessed. Then when accessing it is done, 'unlock' is called. If the memory was changed while it was locked, 'change' is called to indicate that the block is dirty and should be written out the virtual memory file. The virtual memory file should be broken into pages. Each page of the virtual memory file will have header like this when it's loaded into physical memory: struct vpage { struct vpage *next; /* linked list of pages */ VFILE *owner; /* points to owner VFILE structure */ unsigned long addr; /* Offset from beginning of file where this block is*/ int locks; /* number of locks on this page */ int changed; /* set if this block must be written out */ char data[16384]; /* data loaded from virtual file */ }; the vfile structure would be something like this: struct vfile { unsigned long size; /* Size of virtual memory file */ char *name; /* File name (or 0 if doesn't exist) */ int fd; /* Descriptor (-1 if file not open) */ }; The pages which are loaded into physical memory are stored in singly linked list. When 'lock' is called, it first searches this list to see if the page is already loaded. If it is, 'lock' will return a pointer into the data part of the vpage (plus whatever offset within that page is requested) and move the vpage to the head of the list. This is done since 'lock' should be called for the same page often. Lock will also increment the 'locks' count in vpage (and unlock will decrement it) to prevent it from beBing swapped out. When there is no more memory to load new pages, 'lock' will have to reuse old unused pages (ones which don't have 'change' or 'locks' set). If there are no reuseable pages, then 'lock' must swap out 'changed' pages which arn't locked. It's probably a good idea to swap out all the 'changed' pages (which arn't locked) when this condition occures because usually more pages are about to be loaded in when one page is loaded in. If all the pages arn't swapped out, the disk head will move between the file being loaded and the swap file for each block which is slow (and the disk drive makes a funny sound). The swap file will not be needed if there is enough room for the file be loaded completely into memory. Therefore, the comments in the vfile structure. The temporary swap file will be created when there's no more memory. [Also, it's a good idea to have linked list of the vfiles. This way if you run out of file descriptors (you can only have 32 open files in unix) you can close all the swap files (which should automatically reopen when needed)] Unless you want to implement a complete malloc system in the virtual files, it's a good idea to make the virtual memory block some multiple of the size of both the headers and the text. (make the headers 16 bytes, the text 256 and the virtual block some large power of 2). You also don't want to have text overlap between blocks. If you use linked list of lines, you will have to implement a malloc for the virtual memory files because it's too wastefull to have fixed sized blocks in the virtual memory when lines are typically only 10% of their maximum length. I'm not completely sure, but I think vedit and me use an malloc system in a paged virtual memory. POINTERS Usually there will be several pointers into a buffer. One for each window, one or two for block beginning and ends and the cursor. These pointers should be placed in a linked list so that the low level insert/delete functions can update them. If this isn't done then all pointers after inserts and deletes will move relative to the text they point to. I personally think you should keep a lot of data in the pointers: struct point { struct point *next; unsigned row; unsigned column; /* column number from beginning of line */ unsigned byte; unsigned byte_line; /* byte offset from beginning of line */ /* If linked list method is used */ unsigned long header; /* virtual address of header */ unsigned offset; /* offset into text of header */ /* If this is a wordprocessor */ unsigned attributes; unsigned right_margin; etc... }; All of this data should be updated for each pointer whenever pointers move and whenever inserts/deletes occure. Updating this data when doing inserts/deletes also saves you from having to check for bounds problems (I.E., if a pointer moved past the end of a file, or if one of the text blocks got deleted and so 'header' is no longer valid). Pointers should be the standard positioning objects within the editor. The basic editing functions should use pointers as their arguments: cursor_down(struct point *pointer); cursor_right(struct point *pointer); struct point *dup_point(struct point *pointer); save(file,struct point *from,struct point *to); insert_file(file,struct point *position); etc... > cut and paste problems My favorite way of doing cut/paste is to do cut & paste with the file saving and inserting functions. This way you can cut & paste between edit sessions. Although, on smaller systems you'd certainly want to have a memory buffer for cutting and pasting so that you don't always have to access slow disk drives. Cutting & pasting also require you to update all the pointers which follow the cur/paste position. > undo issues The key to undo is to have a standard set of buffer primatives (insert, delete and pointer motion). When any of these is executed, the converse is stored in an undo buffer of some type. If the linked list methods are used it should be possible to keep the undo data right there (I.E., a deleted block is already in the form of a linked list, so all you have to do is clean up the ends). Undo should also always have a Redo... incase you undo too much. Also I like the idea of having an undo/redo system that works with just the positions. So you can undo to the previous position and redo back to the original one. The points could be stored whenever text is changed at a position. > displaying and editing around tabs > screen display issues > word wrap vs horizontal scrolling There are two approaches to screen updates. The first is to have all the edit functions update the screen when they are executed. Editors that do this usually use few CPU cycles but are very complicated and arn't too customizable. vi is such an editor. The second approach is to have the screen update completely seperate from the edit functions. This is the nice modular way of making an editor which can be customized by people who don't know the complete screen update processes. The general algorithm for updating the screen is this: after each edit function is executed (each key is pressed) compare the data on the screen (or data in an array which is a copy of the screen) with the buffer and write any characters which don't match to the screen (if there are multiples windows, compare them all in case there is more than one on the same buffer). For micros, this is usually all that is needed since the screen is accessable as normal ram and this process is reasonably fast. The screen update function might be made suspendable if the screen update takes longer then you can type. If any keys are hit will the screen is being updated, then you execute that edit function and start the screen update over again afterwards. To handle tabs, control characters and other multi-character symbols two approaches can be taken. The first is to just not handler them. Convert the tabs into spaces and display the control characters in inverse video instead of as ^A. If you want to handle them then you need a translation table: struct { int width; /* Column width of sequence to display */ char *sequence; /* Sequence to display */ } table[256]; Then when you compare the buffer you go through this table first. 'width' could be set to '-1' to indicate special processing and 'sequence' could contain a function to do it. For exmaple, if width is -1 then sequence points to a tab handler which takes the current position being updated (a pointer) and returns a sequence of characters which should be displayed back to the screen update algorithm (I.E., 1 to 8 spaces, the number depending on the column position). TERMINALS A whole book could probably be written about how to update terminal screens going through slow serial ports. In general there are two parts to this. First, there should be a way to optomize setting the cursor position since there are usually many ways to do this (CR, BS output a character, etc...). The second is to try to find instances of terminal control sequences which make the terminal screen look more like the buffer. These include scrolling regions and line inserts/deletes and the methods of detecting when they can be used (Gosling's famous algorithm). Also included are easier sequences to handle: clear to the end of lines, clear to end of screen etc. Once these sequences are used the brute force compare method previously described is called to get anything missed. NETWORKS I've found that to get the screen update suspendable over terminals on networks requires that the editor know the baud rate of the terminal. Then whenever anything is sent to the terminal you must delay by the number of characters sent times the baud time before sending anything else. If this isn't done, data builds up in network server buffers and the screen becomes uninterruptable (hitting page down 5 times make the screen output 5 pages insteade of only one). INSIDE MULTI-CHARATCER SEQUENCES AND PAST THE ENDS OF LINES I think most people prefer not to have the cursor change column position when the row is changed. A way to do this is to have a flag in the pointer structure which indicates that the cursor is not at a legal column position. Then when the user types, space can be added to the ends of lines or the column position could be changed to a legal one. This is greatly superior to the other things I've seen: wordstar: immediatly go to legal column positions when rows change, edt: keep the cursor at the same byte offset from beginnings of lines (the cursor will jump to random column positions when the row changes if there are tabs or control characters), emacs & vi: remember the column position but move to a legal position anyway (and move back to the old column position whenever possible), turbo pascal: ignore tabs. ATTRIBUTES Handling attributes in wordprocessors requires that attribute change information be stored somewhere. A nice place to put this information is in the line/block headers. Unless the attributes are inverting, you need to store the new attribute along with what it was previously so that you can move 'up'. > regular expression search/replace > folds I'll leave these for others. I have to research regular expression compiling to say anything intelligent and I havn't encountered folds. [plus this article is getting just a bit long... :] Also, I'd add: macro languages and customizability issues. >- ant > > achowe@tiger.waterloo.edu | "It is hard to make the world go away > _ -|-|_ _ | when it has decided to notice you." > (_\ |\| | | | (_) |\| \/ | - Spock's World > ___/ | disclaimer... -- "Come on Duke, lets do those crimes" - Debbie "Yeah... Yeah, lets go get sushi... and not pay" - Duke
leech@homer.cs.unc.edu (Jonathan Leech) (02/06/90)
In article <7676@wpi.wpi.edu> jhallen@wpi.wpi.edu (Joseph H Allen) writes: >struct line > { > struct line *next; /* Link to other lines */ > struct line *prev; > unsigned char size; /* Size of 'text' */ > unsigned char bksize; /* Size of malloc block this is in */ > char text[]; > }; >... > >This buffer scheme eliminates the large delays of the gap method and also >eliminates some of the delays at higher levels in the editor (you don't have >to search for ends of lines any more), but it also has some problems. Most >significantly, the line length is limited and the file must be processed when >it's loaded and stored. If you insist on using an unsigned char as your line length, that would be a problem. Use a long. There are some other nice things you can do with the linked-list model. It makes selecting a subset of lines to edit (aka the XEDIT ALL command) very easy, for example. I come from the quarter-plane, rectangle-oriented school of editing, and this model fits much better than buffer gap, which in turn seems natural for stream-oriented editors like EMACS. -- Jon Leech (leech@cs.unc.edu) __@/ ``The experiment must be wrong'' - Richard Feynman (as quoted by Eugen Merzbacher), upon hearing that experimental data did not agree with theoretical predictions. Feynman was correct :-)
webber@porthos.rutgers.edu (Bob Webber) (02/06/90)
In article <20518@watdragon.waterloo.edu>, achowe@tiger.waterloo.edu (anthony howe) writes: > Is there a bible or standard text that has the of the theory behind > the issue of roling your own editor worked out? Don't know of a bible, but there is a cookbook: MIT/LCS/TM-165 Theory and Practice of Text Editors Or A Cookbook for an EMACS Craig A. Finseth May 1980 Laboratory for Computer Science; Massachusetts Institute of Technology; 545 Technology Square, Cambridge, Massachusetts 02139 106 pages. [was his B.S. thesis] 3 page introduction 29 pages on memory management including: buffer gap, linked line, multiple buffers, paged virtual memory, editing extremely large files and scratchpad memory. 21 pages on incremental redisplay including: line wrap, multiple windows, terminal types, redisplay algorithms, The Framer, and memory mapped systems. 20 pages on the command loop including: read-eval-print, error recovery, arguements, rebinding keys and functions, modes, kill and undo, and implementation languages (teco, sine, lisp, pl/1, C, fortran, pascal, ...) 7 pages on user interface hardware including: keyboards and graphical input. 3 pages on the world outside text editing. 10 pages of annotated bibliography. 1 page on some implementations of emacs type editors 6 pages on partial emacs command list enjoy. ---- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)
allbery@NCoast.ORG (Brandon S. Allbery) (02/06/90)
Question for editor wizards: I've been pondering a sort of cross between a linked list of blocks and the buffer/gap mechanism; call it "multi-gap", just for consideration. This would use a number of medium-to-large blocks (say, 8K); each block would initially be half full, with an optional empty block would be at the beginning of the "file" to handle inserts at the start of the buffer. This would seem to have many of the advantages of the gap system while avoiding some disadvantages (i.e. you can easily arrange for each window in a multi-window editor to have its window-point at or near a gap; oving the gap is less expensive because it only moves a maximum distance of the block size; inserting a file can be done by allocating one or more new blocks for its contents; cut and paste can be done by arranging for the cut text to be in its own block). It *does* have some (seemingly?) minor disadvantages over a straight buffer gap mechanism; for example, you want to split blocks that become too full, and coalesce neighboring blocks that are getting too small (with the exception of the first block). On the other hand, the resulting blocks can easily be combined with a VM mechanism. Am I missing anything, or is this a reasonable way to go about constructing an editor? I would think that something would use it if it were as good as it seems to be.... (P.S. Thanks for the editor tutorial(s).) ++Brandon -- Brandon S. Allbery allbery@NCoast.ORG, BALLBERY (MCI Mail), ALLBERY (Delphi) uunet!cwjcc.cwru.edu!ncoast!allbery ncoast!allbery@cwjcc.cwru.edu
zaphod@madnix.UUCP (Ron Bean) (02/10/90)
In Article <Feb.5.20.43.14.1990.22399@porthos.rutgers.edu> webber@porthos.rutgers.edu (Bob Webber) writes: > MIT/LCS/TM-165 > Theory and Practice of Text Editors Or A Cookbook for an EMACS > Craig A. Finseth > May 1980 > Laboratory for Computer Science; Massachusetts Institute of Technology; > 545 Technology Square, Cambridge, Massachusetts 02139 How exactly does one obtain a copy of this document? Do I just write a letter "to whom it may concern" at MIT/LCS? Or do I have to go to Boston and photocopy the original by hand? ================== zaphod@madnix.UUCP (Ron Bean) {harvard|rutgers|ucbvax}!uwvax!astroatc!nicmad!madnix!zaphod {decvax|att}!
webber@athos.rutgers.edu (Bob Webber) (02/11/90)
In article <1105@madnix.UUCP>, zaphod@madnix.UUCP (Ron Bean) writes: < < In Article <Feb.5.20.43.14.1990.22399@porthos.rutgers.edu< < webber@porthos.rutgers.edu (Bob Webber) writes: < < < MIT/LCS/TM-165 < < Theory and Practice of Text Editors Or A Cookbook for an EMACS < < Craig A. Finseth < < May 1980 < < Laboratory for Computer Science; Massachusetts Institute of Technology; < < 545 Technology Square, Cambridge, Massachusetts 02139 < < How exactly does one obtain a copy of this document? Do I just < write a letter "to whom it may concern" at MIT/LCS? Or do I have < to go to Boston and photocopy the original by hand? To-whom-it-may-concern letters tend to work ok. Other possibilities would be checking out the library at your local university or sending email to postmaster at some mit machine which will probably turn up some way of finding out what is necessary online. Different universities take different positions on charging for tech reports -- sometimes they don't bother, but other times they want some sort of covering charge. Also, they may not keep all of their tech reports in print (again, this varies from school to school). There are plenty of people from MIT on the net, so maybe one of them will have more precise details to post. --- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)