CXT105@PSUVM.BITNET (Christopher Tate) (10/23/89)
What sort of data structures are generally used for representing text in an editor or word-processor? Obviously TextEdit is an alternative, but it's slow (as far as I've heard) and imposes size restrictions. Has anyone here done work with writing editors or word-processors? Could you give a newcomer to the problem a little hand getting started...? ------- Christopher Tate || "I hate quotations!" -- Ralph Waldo Emerson
pete@titan.rice.edu (Pete Keleher) (10/23/89)
> What sort of data structures are generally used for representing text in an > editor or word-processor? Obviously TextEdit is an alternative, but it's slo > (as far as I've heard) and imposes size restrictions. Not only does TextEdit impose size restrictions, but it's updating leaves a lot to be desired. One of my goals was to get the editor to look and feel as solid as Think C's editor, but not be bare-bones. Good question. I have looked at source for two editors and have written one of my own. The best data structure that I have heard of would be "chunks". A chunk would be a chunk of text, maybe 256 bytes, typically half full. The chunks are then linked together in a linked list. Insertion/deletion usually involves moving only some of the bytes in the current chunk. Sometimes, of course, you must split chunks or combine them. There is usually no array of line pointers, except maybe for the lines currently being displayed. Two important data structures could be declared something like: typedef struct Chunk { struct Chunk *next, *prev; short len; char text[256]; } Chunk; typedef struct { Chunk *chunk; char *cp; } mark; In my editor, I use one large chunk for the file image that is read from disk, along with an array of line pointers into the file image. When a line is modified, I allocate a chunk and go from there. When lines are inserted/deleted I need to move all of the line pointers after the line in question, but in practice this is cheap. The overhead of having data in the file image in memory but no longer used is also cheap. This is definitely NOT the way to do it if you're designing from scratch, but I made some poor decisions in the beginning (basically I put off the decision, and by the time I could see where to go, I couldn't do it without re-writing everything). The up side is that: 1) In the case where relatively little is modified, (say less than 30%), memory costs are low. 2) It is very fast. In particular, I can access any line as fast as any other. as opposed to the straight chunk approach where you have to search through all of the lines between the start of the file and your current position just to find out what line you are on. -- =========================================================================== Pete Keleher pete@titan.rice.edu Rice University knows nuttin about what I say, or what I do ... ===========================================================================
beard@ux1.lbl.gov (Patrick C Beard) (10/23/89)
In article <89295.142753CXT105@PSUVM.BITNET> CXT105@PSUVM.BITNET (Christopher Tate) writes: >What sort of data structures are generally used for representing text in an >editor or word-processor? Obviously TextEdit is an alternative, but it's slow >(as far as I've heard) and imposes size restrictions. > >Has anyone here done work with writing editors or word-processors? Could you >give a newcomer to the problem a little hand getting started...? > I've thought about this a bit, though I've not gotten around to implementing it. The solution that seems the most flexible is a doubly linked list of lines (or paragraphs if you prefer). That way, you'll have fairly small blocks of memory to move around when you insert characters etc. And changing the order of things is very inexpensive. And the solution would definitely depend on what the possible attributes are that you'll be keeping track of. This is an interesting topic. What have others done? Patrick ------------------------------------------------------------------------------- - Patrick Beard, Macintosh Programmer (beard@lbl.gov) - - Berkeley Systems, Inc. ".......<dead air>.......Good day!" - Paul Harvey - -------------------------------------------------------------------------------
tim@hoptoad.uucp (Tim Maroney) (10/23/89)
The linked list suggestions made so far may be fast enough for a word processor, but the Memory Manager will really bog down when you have a lot of these chunks allocated. If you want better performance, it would probably be better to pre-allocate a large buffer and treat nulls as non-characters, the "holy buffer" approach. This will completely avoid Memory Manager overhead, but may require you to move more data explicitly. You can optimize it in various ways, as any second-semester data structures textbook will tell you. You can leave extra space to make insertions faster, or use only as much space as the user types if you feel deletions are more important. The subject is not trivial and I recommend some research before you begin. I just plunged ahead with a relocatable linked list approach for one of my terminal emulators and I wound up with something that couldn't keep up with 2400 baud on an 8MHz 68000. On the other hand, it was a lot faster for data input than the MPW Shell. So I guess my main advice is -- be prepared to throw away your first stab at it. Make sure knowledge of your data structures is confined to only those files that need it, so that you can change them when you find that your first algorithm bogs down in certain situations. Be sure to test speed with respect to operations like large pastes and very fast typing. And familiarize yourself beforehand with the various buffering strategies that can be found in the textbooks. Most importantly, be sure to leave yourself a way to buffer large files on disk. -- Tim Maroney, Mac Software Consultant, sun!hoptoad!tim, tim@toad.com FROM THE FOOL FILE: "As the expert on stupid, I'll take your word for it." -- Richard Sexton, richard@gryphon.COM
pepke@loligo (Eric Pepke) (10/23/89)
It just so happens that I am curently working on a text editor for programming right now. I am using a single buffer with a single gap using long integers to remember the offsets to the gap, which I think is similar to the way Capps used to do it. (If Capps were still a supported product, I would probably be using it.) A reasonably large gap is kept at the current insertion point. Normal character insertions just go into the gap and make the gap one character smaller. Deletions of single characters just increase the size of the gap by one. The only times that text has to be moved are when a new gap is allocated or the old gap is removed. The idle routine tries to do this at unobtrusive times. Scroll bar handling is done as an integral part of the text editor. I strongly recommend that you do this rather than follow TextEdit's method. Scroll bar handling for long integers is very easy to do cleanly if you have enough information available. If you try to tack it on outside of the editor a la TextEdit, it is very difficult to do cleanly. Eric Pepke INTERNET: pepke@gw.scri.fsu.edu Supercomputer Computations Research Institute MFENET: pepke@fsu Florida State University SPAN: scri::pepke Tallahassee, FL 32306-4052 BITNET: pepke@fsu Disclaimer: My employers seldom even LISTEN to my opinions. Meta-disclaimer: Any society that needs disclaimers has too many lawyers.
amanda@intercon.com (Amanda Walker) (10/24/89)
In article <299@fsu.scri.fsu.edu>, pepke@loligo (Eric Pepke) writes: > It just so happens that I am curently working on a text editor for programming > right now. I am using a single buffer with a single gap using long integers > to remember the offsets to the gap, which I think is similar to the way Capps > used to do it. (If Capps were still a supported product, I would probably be > using it.) Yup, sounds like Capps. For a lot of applications, a simple buffer-gap scheme is definitely a good way to go. It is quite fast in normal kinds of editing, and it is simple enough that it's easy to implement and maintain. It's biggest bad point is that it still needs to have the whole document in memory. This isn't so bad for something like GNU Emacs, where stuff just gets swapped out, but on a Mac it can be annoying. You could build some kind of simple paging system underneath the editor buffer, though. Another problem is that it lends itself to "big hunk o' ASCII" kinds of text editing. Adding style run & script manager support gets... interesting. If you're doing things that are more word processing than text editing, I'd suggest using the linked list of paragraphs approach. This also gives you a natural unit of swapability for editing in cramped memory. -- Amanda Walker <amanda@intercon.com>
jc@gmdzi.UUCP (Juergen Christoffel) (10/25/89)
pepke@loligo (Eric Pepke) writes: >It just so happens that I am curently working on a text editor for programming >right now. I am using a single buffer with a single gap using long integers >to remember the offsets to the gap, which I think is similar to the way Capps >used to do it. (If Capps were still a supported product, I would probably be >using it.) I've been thinking about these issues too for a while now and it really looks that one has to re-invent most of the memory manager here in order to get decent performance. Sigh. Eric Pepke's description reminded me of an old MIT Memo which described this technique in detail which you might be interested in; luckily I found a copy after some searching: MIT/LCS/TM-165 Theory and Practice of Text Editors (or: A Cookbook for an Emacs) Craig A. Finseth May 1980 It's still up-to-date these days; chapter two is about memory management (about 30 pages), chapter three covers incremental redisplay strategies. You might like to get hold of a copy yourself. The table of contents for chapter two lists: Data Structures Marks Interface Procedures Buffer Gap Gap Size Multiple Gaps and Why They Don't Work The Hidden Second Gap Linked Line Storage Comparison Error recovery Comparison Multiple Buffers Paged Virtual Memory Editing Extremely Large Files Scratchpad memory Poking around in the gnuemacs sources a bit assured me that it uses a similar approach (routines like make_gap and move_gap called from SelfInsert) so this approach seems to be still in use today. Symbolics' lisp machines (which I'm quite familiar with) use the linked line approach in their ZWEI editor substrate (ZWEI was initially developed at MIT for MIT's lisp machines) and they must have had a reason to not use the gap approach. Maybe it's their special hardware support for various data structures. Tradeoffs between these two approaches are visible e.g. for buffer pointers (bps for short - or marks as they are called in MPW's editor). If you use linked lines you may store bps relative to these lines and have to update them only when inserting into the line they point to while the gap approach would require to update them for each insertion. It depends on how many bps you have ... I really like bps and would like to see them in more editors (e.g. Think C's) -- lisp machines use them a lot for example to record the definitions of functions/procedures and global variables so you simply have to mouse an indentifier while holding down a modifier key and off you go to the definition (the real one, NOT the first textual appearance in the buffer as THINK does -- that's annoying if you use prototypes in C!) So hopefully everybody writing a new editor for a Mac spends some time thinking how s/he could provide such support. It's really useful, not just for programmers but for large texts as well, think of bps pointing to chapters/sections of your paper or applications of various style sheets etc. Using a general chunk approach (might think of linked lines as a special case of chunks) instead of the gap approach seems a straightforward way for editors which have some more complicated structures to maintain than "just" linear text: Hypertext systems need to keep track of nodes and links in their data structures and I wonder whether the gap approach is capable of representing such text too in an efficient manner. Does anybody have some experience in this and would like to share it? >Scroll bar handling is done as an integral part of the text editor. I strongly >recommend that you do this rather than follow TextEdit's method. Scroll bar >handling for long integers is very easy to do cleanly if you have enough >information available. If you try to tack it on outside of the editor a la >TextEdit, it is very difficult to do cleanly. Would you please be so kind to elaborate a bit on this topic as I would like to learn more about your approach. Thanks! Juergen Christoffel ======================================================================== Juergen Christoffel, jc@gmdzi.UUCP, (++49 2241) 14-2421 German National Research Laboratory for Computer Science (GMD) Schloss Birlinghoven, Postfach 1240, D-5205 St. Augustin 1, FRG ========================================================================
tom@iconsys.UUCP (Tom Kimpton) (10/31/89)
In article <PETE.89Oct22200144@titan.rice.edu> pete@titan.rice.edu (Pete Keleher) writes: > >> What sort of data structures are generally used for representing text in an >> editor or word-processor? Obviously TextEdit is an alternative, but it's slo >> (as far as I've heard) and imposes size restrictions. > >Good question. I have looked at source for two editors and have written one >of my own. The best data structure that I have heard of would be "chunks". A >chunk would be a chunk of text, maybe 256 bytes, typically half full. The >chunks are then linked together in a linked list. Insertion/deletion usually >involves moving only some of the bytes in the current chunk. Sometimes, of >course, you must split chunks or combine them. There is usually no array of >line pointers, except maybe for the lines currently being displayed. Two >important data structures could be declared something like: > >typedef struct Chunk { > struct Chunk *next, *prev; > short len; > char text[256]; >} Chunk; > I did somethings very similar, but used smaller chunks. Because it was meant to be a program editor, I guestimated that the average line would be fairly short, and used 64 byte chunks, with sideways links. Joining lines involved no data movement, because the display code used the length of each chunk, and kept displaying while there were right links. Insertion and deletion involved no more than a single chunk's data (right & left shifting characters), empty chunks being deleted and new chunks being inserted when a chunk filled up. I didn't use C strings (null terminated) because I wanted to be able to edit binary files. Also, with sideways links, there is no problem with arbitrary length lines (vi's "Line too long" problem-- arggh!). Something I didn't do at the time, but would if doing it again, was to have an indent field, for auto-indentation, you might be able to cut down the size of the chunks. This was for a mono-font,size etc. editor. If you wanted to allow the use of different fonts, sizes, styles, etc, you might consider using high-bit set (non-ASCII) characters to indicate one of 128 "style" records (more if you have your code use one or more following bytes). Lots of fun. Editors are fun (well I guess most any programming project can be fun :-). Good luck with your program! -- Tom Kimpton UUCP: {uunet,caeco,nrc-ut}!iconsys!tom Software Engineer INTERNET: tom@iconsys.uu.net Icon International, Inc. BITNET: icon%byuadam.bitnet (multi-user acct) Orem, Utah 84058 PHONE: (801) 225-6888
d88-bli@nada.kth.se (Bo Lindbergh) (11/01/89)
In article <413@iconsys.UUCP> tom@iconsys.UUCP (Tom Kimpton) writes: [ stuff about text editor data structures deleted ] > If you wanted to allow the use >of different fonts, sizes, styles, etc, you might consider using >high-bit set (non-ASCII) characters to indicate one of 128 "style" >records (more if you have your code use one or more following bytes). No, no, for Odin's sake NO! You might call characters with codes above 127 "non-ASCII", but I call them indispensable international characters. An editor that won't let the user use them is useless outside the USA. #include <Generic flame about American provincialism> Bo Lindbergh
d88-jwa@nada.kth.se (Jon Watte) (11/03/89)
In article <2245@draken.nada.kth.se> d88-bli@nada.kth.se (Bo Lindbergh) writes: >In article <413@iconsys.UUCP> tom@iconsys.UUCP (Tom Kimpton) writes: >[ stuff about text editor data structures deleted ] >> If you wanted to allow the use >>of different fonts, sizes, styles, etc, you might consider using >>high-bit set (non-ASCII) characters to indicate one of 128 "style" >>records (more if you have your code use one or more following bytes). >No, no, for Odin's sake NO! Of course Bo LIndberg's right. Inside Mac says: XI) Thou shalt not assume a character requres only one byte Using the Script manager, and all of the international utilities, requres some extra work, yes, but it will be a good program when it's done. Keep the style record off the text, to maintain compatibility with TEXT-only editors. I had a flame war with some people at Microsoft about this a while ago :-) Happy Hacking ! h+@nada.kth.se -- This .signature is longer than 4 lines. If you'd like to see it in whole, please send a note to me. I'm h+@nada.kth.se and also h+@proxxi.se 8')