[comp.sys.mac.programmer] Serious programming question

CXT105@PSUVM.BITNET (Christopher Tate) (10/23/89)

What sort of data structures are generally used for representing text in an
editor or word-processor?  Obviously TextEdit is an alternative, but it's slow
(as far as I've heard) and imposes size restrictions.

Has anyone here done work with writing editors or word-processors?  Could you
give a newcomer to the problem a little hand getting started...?

-------
Christopher Tate      ||      "I hate quotations!" -- Ralph Waldo Emerson

pete@titan.rice.edu (Pete Keleher) (10/23/89)

> What sort of data structures are generally used for representing text in an
> editor or word-processor?  Obviously TextEdit is an alternative, but it's slo
> (as far as I've heard) and imposes size restrictions.

Not only does TextEdit impose size restrictions, but it's updating leaves
a lot to be desired. One of my goals was to get the editor to look and feel
as solid as Think C's editor, but not be bare-bones.

Good question. I have looked at source for two editors and have written one
of my own. The best data structure that I have heard of would be "chunks". A
chunk would be a chunk of text, maybe 256 bytes, typically half full. The
chunks are then linked together in a linked list. Insertion/deletion usually
involves moving only some of the bytes in the current chunk. Sometimes, of
course, you must split chunks or combine them. There is usually no array of
line pointers, except maybe for the lines currently being displayed. Two 
important data structures could be declared something like:

typedef struct Chunk {
	struct Chunk	*next, *prev;
	short		len;
	char		text[256];
} Chunk;

typedef struct {
	Chunk	*chunk;
	char	*cp;
} mark;

In my editor, I use one large chunk for the file image that is read from disk,
along with an array of line pointers into the file image. When a line is 
modified, I allocate a chunk and go from there. When lines are inserted/deleted
I need to move all of the line pointers after the line in question, but in
practice this is cheap. The overhead of having data in the file image in memory
but no longer used is also cheap. This is definitely NOT the way to do it if
you're designing from scratch, but I made some poor decisions in the beginning
(basically I put off the decision, and by the time I could see where to go, I
couldn't do it without re-writing everything). The up side is that:

	1) In the case where relatively little is modified, (say less than 
	   30%), memory costs are low.
	2) It is very fast. In particular, I can access any line as fast as any
	   other. as opposed to the straight chunk approach where you have to 
	   search through all of the lines between the start of the file and
	   your current position just to find out what line you are on.



--

===========================================================================
Pete Keleher						pete@titan.rice.edu

Rice University knows nuttin about what I say, or what I do ...
===========================================================================

beard@ux1.lbl.gov (Patrick C Beard) (10/23/89)

In article <89295.142753CXT105@PSUVM.BITNET> CXT105@PSUVM.BITNET (Christopher Tate) writes:
>What sort of data structures are generally used for representing text in an
>editor or word-processor?  Obviously TextEdit is an alternative, but it's slow
>(as far as I've heard) and imposes size restrictions.
>
>Has anyone here done work with writing editors or word-processors?  Could you
>give a newcomer to the problem a little hand getting started...?
>

I've thought about this a bit, though I've not gotten around to implementing
it.  The solution that seems the most flexible is a doubly linked list of
lines (or paragraphs if you prefer).  That way, you'll have fairly small
blocks of memory to move around when you insert characters etc.  And changing
the order of things is very inexpensive.  And the solution would definitely
depend on what the possible attributes are that you'll be keeping track of. 

This is an interesting topic.  What have others done?

Patrick 

-------------------------------------------------------------------------------
-  Patrick Beard, Macintosh Programmer                        (beard@lbl.gov) -
-  Berkeley Systems, Inc.  ".......<dead air>.......Good day!" - Paul Harvey  -
-------------------------------------------------------------------------------

tim@hoptoad.uucp (Tim Maroney) (10/23/89)

The linked list suggestions made so far may be fast enough for a word
processor, but the Memory Manager will really bog down when you have a
lot of these chunks allocated.  If you want better performance, it
would probably be better to pre-allocate a large buffer and treat nulls
as non-characters, the "holy buffer" approach.  This will completely
avoid Memory Manager overhead, but may require you to move more data
explicitly.  You can optimize it in various ways, as any
second-semester data structures textbook will tell you.  You can leave
extra space to make insertions faster, or use only as much space as the
user types if you feel deletions are more important.

The subject is not trivial and I recommend some research before you
begin.  I just plunged ahead with a relocatable linked list approach
for one of my terminal emulators and I wound up with something that
couldn't keep up with 2400 baud on an 8MHz 68000.  On the other hand,
it was a lot faster for data input than the MPW Shell.  So I guess my
main advice is -- be prepared to throw away your first stab at it.
Make sure knowledge of your data structures is confined to only those
files that need it, so that you can change them when you find that your
first algorithm bogs down in certain situations.  Be sure to test speed
with respect to operations like large pastes and very fast typing.  And
familiarize yourself beforehand with the various buffering strategies
that can be found in the textbooks.  Most importantly, be sure to leave
yourself a way to buffer large files on disk.
-- 
Tim Maroney, Mac Software Consultant, sun!hoptoad!tim, tim@toad.com

FROM THE FOOL FILE:
"As the expert on stupid, I'll take your word for it."
  -- Richard Sexton, richard@gryphon.COM

pepke@loligo (Eric Pepke) (10/23/89)

It just so happens that I am curently working on a text editor for programming
right now.  I am using a single buffer with a single gap using long integers
to remember the offsets to the gap, which I think is similar to the way Capps 
used to do it.  (If Capps were still a supported product, I would probably be
using it.)

A reasonably large gap is kept at the current insertion point.  Normal 
character insertions just go into the gap and make the gap one character 
smaller.  Deletions of single characters just increase the size of the gap by
one.  The only times that text has to be moved are when a new gap is allocated
or the old gap is removed.  The idle routine tries to do this at unobtrusive
times.

Scroll bar handling is done as an integral part of the text editor.  I strongly
recommend that you do this rather than follow TextEdit's method.  Scroll bar
handling for long integers is very easy to do cleanly if you have enough 
information available.  If you try to tack it on outside of the editor a la
TextEdit, it is very difficult to do cleanly.

Eric Pepke                                     INTERNET: pepke@gw.scri.fsu.edu
Supercomputer Computations Research Institute  MFENET:   pepke@fsu
Florida State University                       SPAN:     scri::pepke
Tallahassee, FL 32306-4052                     BITNET:   pepke@fsu

Disclaimer: My employers seldom even LISTEN to my opinions.
Meta-disclaimer: Any society that needs disclaimers has too many lawyers.

amanda@intercon.com (Amanda Walker) (10/24/89)

In article <299@fsu.scri.fsu.edu>, pepke@loligo (Eric Pepke) writes:
> It just so happens that I am curently working on a text editor for programming
> right now.  I am using a single buffer with a single gap using long integers
> to remember the offsets to the gap, which I think is similar to the way Capps 
> used to do it.  (If Capps were still a supported product, I would probably be
> using it.)

Yup, sounds like Capps.  For a lot of applications, a simple buffer-gap
scheme is definitely a good way to go.  It is quite fast in normal kinds
of editing, and it is simple enough that it's easy to implement and maintain.

It's biggest bad point is that it still needs to have the whole document in
memory.  This isn't so bad for something like GNU Emacs, where stuff just
gets swapped out, but on a Mac it can be annoying.  You could build some
kind of simple paging system underneath the editor buffer, though.

Another problem is that it lends itself to "big hunk o' ASCII" kinds of text
editing.  Adding style run & script manager support gets... interesting.
If you're doing things that are more word processing than text editing, I'd
suggest using the linked list of paragraphs approach.  This also gives you
a natural unit of swapability for editing in cramped memory.

--
Amanda Walker <amanda@intercon.com>

jc@gmdzi.UUCP (Juergen Christoffel) (10/25/89)

pepke@loligo (Eric Pepke) writes:

>It just so happens that I am curently working on a text editor for programming
>right now.  I am using a single buffer with a single gap using long integers
>to remember the offsets to the gap, which I think is similar to the way Capps 
>used to do it.  (If Capps were still a supported product, I would probably be
>using it.)

I've been thinking about these issues too for a while now and it
really looks that one has to re-invent most of the memory manager here
in order to get decent performance. Sigh.

Eric Pepke's description reminded me of an old MIT Memo which
described this technique in detail which you might be interested in;
luckily I found a copy after some searching:

	MIT/LCS/TM-165
	Theory and Practice of Text Editors
	(or: A Cookbook for an Emacs)
	Craig A. Finseth
	May 1980

It's still up-to-date these days; chapter two is about memory
management (about 30 pages), chapter three covers incremental
redisplay strategies.  You might like to get hold of a copy yourself.

The table of contents for chapter two lists:

	Data Structures
	Marks
	Interface Procedures
	Buffer Gap
	   Gap Size
	   Multiple Gaps and Why They Don't Work
	   The Hidden Second Gap
	Linked Line
	   Storage Comparison
	   Error recovery Comparison
	Multiple Buffers
	Paged Virtual Memory
	Editing Extremely Large Files
	Scratchpad memory

Poking around in the gnuemacs sources a bit assured me that it uses a
similar approach (routines like make_gap and move_gap called from
SelfInsert) so this approach seems to be still in use today.

Symbolics' lisp machines (which I'm quite familiar with) use the
linked line approach in their ZWEI editor substrate (ZWEI was
initially developed at MIT for MIT's lisp machines) and they must have
had a reason to not use the gap approach. Maybe it's their special
hardware support for various data structures.

Tradeoffs between these two approaches are visible e.g. for buffer
pointers (bps for short - or marks as they are called in MPW's
editor). If you use linked lines you may store bps relative to these
lines and have to update them only when inserting into the line they
point to while the gap approach would require to update them for each
insertion. It depends on how many bps you have ...

I really like bps and would like to see them in more editors (e.g.
Think C's) -- lisp machines use them a lot for example to record the
definitions of functions/procedures and global variables so you simply
have to mouse an indentifier while holding down a modifier key and off
you go to the definition (the real one, NOT the first textual
appearance in the buffer as THINK does -- that's annoying if you use
prototypes in C!)  So hopefully everybody writing a new editor for a
Mac spends some time thinking how s/he could provide such support.
It's really useful, not just for programmers but for large texts as
well, think of bps pointing to chapters/sections of your paper or
applications of various style sheets etc.

Using a general chunk approach (might think of linked lines as a
special case of chunks) instead of the gap approach seems a
straightforward way for editors which have some more complicated
structures to maintain than "just" linear text: Hypertext systems need
to keep track of nodes and links in their data structures and I wonder
whether the gap approach is capable of representing such text too in
an efficient manner. Does anybody have some experience in this and
would like to share it?

>Scroll bar handling is done as an integral part of the text editor.  I strongly
>recommend that you do this rather than follow TextEdit's method.  Scroll bar
>handling for long integers is very easy to do cleanly if you have enough 
>information available.  If you try to tack it on outside of the editor a la
>TextEdit, it is very difficult to do cleanly.

Would you please be so kind to elaborate a bit on this topic as I
would like to learn more about your approach. Thanks!

Juergen Christoffel

========================================================================
  Juergen Christoffel, jc@gmdzi.UUCP, (++49 2241) 14-2421
  German National Research Laboratory for Computer Science (GMD)
  Schloss Birlinghoven, Postfach 1240, D-5205 St. Augustin 1, FRG
========================================================================

tom@iconsys.UUCP (Tom Kimpton) (10/31/89)

In article <PETE.89Oct22200144@titan.rice.edu> pete@titan.rice.edu (Pete Keleher) writes:
>
>> What sort of data structures are generally used for representing text in an
>> editor or word-processor?  Obviously TextEdit is an alternative, but it's slo
>> (as far as I've heard) and imposes size restrictions.
>
>Good question. I have looked at source for two editors and have written one
>of my own. The best data structure that I have heard of would be "chunks". A
>chunk would be a chunk of text, maybe 256 bytes, typically half full. The
>chunks are then linked together in a linked list. Insertion/deletion usually
>involves moving only some of the bytes in the current chunk. Sometimes, of
>course, you must split chunks or combine them. There is usually no array of
>line pointers, except maybe for the lines currently being displayed. Two 
>important data structures could be declared something like:
>
>typedef struct Chunk {
>	struct Chunk	*next, *prev;
>	short		len;
>	char		text[256];
>} Chunk;
>

I did somethings very similar, but used smaller chunks.  Because
it was meant to be a program editor, I guestimated that the
average line would be fairly short, and used 64 byte chunks, with
sideways links.  Joining lines involved no data movement, because
the display code used the length of each chunk, and kept displaying
while there were right links.  Insertion and deletion involved
no more than a single chunk's data (right & left shifting characters),
empty chunks being deleted and new chunks being inserted when a chunk
filled up.  I didn't use C strings (null terminated) because I wanted
to be able to edit binary files.  Also, with sideways links, there
is no problem with arbitrary length lines (vi's "Line too long"
problem-- arggh!).  Something I didn't do at the time, but would if
doing it again, was to have an indent field, for auto-indentation,
you might be able to cut down the size of the chunks.  This was
for a mono-font,size etc. editor.  If you wanted to allow the use
of different fonts, sizes, styles, etc, you might consider using
high-bit set (non-ASCII) characters to indicate one of 128 "style"
records (more if you have your code use one or more following bytes).

Lots of fun.  Editors are fun (well I guess most any programming
project can be fun :-).  Good luck with your program!

-- 
Tom Kimpton                    UUCP: {uunet,caeco,nrc-ut}!iconsys!tom
Software Engineer	       INTERNET: tom@iconsys.uu.net
Icon International, Inc.       BITNET: icon%byuadam.bitnet (multi-user acct)
Orem, Utah 84058               PHONE: (801) 225-6888

d88-bli@nada.kth.se (Bo Lindbergh) (11/01/89)

In article <413@iconsys.UUCP> tom@iconsys.UUCP (Tom Kimpton) writes:
[ stuff about text editor data structures deleted ]
>                                   If you wanted to allow the use
>of different fonts, sizes, styles, etc, you might consider using
>high-bit set (non-ASCII) characters to indicate one of 128 "style"
>records (more if you have your code use one or more following bytes).

No, no, for Odin's sake NO!

You might call characters with codes above 127 "non-ASCII", but I call them
indispensable international characters.  An editor that won't let the user
use them is useless outside the USA.

#include <Generic flame about American provincialism>

   Bo Lindbergh

d88-jwa@nada.kth.se (Jon Watte) (11/03/89)

In article <2245@draken.nada.kth.se> d88-bli@nada.kth.se (Bo Lindbergh) writes:
>In article <413@iconsys.UUCP> tom@iconsys.UUCP (Tom Kimpton) writes:
>[ stuff about text editor data structures deleted ]
>>                                   If you wanted to allow the use
>>of different fonts, sizes, styles, etc, you might consider using
>>high-bit set (non-ASCII) characters to indicate one of 128 "style"
>>records (more if you have your code use one or more following bytes).

>No, no, for Odin's sake NO!

Of course Bo LIndberg's right. Inside Mac says:

XI) Thou shalt not assume a character requres only one byte

Using the Script manager, and all of the international utilities,
requres some extra work, yes, but it will be a good program when
it's done. Keep the style record off the text, to maintain
compatibility with TEXT-only editors.

I had a flame war with some people at Microsoft about this a
while ago :-)

Happy Hacking !
						h+@nada.kth.se
-- 
This .signature is longer than 4 lines. If you'd like to see it in whole,
please send a note to me. I'm h+@nada.kth.se and also h+@proxxi.se    8')