[comp.emacs] Force exact matching of tag name

worley@compass.UUCP (Dale Worley) (02/03/90)

When I try to look up a tag with find-tag, I wind up seeing every tag
that contains the tag I'm looking for.  (For instance, look up
"find-file" in Emacs' tags.)  I've made changes to fix this problem.
The first change is to change the search in find-tag to be a regexp
search rather than a string search.  (This should be
upward-compatible, since no language that I know of uses any of Emacs'
regexp special characters as symbol constituents.)  The second is to
add a new special character "\y" to regexp searching, which means
"symbol boundary", just as "\b" matches word boundaries.
(Unfortunately, this change is necessary: there is no simple way to
specify "beginning of symbol" or "end of symbol", because symbol
constituents can be of two syntax classes.)  Then, if you want to find
a tag exactly, you can search for, for instance, "\yfind-file\y".
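
As a rough standalone illustration of the "\y" test (not the patch
itself), the sketch below checks one position in a string.  The syntax
table is a hypothetical stand-in for Emacs' real one: it assumes
letters and digits are word constituents and that "-", "_", "*" and
"+" are symbol constituents.  The names toy_syntax_of and
at_symbol_boundary are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy syntax classes, named after the ones used in the patch. */
enum toy_syntax { Sother = 0, Sword = 1, Ssymbol = 2 };

/* Hypothetical stand-in for a syntax table: letters and digits are word
   constituents; a few punctuation characters are symbol constituents. */
static enum toy_syntax toy_syntax_of(unsigned char c)
{
    if (isalnum(c))
        return Sword;
    if (c == '-' || c == '_' || c == '*' || c == '+')
        return Ssymbol;
    return Sother;
}

/* A character belongs to a symbol if it is in either of the two classes. */
static int is_symbol_constituent(unsigned char c)
{
    enum toy_syntax s = toy_syntax_of(c);
    return s == Sword || s == Ssymbol;
}

/* Nonzero if position pos (0..len) in text lies on a symbol boundary:
   at either end of the text, or between a symbol constituent and a
   character that is not one. */
int at_symbol_boundary(const char *text, size_t len, size_t pos)
{
    assert(text != NULL && pos <= len);
    if (pos == 0 || pos == len)
        return 1;
    return is_symbol_constituent((unsigned char)text[pos - 1])
        != is_symbol_constituent((unsigned char)text[pos]);
}
```

Under these assumptions, in "(find-file x)" the position between "("
and "f" is a symbol boundary, while the positions on either side of
the "-" are not, because word and symbol constituents are treated
alike.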

Dale Worley		Compass, Inc.			worley@compass.com
--
The great tragedy of science, the slaying of a beautiful theory by an ugly
fact.
        --Thomas Henry Huxley

*** regex.h.orig	Wed Jan 31 09:37:41 1990
--- regex.h	Wed Jan 31 09:38:45 1990
***************
*** 249,255 ****
      notwordbound, /* Succeeds if not at a word boundary */
      syntaxspec,  /* Matches any character whose syntax is specified.
  		    followed by a byte which contains a syntax code, Sword or such like */
!     notsyntaxspec /* Matches any character whose syntax differs from the specified. */
    };
  
  extern char *re_compile_pattern ();
--- 249,256 ----
      notwordbound, /* Succeeds if not at a word boundary */
      syntaxspec,  /* Matches any character whose syntax is specified.
  		    followed by a byte which contains a syntax code, Sword or such like */
!     notsyntaxspec, /* Matches any character whose syntax differs from the specified. */
!     symbolbound  /* Succeeds if at a symbol boundary */
    };
  
  extern char *re_compile_pattern ();


*** regex.c.orig	Wed Jan 31 09:37:49 1990
--- regex.c	Wed Jan 31 09:47:43 1990
***************
*** 127,132 ****
--- 127,133 ----
  
  #ifndef Sword /* must be non-zero in some of the tests below... */
  #define Sword 1
+ #define Ssymbol 2
  #endif
  
  #define SYNTAX(c) re_syntax_table[c]
***************
*** 637,642 ****
--- 638,647 ----
  	      PATPUSH (notwordbound);
  	      break;
  
+ 	    case 'y':
+ 	      PATPUSH (symbolbound);
+ 	      break;
+ 
  	    case '`':
  	      PATPUSH (begbuf);
  	      break;
***************
*** 818,823 ****
--- 823,829 ----
  	case notwordbound:
  	case wordbeg:
  	case wordend:
+ 	case symbolbound:
  	  continue;
  
  	case endline:
***************
*** 1493,1498 ****
--- 1499,1520 ----
  	      || SYNTAX (d == end1 ? *string2 : *d) != Sword) /* Next char not a letter */
  	    break;
  	  goto fail;
+ 
+ 	case symbolbound:
+ 	  {
+ 	    char before_syntax, after_syntax;
+ 
+ 	    if (d == string1  /* Points to first char */
+ 		|| d == end2  /* Points to end */
+ 		|| (d == end1 && size2 == 0)) /* Points to end */
+ 	      break;
+ 	    before_syntax = SYNTAX (d[-1]);
+ 	    after_syntax = SYNTAX (d == end1 ? *string2 : *d);
+ 	    if ((before_syntax == Sword || before_syntax == Ssymbol)
+ 		!= (after_syntax == Sword || after_syntax == Ssymbol))
+ 	      break;
+ 	    goto fail;
+           }
  
  #ifdef emacs
  	case before_dot:


*** tags.el.orig	Fri Feb  2 11:56:15 1990
--- tags.el	Fri Feb  2 11:56:52 1990
***************
*** 151,157 ****
         (setq tagname last-tag))
       (setq last-tag tagname)
       (while (progn
! 	      (if (not (search-forward tagname nil t))
  		  (error "No %sentries containing %s"
  			 (if next "more " "") tagname))
  	      (not (looking-at "[^\n\177]*\177"))))
--- 151,157 ----
         (setq tagname last-tag))
       (setq last-tag tagname)
       (while (progn
! 	      (if (not (re-search-forward tagname nil t))
  		  (error "No %sentries containing %s"
  			 (if next "more " "") tagname))
  	      (not (looking-at "[^\n\177]*\177"))))

kim@spock (Kim Letkeman) (02/05/90)

In article <9002021705.AA06173@sn1987a.compass.com>, worley@compass.UUCP (Dale Worley) writes:
| When I try to look up a tag with find-tag, I wind up seeing every tag
| that contains the tag I'm looking for.  (For instance, look up
| "find-file" in Emacs' tags.)  I've made changes to fix this problem.
| The first change is to change the search in find-tag to be a regexp
| search rather than a string search.  (This should be
|[...]

We made similar changes in our environment, but with one major
difference. We created a second function called find-tag-exact. This
function performs the exact match on a word boundary. We did this
because of the fairly severe penalty you pay for regexp searching over
string searching (the difference is very noticeable even on a sun4).

We would recommend the second function in environments with a lot of
symbols. (Our environment contains about 1500 pascal files with many
symbols each - the tags file is 486191 bytes long.)

By the way, we created an awk script to handle the creation of a tags
file for our variant of pascal. I include it here in case anyone might
find it useful.

Kim

--------cut here-----------------------------------------------------
#! /bin/sh
#

# Creates an emacs (etags equivalent) TAGS file for a variant of
# pascal. Note that this is a filter, so output should be piped to
# your tags file directly.
#
# Syntax expected:
#
# function [entry] <name> (parms)
#   or
# procedure [entry] <name> (parms)
#
# where the term entry signifies an exported definition.
#
# Example: To create a file named TAGS for the current srcdir, just
# type:
#
# etags $SRCDIR/*.pas > TAGS &
#
# and in 15 minutes (sun4) or 40 minutes (sun3), you'll have a tags
# file. 

gawk '\

BEGIN { 
	IGNORECASE = 1
}

function dump_tag_entries()
{
	if (tag_entry_line>2) {
		tag_entries[2] = sprintf ("%s,%d",curr_filename,
					total_chars_in_tag_entry);
		for (i=1; i<=tag_entry_line; i++) {
			print tag_entries[i];
			delete tag_entries[i];
		}
	}
}


# This short null pattern checks to see if we have changed files. If so,
# we dump the previous entries and reinitialize the array and supporting
# variables.

{
	if (FILENAME!=curr_filename) {
		dump_tag_entries();
		curr_filename = FILENAME;
		curr_char_posn = 0;
		total_chars_in_tag_entry = 0;
		tag_entries[1] = "\f";	# each file section of TAGS starts with a form feed line
		tag_entry_line = 2;
	}
}

# This pattern selects procedure and function definitions with 99.9%
# accuracy. Only the ugliest of code can slip through and I dont want
# to look at that stuff anyway. The if statement eliminates code that 
# does not put the function or procedure name on the same line as the 
# definition. The search string is kept to a minimum length by removing
# any parameters.

/^[ 	]*(procedure|function)[ 	]/ {
	if (($2 ~ /^entry$/ && NF>2) ||
	    ($2 !~ /^entry$/ && NF>1)) {
		search_pattern = $0;
		sub ("[ 	]*\\(.*$", "", search_pattern);
		sub ("[ 	]*;.*$", "", search_pattern);
		tag_entries[++tag_entry_line] = sprintf ("%s%c%d,%d",
							search_pattern, 127,
							FNR, curr_char_posn);
		total_chars_in_tag_entry += \
			length(tag_entries[tag_entry_line])+1;
	}
}

# This action updates the character offset for the search string.
	
{
	curr_char_posn += length($0)+1;
}

# After all is said and done there is still one more set of entries.

END { 
	dump_tag_entries()
}' "$@"
-- 
Kim Letkeman    uunet!mitel!spock!kim

worley@compass.com (Dale Worley) (02/09/90)

   From: kim@spock (Kim Letkeman)
   Newsgroups: comp.emacs
   Date: 5 Feb 90 14:00:02 GMT

   We made similar changes in our environment, but with one major
   difference. We created a second function called find-tag-exact. This
   function performs the exact match on a word boundary. We did this
   because of the fairly severe penalty you pay for regexp searching over
   string searching (the difference is very noticeable even on a sun4).

I don't entirely understand this.  Matching on a word boundary doesn't
really get the job done, since, say, a "-" is a non-word character.
Thus, "find-file" matches "dired-find-file" on word boundaries.

Also, I don't see how you can avoid the penalty for regexp searching
if you restrict your matches to occur on word boundaries, since that
*is* a regexp search.  If there is some clever trick using string
searches and other tests that's faster than an equivalent regexp
search, then the regexp code needs to be improved!
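
For what it's worth, here is a small sketch of what "a string search
plus other tests" might look like, and of why the choice of boundary
matters.  It is only an illustration: find_bounded and the two
predicates are hypothetical names, the predicates assume that letters
and digits are word constituents and that "-" and "_" are the only
extra symbol constituents, and nothing here says anything about
relative speed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Predicate deciding whether a character can be part of a name. */
typedef int (*constituent_fn)(unsigned char);

/* Word constituents only: letters and digits (an assumption). */
int word_constituent(unsigned char c)
{
    return isalnum(c) != 0;
}

/* Word constituents plus "-" and "_" as symbol constituents (an assumption). */
int symbol_constituent(unsigned char c)
{
    return isalnum(c) || c == '-' || c == '_';
}

/* Return the offset of the first occurrence of name in text that has no
   constituent character directly before or after it, or -1 if there is
   no such occurrence. */
long find_bounded(const char *text, const char *name,
                  constituent_fn is_constituent)
{
    size_t name_len = strlen(name);
    const char *p = text;

    assert(name_len > 0);
    while ((p = strstr(p, name)) != NULL) {
        const char *after = p + name_len;
        int left_ok = (p == text) || !is_constituent((unsigned char)p[-1]);
        int right_ok = (*after == '\0')
            || !is_constituent((unsigned char)*after);
        if (left_ok && right_ok)
            return (long)(p - text);
        p++;  /* resume one character past this candidate */
    }
    return -1;
}
```

Searching the text "dired-find-file\nfind-file-other-window\nfind-file\n"
for "find-file" with word_constituent stops inside "dired-find-file"
(offset 6), since "-" is not a word character.  With
symbol_constituent the first two lines are rejected and the match is
the third line (offset 39).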

Dale Worley		Compass, Inc.			worley@compass.com
--
If the United States were really a democracy, we would have concentration
camps for AIDS patients.

pedz@pedz.austin.ibm.com (Perry Smith) (02/09/90)

My solution to the problem was to change the search to a re-search so
that \< and \> could be used along with any other regular expression
construct.  The one pitfall is that '_' is not a word character in the
TAGS file normally and that screws things up as well (but can be fixed
fairly easily).
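
A toy sketch of the '_' pitfall, assuming a simple 256-entry table in
place of the real syntax table (word_table, init_word_table and
matches_as_word_at are hypothetical names, not anything in Emacs).
The check approximates what \<name\> requires for a name that begins
and ends with a word character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: nonzero means "word constituent". */
static unsigned char word_table[256];

/* Fill the table with letters and digits, and optionally '_'. */
void init_word_table(int underscore_is_word)
{
    int c;
    for (c = 0; c < 256; c++)
        word_table[c] = (unsigned char)(isalnum(c) != 0);
    word_table['_'] = (unsigned char)(underscore_is_word != 0);
}

/* Nonzero if name occurs in text at offset pos with no word constituent
   directly before or after it. */
int matches_as_word_at(const char *text, size_t pos, const char *name)
{
    size_t text_len = strlen(text);
    size_t name_len = strlen(name);

    assert(name_len > 0);
    if (pos > text_len || name_len > text_len - pos)
        return 0;
    if (strncmp(text + pos, name, name_len) != 0)
        return 0;
    if (pos > 0 && word_table[(unsigned char)text[pos - 1]])
        return 0;  /* previous character continues a word */
    if (word_table[(unsigned char)text[pos + name_len]])
        return 0;  /* next character continues a word; NUL never does */
    return 1;
}
```

With '_' left out of the table, "count" matches at offset 5 of
"line_count"; after init_word_table(1) it no longer does, which is the
kind of fix meant above.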

pedz