[comp.lang.c++] name mangler

vaughan%cadillac.cad.mcc.com@MCC.COM (Paul Vaughan) (04/02/91)

Several people including myself have asked for a name mangler. Michael
Tiemann has replied that such a thing exists in the g++ file
cplus-method.c, but I haven't figured out any way to use that just
yet.  I've been looking into the problem a bit and have come to a
sticky question: Exactly what input strings should a name mangler
accept?


Michael's mangling function is called as part of the process of
compiling code and is oriented very strongly toward that purpose. It
appears to accept, in parse tree form, any function declaration (or
the start of a definition) that is legal in the current context.  Any
types mentioned must have been declared; default argument expressions
are admitted; explicit type declarations (e.g. void foo(struct Bar);)
are allowed; and so on. This makes for a very complex grammar for
function declarations: for instance, it subsumes the expression
grammar of C++.

I looked into the way that gdb handles name mangling. It avoids the
issue by only doing name demangling. That is, when you type in a C++
function name (not a complete declaration) like "foo" to set a
breakpoint, it looks through all symbols that start with foo,
demangles any matches and compares the demangled base name (the
demangler has an option to return only the base name, instead of a
full declaration with argument specs) against the given base name. (As
an aside, this code doesn't quite work for ordinary functions in gdb
3.6, and I don't understand how it is intended to work when a function
is overloaded.)
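
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not gdb's code: it assumes symbols shaped like the g++-1.39 example given further down in this post ("_" + name + "__F" + argument codes), and base_name() and find_by_base_name() are hypothetical helpers invented for the sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: recover the base name from a symbol of the ASSUMED
// form "_<name>__F<args>". Symbols without a "__" part (plain C functions)
// just lose their leading underscore.
std::string base_name(const std::string& symbol) {
    std::string s = symbol;
    if (!s.empty() && s[0] == '_') s.erase(0, 1);
    std::string::size_type sep = s.find("__");
    if (sep != std::string::npos && sep > 0) s.erase(sep);
    return s;
}

// Collect every symbol whose demangled base name equals `wanted`.
std::vector<std::string> find_by_base_name(const std::vector<std::string>& symbols,
                                           const std::string& wanted) {
    std::vector<std::string> hits;
    const std::string prefix = "_" + wanted;
    for (const std::string& sym : symbols) {
        // Cheap filter first: the symbol must start with the wanted name.
        if (sym.compare(0, prefix.size(), prefix) != 0) continue;
        // Then compare demangled base names, so "foobar" does not match "foo".
        if (base_name(sym) == wanted) hits.push_back(sym);
    }
    return hits;
}
```

Note that an overloaded name yields several hits, so a caller still has to choose among them, which is the open question raised in the aside above.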

The reason I wanted a name mangler was in connection with dynamic
linking. I'd like to be able to specify a full declaration (but not in
any context of typedefed names, and without the return type) and get
out the mangled symbol.  For instance,


"foo(Foo, Bar*, int, int)"

would give 

"_foo__FG3FooP3Barii"

for g++-1.39
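
To make the idea concrete, here is a toy mangler for such a restricted specification. The encoding rules are an assumption read off the single example above ("i" for int, "G" + length + name for a class passed by value, "P" as a pointer prefix); they are not taken from cplus-method.c, and mangle_spec() and its helpers are hypothetical names. Anything outside that subset is rejected with an empty string rather than guessed at.

```cpp
#include <cctype>
#include <string>

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    std::string::size_type b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

static bool is_identifier(const std::string& s) {
    if (s.empty()) return false;
    if (!(std::isalpha(static_cast<unsigned char>(s[0])) || s[0] == '_')) return false;
    for (char c : s)
        if (!(std::isalnum(static_cast<unsigned char>(c)) || c == '_')) return false;
    return true;
}

// Encode one argument; returns "" for anything the example does not cover.
static std::string encode_arg(const std::string& raw) {
    std::string arg = trim(raw);
    bool pointer = false;
    if (!arg.empty() && arg[arg.size() - 1] == '*') {
        pointer = true;
        arg = trim(arg.substr(0, arg.size() - 1));
    }
    if (!is_identifier(arg)) return "";  // rejects "struct Bar", "const X", "X**"
    // Codes for the other basic types are not shown in the thread: reject them.
    static const char* const unknown[] = {"char", "short", "long", "float",
                                          "double", "unsigned", "void"};
    for (const char* u : unknown)
        if (arg == u) return "";
    if (arg == "int") return pointer ? "Pi" : "i";
    std::string named = std::to_string(arg.size()) + arg;  // ASSUMED: <len><name>
    return (pointer ? "P" : "G") + named;                  // ASSUMED: G = by value
}

// Mangle "name(arg, arg, ...)"; returns "" when the input is not accepted.
std::string mangle_spec(const std::string& spec) {
    std::string::size_type open = spec.find('(');
    std::string::size_type close = spec.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) return "";
    std::string name = trim(spec.substr(0, open));
    if (!is_identifier(name)) return "";
    std::string args = spec.substr(open + 1, close - open - 1);
    if (trim(args).empty()) return "";  // empty list: encoding not shown above
    std::string out = "_" + name + "__F";
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type comma = args.find(',', start);
        std::string code = encode_arg(args.substr(
            start, comma == std::string::npos ? std::string::npos : comma - start));
        if (code.empty()) return "";
        out += code;
        if (comma == std::string::npos) break;
        start = comma + 1;
    }
    return out;
}
```

Under these assumptions, mangle_spec("foo(Foo, Bar*, int, int)") reproduces the symbol quoted above. Every identifier other than a basic type is simply taken to name a class, which is exactly the point where typedef context would be needed in real code.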

I'm wondering, is this even feasible? Would it be necessary to have
built up the context of typedef'ed names? Suppose that certain
restrictions on the input format of unmangled names, such as
prohibiting

	foo(Foo, struct Bar*, int, int)

were in effect. Then would there exist a 1:1 mapping between legal
mangled and unmangled names?

Does anyone have a specification for the mangler that is simpler than
ferreting it out of the demangler in cplus-dem.cc?

How many people would be interested in having a bison grammar based
mangler and demangler?

tiemann@CYGNUS.COM (Michael Tiemann) (04/02/91)

    Does anyone have a specification for the mangler that is simpler than
    ferreting it out of the demangler in cplus-dem.cc?

You need to do the whole job because of typedefs.  I.e.,

	typedef int foo;
	typedef int bar;

	foo f (bar);

mangles to the same thing that

	bar f (foo);

mangles to.
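
The point can be checked in C++ itself: a typedef names an existing type rather than creating a new one, so both declarations have the type int(int). The sketch below uses std::is_same and decltype, which postdate this thread; the definition of f is added only so the example links.

```cpp
#include <type_traits>

typedef int foo;
typedef int bar;

foo f(bar);  // first declaration
bar f(foo);  // redeclares the same function: both spell int f(int)

// The aliases are indistinguishable at the type level, so a mangler that
// works from types alone cannot tell the two spellings apart.
static_assert(std::is_same<foo, bar>::value, "foo and bar are both int");
static_assert(std::is_same<decltype(f), int(int)>::value, "f has type int(int)");

// Single definition satisfying both declarations above.
int f(int x) { return x + 1; }
```

Since only one function f exists here, only one symbol can be emitted for it, whichever alias the source happened to use.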

Michael

vaughan%cadillac.cad.mcc.com@MCC.COM (Paul Vaughan) (04/03/91)

	You need to do the whole job because of typedefs.  I.e.,

		typedef int foo;
		typedef int bar;

		foo f (bar);

I was thinking that in the "function specification language", typedefs
in this sense would not be allowed. Even though f might be declared

foo f(bar);

in some source code, it would have to be declared

int f(int)

in a function specification to be accepted by the mangler. Note that
there are other differences between this function specification
language and C++.  For instance, 

class Foo  {
  int foo(Foo*);
};

int Foo::foo(Foo*);

isn't a valid declaration in C++. (Oooh, speaking of valid C++, note
that the above is accepted by g++-1.39 but not by cfront 2.0. Bug?)
I was thinking that any identifier (name other than reserved words,
symbols, or basic types) would be assumed to directly name a user
defined type.  Typedefed aliases or full anonymous struct definitions
would not be accepted.

It seems clear that one requirement would be that the mangler be able
to accept any output generated by the demangler and vice versa. I
think the simplifications I'm making are consistent with that. However
it's not clear what other requirements exist for a useful tool. For
instance, these specs would not necessarily let you directly use
pieces of source code or output from the compiler as input to the
mangler. I don't see any way of creating such a mangler without
analyzing the entire source code for a module and that's significantly
more work than I want the mangler to do. Aside from compilation, it
seems the reason most people have cited for wanting a mangler is for
dynamic loading. I'm not sure if these simplifications would
adequately support that.
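
As an illustration of the round-trip requirement, here is a toy demangler whose output is in the restricted specification form. It handles only the codes visible in the g++-1.39 example earlier in the thread, and their meaning ("i" = int, "P" = pointer prefix, "G" = class passed by value, decimal length before a class name) is an assumption inferred from that one example, not taken from cplus-dem.cc. The names decode_type() and demangle() are hypothetical.

```cpp
#include <cctype>
#include <string>

// Decode one type starting at s[pos], advancing pos past it.
// Returns "" for anything outside the assumed subset.
static std::string decode_type(const std::string& s, std::string::size_type& pos) {
    if (pos >= s.size()) return "";
    if (s[pos] == 'i') { ++pos; return "int"; }
    if (s[pos] == 'P') {                      // ASSUMED: pointer to the next type
        ++pos;
        std::string inner = decode_type(s, pos);
        return inner.empty() ? "" : inner + "*";
    }
    if (s[pos] == 'G') ++pos;                 // ASSUMED: class passed by value
    // What remains must be <decimal length><name>.
    std::string::size_type len = 0;
    bool digits = false;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        len = len * 10 + static_cast<std::string::size_type>(s[pos] - '0');
        ++pos;
        digits = true;
    }
    if (!digits || len == 0 || len > s.size() - pos) return "";
    std::string name = s.substr(pos, len);
    pos += len;
    return name;
}

// Turn "_<name>__F<codes>" into "name(type, type, ...)"; "" if not accepted.
std::string demangle(const std::string& symbol) {
    std::string::size_type sep = symbol.find("__F");
    if (symbol.empty() || symbol[0] != '_' || sep == std::string::npos || sep < 2)
        return "";
    std::string out = symbol.substr(1, sep - 1) + "(";
    std::string::size_type pos = sep + 3;
    bool first = true;
    while (pos < symbol.size()) {
        std::string type = decode_type(symbol, pos);
        if (type.empty()) return "";
        if (!first) out += ", ";
        out += type;
        first = false;
    }
    if (first) return "";  // empty argument list: encoding not shown in the thread
    return out + ")";
}
```

Applied to the symbol from the earlier example, this yields a specification with no typedef names and no return type, which is the input form a matching mangler would have to accept for the pair to round-trip.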

tiemann@CYGNUS.COM (Michael Tiemann) (04/03/91)

I think the way to handle dynamic loading is related to the way that
parameterized types must be handled.  I would like to see discussion
about how the linker and compiler should communicate to handle both
jobs with equal facility.  If we can get this working, then using the
name mangler that comes with the compiler will be a simple application
of software reuse.

Michael

kelley@mpd.tandem.com (Michael Kelley) (04/04/91)

I did get a number of people writing in about my name mangler, so here it is.
I haven't had time to move it to any other machine than a sparc, so you may
have to do some hacking to plug in the right tools (nawk, nm, etc.)  Same goes
for its use with g++: I just don't know.  But it's a start, anyway.  

#!/bin/sh
# $Id: mangle,v 1.1 91/04/03 12:04:34 kelley Exp $
# 
# usage: mangle [-c]
# 	generates a map between demangled and mangled names to stdout, with
# 	a mangled name on the line following its demangled counterpart.
# 	It reads from stdin. With the '-c' argument, you'll get a C++ structure.
#
# written by Mike Kelley, email kelley@mpd.tandem.com
#
## Copyright 1991, by Tandem Computers, Incorporated, Austin, Texas.
## 
##                         All Rights Reserved
## 
## Permission to use, copy, modify, and distribute this software and its 
## documentation for any purpose and without fee is hereby granted, 
## provided that the above copyright notice appear in all copies and that
## both that copyright notice and this permission notice appear in 
## supporting documentation, and that the names of Tandem not be
## used in advertising or publicity pertaining to distribution of the
## software without specific, written prior permission.  
## 
## TANDEM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
## ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
## DIGITAL BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
## ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
## WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
## ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
## SOFTWARE.
## 
#
# known limitations/bugs: 
#	- need to investigate the way enums are handled for C++ variants; with
#	  cfront, they are mangled just as are class names
#	- if your filter doesn't work, then neither will this
#	- doesn't work for constructors
#	- you can't default "unsigned" to "unsigned int"
#	- demangled names given as input get changed to their fundamental
#	  counterparts (via typedef) as output--you'll have to do your own
#	  substitution
#	- it's *very* slow
#	- little or no error checking
#
TMPDIR=${TMPDIR-/tmp}
MANGDIR=$TMPDIR/mangle$$
usage="usage: `basename $0` [-c]"
here=`pwd`
CC=/usr/CC/sun4/CC
CPP=/lib/cpp
filter="c++filt -s"
# use "nm -B" on SYSV variants
NM=nm

if test $# -eq 1; then
	if test "$1" = "-c"; then
		format=1
	else
		echo $usage
		exit 1
	fi
elif test $# -ne 0; then
	echo $usage
	exit 1
fi

mkdir $MANGDIR
trap "cd $here; rm -rf $MANGDIR; exit 1" 1 2 3 15
cd $MANGDIR
touch input
echo '#include "classes.h"' > nonMember.c
$CPP | sed -e "s/^[ ]*//" -e "s/;//g" | while read prototype; do
	# take out line control information
	(echo $prototype | egrep '#.*' > /dev/null) && continue;
	# look for (only one line) typedefs
	echo $prototype | egrep 'typedef .*' > /dev/null # no '-s' on SYSV
	if test $? -eq 0; then
		echo $prototype >> input
		echo $prototype | sed -e "s/unsigned //" \
		-e "s/typedef[ ]*[_a-zA-Z]*[( \*]*\([^() \*]*\).*/TYPEDEF,\1/" \
			>> input
	else
		echo $prototype | egrep '[_a-zA-Z]+::.*' > /dev/null;
		if test $? -ne 0; then		# not a class member
			call="`echo $prototype | sed 's/:://'`";
			echo "static void $call { }" >> nonMember.c;
		else
			class=`expr "$prototype" : '\(.*\)::.*'`;
			call=`expr "$prototype" : '.*::\(.*\)'`;
			file=${class}.c;
			if test ! -f $file; then
				echo '#include "classes.h"' > $file;
				echo "class $class {" >> $file;
			fi
			echo "virtual void $call { }" >> $file;
		fi
		echo $call | tr '()' ',,' | \
			sed -e "s/unsigned //" -e "s/const //" -e "s/[ \*&]//g" >> input;
	fi
done

nawk -F, ' \
	BEGIN { \
		typedef[1] = "char"; \
		typedef[2] = "short"; \
		typedef[3] = "int"; \
		typedef[4] = "long"; \
		typedef[5] = "float"; \
		typedef[6] = "double"; \
		ntypes = 6; \
	} \
	{ \
		if (substr($1, 1, 8) == "typedef ") printf("%s;\n", $0); \
		else if ($1 == "TYPEDEF") typedef[++ntypes] = $2; \
		else { \
			for (i = 2; i < NF; i++) { \
				# cant get expr in array to work...
				# if (!($i in typedef)) printf("class %s;\n", $i); \
				if (length($i) == 0) break; \
				for (j = 1; j <= ntypes; j++) if (typedef[j] == $i) break; \
				if (j > ntypes) printf("class %s;\n", $i); \
				# could add to typedef array, but better to let it go...
			} \
		} \
	} ' input > classes.h

if [ "$format" ]; then
cat << END_CAT
static struct MangleNode {
	const char *prototype;
	const char *mangled;
	void (*entry)(...);
} mapping[] = {
END_CAT
fi

for f in *.c; do
	if test $f != "nonMember.c"; then
		echo "};" >> $f;
	elif test `wc -l < $f` -eq 1; then	# only the #include line: skip it
		continue;
	fi
	object="`echo $f | cut -f1 -d.`.o"
	$CC -c -g $f 2> /dev/null
	(test $? -ne 0) && continue
	# sed portion to take out '_' Sun prepends before function name
	$NM $object | $filter | sed "s/\([: ]\)_/\1/g" > input
	if [ "$format" ]; then
		nawk '$2 == "t" { \
			printf("\"%s", $3); \
			for (i = 4; i < NF; i++) printf(" %s", $i); \
			printf("\", \"%s\", 0,\n", $NF); \
		} ' input;
	else
		nawk '$2 == "t" { \
			for (i = 3; i <= NF; i++) \
				if (i == NF) printf("\n%s\n", $i); \
				else printf("%s ", $i); \
		} ' input;
	fi
done
if [ "$format" ]; then
	echo "};"
fi

cd $here
rm -rf $MANGDIR
#
# here begins sample input: just cut here, whack off the beginning '##',
# and you're on your way...
##typedef unsigned int (*PfHash)(const char*);
##typedef char Boolean;
##X::f(int, const Y&);
##X::f();
##Y::f(int, const Y&);
##zoo::ack(float, unsigned long, double*, Z*);
##job(int, foobar&);
##::go(Boolean);
##HashTable::setup(PfHash, unsigned int);
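
For the dynamic-loading use discussed in this thread, the default (non "-c") output of the script could be read back into a lookup table along these lines. This is a sketch under the format stated in the usage comment, a demangled line followed by its mangled line; the exact text of each line depends on the local nm and c++filt, and read_mangle_map() and parse_mangle_output() are hypothetical names.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Drop trailing blanks, tabs and carriage returns.
static std::string rtrim(std::string s) {
    while (!s.empty() && (s[s.size() - 1] == ' ' || s[s.size() - 1] == '\t' ||
                          s[s.size() - 1] == '\r'))
        s.erase(s.size() - 1);
    return s;
}

// Read "demangled line, then mangled line" pairs into a map keyed by the
// demangled prototype.
std::map<std::string, std::string> read_mangle_map(std::istream& in) {
    std::map<std::string, std::string> table;
    std::string demangled, mangled;
    while (std::getline(in, demangled)) {
        if (rtrim(demangled).empty()) continue;  // tolerate blank lines
        if (!std::getline(in, mangled)) break;   // unpaired last line: ignore it
        table[rtrim(demangled)] = rtrim(mangled);
    }
    return table;
}

// Convenience wrapper for text that is already in memory.
std::map<std::string, std::string> parse_mangle_output(const std::string& text) {
    std::istringstream in(text);
    return read_mangle_map(in);
}
```

The key must match the demangled text character for character, so a caller has to spell prototypes the way the filter prints them, typedefs already replaced by their fundamental types as noted in the limitations list.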

Mike Kelley
Tandem Computers, Austin, TX
kelley@mpd.tandem.com
(512) 244-8830 / Fax (512) 244-8247