[comp.protocols.tcp-ip] NetFind and its Internet load

xcaret@csn.org (Xcaret Research) (05/03/91)

Some concerns have been raised in various news groups about the
potential Internet load and legal propriety of NetFind, a white pages
tool sold and distributed by Xcaret Research, Inc.  Xcaret Research
appreciates the concern of individuals and organizations who work to
keep network resources from being abused, and we would like to make it
clear that we are also concerned about such abuse.  In fact, the
authors of NetFind were very careful to consider the load imposed by
NetFind, and they conducted a six-month study of the usage of NetFind
and the load it imposes on the Internet.

In this message we give an overview of NetFind and then address these concerns.

Given the name of a person on the Internet and a rough description of
where the person works (such as the name of the institution or the
city/state/country in which it is located), NetFind searches for
electronic mailbox information about the person.  NetFind uses a unique
method to actively search the Internet for the person.  It does not
attempt to keep a database of users across the Internet; such a database
would be quite large, difficult to populate completely, and constantly
out of date.  Instead, NetFind uses the natural database of the Internet
itself: it sends multiple parallel requests across the Internet to
machines where it suspects the person may reside.  The whole process is
surprisingly fast, because NetFind sends searches out in parallel.
NetFind can locate over 1.4 million people in 2,500 different sites
around the world, with response time on the order of 5-30 seconds per
search.

The primary concern that arose about NetFind was its potential load on
the Internet.  Clearly, any tool that uses parallel searches to descend
from the top of the Domain tree and search each server would be
unreasonably costly.  NetFind does not do this.  The NetFind search
procedure uses several mechanisms that significantly limit the scope of
searches.  First, the user selects at most 3 domains to search (for
example, "colorado.edu") from the list of domains
matching the organization component of the search request.  Next,
NetFind queries the Domain Name System (DNS) to locate authoritative name
server hosts for each of these domains.  The idea is that these hosts
are often on central administrative machines, with accounts and/or mail
forwarding information for many users at a site.  Each of these machines
is then queried using the Simple Mail Transfer Protocol, in an attempt
to find mail forwarding information about the specified user.  If such
information is found, the located machines are then probed using the
"finger" protocol, to reveal more detailed information about the person
being sought.  The results from finger searches can sometimes yield
other machines to search as well.  A number of mechanisms are used to
allow searches to proceed when some of the protocols are not supported
on remote hosts.  Ten lightweight threads are used to allow sets of
DNS/SMTP/finger lookup sequences to proceed in parallel, to increase
resilience to host and network failures.  The tool enforces a number of
other restrictions on the cost of searches, such as the total number of
hosts to finger.
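
For readers who want a concrete picture of that sequence, here is a
minimal sketch in Python.  It is illustrative only, not the actual
NetFind code: it assumes the third-party dnspython package for the NS
lookup, it uses the SMTP VRFY command as a stand-in for the
mail-forwarding query, it fingers the same host it queried (the real
tool follows the forwarding information in the reply), and it omits
the parallel threads and cost limits described above.

        import socket
        import smtplib
        import dns.resolver    # third-party "dnspython" package

        def netfind_style_search(user, domain):
            # Step 1: locate authoritative name server hosts for the domain.
            ns_hosts = [str(r.target).rstrip(".")
                        for r in dns.resolver.resolve(domain, "NS")]
            for host in ns_hosts:
                # Step 2: ask the host's SMTP server about mail forwarding
                # for the user (VRFY; many servers refuse it).
                try:
                    smtp = smtplib.SMTP(host, timeout=10)
                    code, _ = smtp.verify(user)
                    smtp.quit()
                except (OSError, smtplib.SMTPException):
                    continue
                if code // 100 != 2:
                    continue
                # Step 3: probe the machine with the finger protocol
                # (TCP port 79) for more detail about the person.
                with socket.create_connection((host, 79), timeout=10) as s:
                    s.sendall((user + "\r\n").encode("ascii"))
                    print(s.makefile(encoding="ascii", errors="replace").read())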

NetFind began as a research prototype, designed and implemented by
Michael Schwartz and Panagiotis Tsirigotis at the University of
Colorado.  Before becoming a commercial product, the research prototype
was deployed at approximately 50 institutions worldwide, and extensive
measurements were collected over 6 months of use, covering the cost of
searches, the time distribution of searches, and so on.

The average search uses 136 packets.  While this is more than typical
directory services (like X.500) use, NetFind has significantly larger scope
and better timeliness properties than these other services, since it
gets information from the sources where people do their daily computing,
rather than from auxiliary databases.  To put the cost into perspective,
it is equivalent to a very short telnet session or a moderate-size FTP
transfer.

We estimate that if NetFind were to be used by one hundred people at
each site on the Internet where NetFind can find people, it would
increase the NSFNET load by approximately 1.4% above its current load of
4 billion packets per month.  In comparison, FTP currently accounts for
23% of the NSFNET packets.  Moreover, the load generated by NetFind
represents the addition of a significant new type of service.  Providing
new services necessarily will increase network load.
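
Spelling out the arithmetic behind the 1.4% estimate above (the
per-user search rate is implied by these figures rather than measured
separately):

        1.4% of 4 billion packets/month      =  56 million packets/month
        56 million / 136 packets per search  ~  410,000 searches/month
        410,000 / (100 users x 2,500 sites)  ~  1.6 searches/user/month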

A detailed discussion of the research that led to the NetFind product is
available in the paper "Experience with a Semantically Cognizant
Internet White Pages Directory Tool", Journal of Internetworking:
Research and Experience 2, 1 (March, 1991).

As for the legal issue: Some people have expressed concern that NetFind
represents an inappropriate use of the Internet, because it is
commercial software.  This is a misinterpretation of network appropriate
use policy, which simply regulates the type of traffic that traverses
the network (as opposed to the type of software that generates this
traffic).  There are many pieces of commercial software that generate
packets on the Internet, such as Sun's TCP implementation.  As with
these other pieces of software, appropriate use responsibility rests in
the hands of the user.  Just as it would be inappropriate to use FTP to
transfer commercial data across the Internet, it would be inappropriate
to use NetFind for commercial purposes.  Yet, there are many appropriate
uses for FTP, and for NetFind.

If you have further questions about NetFind, please contact:

	Xcaret Research, Inc.
	2060 Broadway, Suite 320
	Boulder, CO  80302
	(800) 736-1285
	netfind@xcaret.com

emv@ox.com (Ed Vielmetti) (05/03/91)

I'll believe all of the quantitative measurements about NetFind being
sparing of Internet resources, carefully sending out as few packets as
possible and not doing anything stupid.  Think of it as an expert
system, where the expert modeled is the "expert internet user".
From the description of it, I think that an expert internet user like
myself could do a better job, though perhaps not as quickly, because I
have access to more specialized and better databases than just
DNS/SMTP/finger, and more tricky and unobvious ways of looking.

My major problem with tools like NetFind is that although they address
the "resource discovery" problem for a single user, they don't have
any positive side-effects for the rest of the internet.  Nothing about
NetFind adds to any Internet infrastructure; it doesn't make the
problem any easier for the next person down the line or somewhere
else who has the same problem.  In comparison, the efforts of the
various X.500 projects produce something tangible that the rest of the
network can consume later.  

Systems which consume Internet resources and don't have any positive
benefits for the rest of the network are Evil and Rude, no matter how
small the resources are that they consume.  Things which have been
placed into this category at various times are email-based archive
servers (because of their accidental and heavy loads on transit mail
systems), network management by means of pinging random machines,
"mail throughput testers" which send mail through a congested system
to see how congested the mail system is (!?), and rebroadcasting huge
binaries to usenet newsgroups upon the request of one or two people
who missed it.  A badly implemented NetFind could fall into this
category; there's no sign that it actually does.

In this particular case, however, since the research has been
published, the prospective user of NetFind can look up the algorithms
involved and see just how clever the product is before buying.  Since
most of the ad hoc expert systems for Internet user location haven't
been written down, codified, and studied, this is useful information
which deserves a closer look.   See also
	latour.colorado.edu:/pub/RD.Papers/White.Pages.ps.Z
a preprint of the NetFind paper in the Journal of Internetworking.

If I read the paper the right way, users of NetFind are
expected to monitor usenet news and store a database of hostname /
organization pairs on disk, like the following MH scan would do:
	scan -format '%{Organization} %{From}' 
and keep this around for a while (after trimming out user names).
Modulo a few goofy things you'll see with people putting in their own
headers (scan alt.sex.pictures to see that) and bland usenet-internet
gateways (see this newsgroup for that), that information's rather good.
Keep it for a few months, for the newsgroups you expect to care about,
and your ability to find people should be substantially enhanced.
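
A rough sketch of the trimming step, in Python (the whitespace
splitting is my guess at the scan output format; real From headers are
messier than this):

        # Reduce "Organization From" lines from the scan command above
        # to unique (host, organization) pairs, dropping user names.
        import sys

        pairs = set()
        for line in sys.stdin:
            fields = line.split()
            if len(fields) < 2 or "@" not in fields[-1]:
                continue                    # no parseable From address
            host = fields[-1].split("@")[-1].strip("<>").lower()
            org = " ".join(fields[:-1])
            pairs.add((host, org))
        for host, org in sorted(pairs):
            print(host, org)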

-- 
 Msen	Edward Vielmetti
/|---	moderator, comp.archives
	emv@msen.com

"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
			High-Performance Computing Act of 1991, S. 218

schwartz@latour.colorado.edu (Mike Schwartz) (05/07/91)

In article <EMV.91May3032230@poe.aa.ox.com> emv@ox.com (Ed Vielmetti) writes:

> My major problem with tools like NetFind is that although they address
> the "resource discovery" problem for a single user, they don't have any
> positive side-effects for the rest of the internet.  Nothing about
> NetFind adds to any Internet infrastructure; it doesn't make the problem
> any easier for the next person down the line or somewhere else who has
> the same problem.  In comparison, the efforts of the various X.500
> projects produce something tangible that the rest of the network can
> consume later.

Maybe this is too simplistic an interpretation, but it seems your
argument boils down to the fact that NetFind is basically a client of
existing services, rather than a new service in its own right (like
X.500).  But from a user's perspective, this distinction is irrelevant.
What counts is whether the user can find the information they need, how
easily, and at what cost to the network.

It's true that keeping information in a server would allow that
information to be cached for future searches, but I have found that if
someone is "reachable" by NetFind, it is usually pretty easy to find
them with NetFind.  There isn't much need to look at what someone else
did to search for that person.  As an aside, searching for more general
types of resources (like anonymous FTP files) is a harder problem, and
the architecture I use for that project does utilize the results of
previous users' searches in facilitating future users' searches.

I think your objection (that a tool helps only one user at the time of
use, without contributing to other users through that use) is really
wrong.  If this were the standard against which all software was
compared, we would get rid of most of the software in the world.

I also think your view of what is "tangible" is biased by your role as
moderator of comp.archives.  This is a nice contribution to "network
infrastructure", but as I see it, generating information collections
(which is what both comp.archives and X.500 do, in a general sense) is
only one way for users to get and share information.

In fact, I believe it makes more sense to search for some types of
resources where they naturally reside than it does to build a database
about them, since the database needs to be populated and kept up to date.
I see at least 3 cases where this can be true:
        1.  Dynamic, timely data.
        2.  Data with problems of transfer of authority (i.e., where people
            may not be willing to relinquish control of their data to a
            relatively centralized administration, like a server per site).
        3.  Large information spaces where only a small fraction of the
            data will ever be needed (and hence the effort to populate a
            database will not be effectively amortized).
Internet white pages fits at least (1), since users move around, and
tracking their movements in a database presents administrative problems.
I believe it fits (2) and (3) as well.

 - Mike Schwartz
   Dept. of Computer Science
   Univ. of Colorado - Boulder

emv@ox.com (Ed Vielmetti) (05/07/91)

In article <1991May6.173923.174@colorado.edu> schwartz@latour.colorado.edu (Mike Schwartz) writes:

   > My major problem with tools like NetFind is that although they address
   > the "resource discovery" problem for a single user, they don't have any
   > positive side-effects for the rest of the internet.  

   Maybe this is too simplistic of an interpretation, but it seems your
   argument boils down to the fact that NetFind is basically a client of
   existing services, rather than a new service in its own right (like
   X.500).  

It may be just a matter of terminology; if NetFind were billed as just
a souped-up version of finger, then it could be evaluated in the
context of being basically a client of other services.  But with the
claims of it being a "Semantically Cognizant Internet White Pages
Directory Tool" with the ability to reach "1,147,000 users in 1,929
administrative domains", when it's mentioned in the same breath as
X.500 projects and as an alternative to them, something about it calls
for a more critical examination.

Just to qualify the numbers -- 1,147,000 reachable users means 1,929
reachable domains, each with an average of 119 hosts (a mean based on
a sample of 75 domains), with each of those hosts assumed to hold a
"conservative estimate" of 5 users (1,929 x 119 x 5 is roughly
1,148,000).  I don't see any breakdown of success rate by type of
domain; notably, the only success numbers I could find (80+% hit rate
by day, 70+% by night) don't attempt to measure success for the 40% of
the database that's not in the USA.  Perhaps there are a million
people out there; I'm not convinced of how many of them you can find.

The performance figures didn't correct for sample bias in the
observer; it would be expected that the author would look for people
in a field related to his own (computer science).  Since computer science
departments are often those in charge of running the name servers on
campus, the particular happy accident of the search algorithm relying
on SMTP lookups to the primary name servers may work overly well for
CS department searches.  It is less likely to work well for lookups on
people who are more peripheral to the campus network infrastructure.
An interesting exercise would be to run NetFind against the names of
10 senior librarians, 10 junior physics faculty members, 10
mathematics graduate students, and 10 undergraduate French majors,
suitably scattered about; I have some guesses as to how well your
results will turn out.

(In truth, none of the numbers tossed around in the paper are especially
convincing; it would have been appropriate to qualify estimated packet
counts and user counts with estimated error ranges.  It's not possible
for me to justify 1,147,000 users any more than 1,146,000 users; a
more plausible figure is "on the order of a million users". That's
especially true without a good rationale for picking 5 users per host,
a figure which appears out of the blue with absolutely no
references....)

I note that your paper shows (fig. 3) that usage of your NetFind
prototype tapers off to an average of one use every two weeks.  There
is no indication from the study of why usage dropped so sharply from
the original high average of 7 uses on the first day, or why it fell
so far below the estimated 10 searches per week (quoted from RFC
1107).  Given the expectation of relatively static communities of
interest and the ready availability of e-mail address information for
potential colleagues through alternative channels (business cards,
telephone calls, private mailing lists, netnews), it's not surprising
to me that the need for zero-prior-knowledge user lookups is lower
than 1/day.  But given that the usage trails off to almost nil after
200 days of use, it would seem to call into question the long-term
usability of your product.  Have you done any retrospective work on
determining why usage dropped to such low levels?

   ... I have found that if someone is "reachable" by NetFind, it is
   usually pretty easy to find them with NetFind.

That's hard to argue with.  But it doesn't yield any insight into what
makes people hard to locate, or how to design campus and corporate
information systems so that people can easily be found without
resorting to extraordinary sleuthing measures.  You casually write off
(in section 5, Related Work) the efforts of campuses to provide local
X.500 services which are accessible via finger; though it's not
directly germane to your research, it would have at least been useful
to point out that X.500 servers can be deployed within the existing
system to good effect.  Stick an X.500 system at yourdomain.org with
a big pile of user names in it, make it so finger@yourdomain.org does
the right thing, and for large institutions like UIUC, UMich, MIT you
have a larger problem solved than trying to chase pointers through a
domain hierarchy.  Granted, the information is somewhat more stale and
less likely to be exactly true; but I think it's arguable that
zero-knowledge searches are looking more for a pointer to information
than an exact match.  (e.g. finger vielmetti@umich.edu and you'll get
something, but you might have to chase it down a bit to find out from
a human that I've moved recently.)
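
To make that concrete, the finger front end could be as dumb as the
sketch below (the lookup table is a stand-in for a real X.500 or
campus database query, and binding the finger port requires
privileges):

        # Toy finger daemon answering queries from a site directory.
        import socketserver

        DIRECTORY = {
            # stand-in entries; a real deployment would query the
            # campus directory or X.500 here
            "vielmetti": "Edward Vielmetti <emv@msen.com>",
        }

        class FingerHandler(socketserver.StreamRequestHandler):
            def handle(self):
                # A finger query is a single line: the name, then CRLF.
                query = self.rfile.readline().decode("ascii", "replace")
                name = query.strip().lower()
                reply = DIRECTORY.get(name, "no match for " + name)
                self.wfile.write((reply + "\r\n").encode("ascii"))

        if __name__ == "__main__":
            # Port 79 is the finger port; binding it needs privileges.
            with socketserver.TCPServer(("", 79), FingerHandler) as srv:
                srv.serve_forever()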

   As an aside, searching for more general types of resources (like
   anonymous FTP files) is a harder problem, and the architecture I
   use for that project does utilize the results of previous users'
   searches in facilitating future users' searches.

Yes, I've read the paper; can't say that it compares with a service
like "archie", though, even if the software were available.  My
reactions to that paper can wait for another message. I'm not
impressed with the amount of effort you've spent on seeing how people
have really addressed the problem;  in particular, your success rates
for scanning the net for interesting information are skewed because
I'm doing it for you already....

   I think your objection (that a tool helps only one user at the time of
   use, without contributing to other users through that use) is really
   wrong.  If this were the standard against which all software was
   compared, we would get rid of most of the software in the world.

    - Mike Schwartz
      Dept. of Computer Science
      Univ. of Colorado - Boulder

I think my point is valid.  For me to want to let you accomplish a
particular task on the Internet (a shared, finite resource) you need
to justify to me that it's worth it to let you interpose your packets
in the way of my packets on the way to their destination.  I will be
unwilling to do this unless I'm generous or unless I can see some
benefit (or very low cost) from you doing so.  Remember that your use
of the net is generally going to make my use of the net marginally
slower, less convenient, and more risky, unlike e.g. your use of an
editor on your local system.  That's the story of negative
externalities and the "tragedy of the commons"; everyone does a little
thing that's convenient for them but which causes the playground to be
littered.  (e.g. the cutoff of nudie pictures on USA FTP sites causing
the saturation of the USA-Finland internet link, and the subsequent
barrage of traffic in alt.sex.pictures).  Provide me with something
useful, a scrap of code I can use or a good idea to work with, and
I'll let you go about your business.

NetFind does seem to pose certain risks to the rest of the net; you
could be very efficiently bombarding my slow links on a wild goose
chase trying to find someone somewhere else.  In truth, I'm sure that
the tradeoff is positive, and that I would be quite happy if just one
person somewhere used NetFind to find me.  A more salient risk is that
successful efforts like NetFind would lead people to believe that
generating queryable information collections a la X.500 is not
necessary in the long run and that we'd be content with ad hoc
solutions.

[
The paper I'm making references to is ftp'able as
	latour.colorado.edu:/pub/RD.Papers/White.Pages.ps.Z
]

-- 
Edward Vielmetti, vice president for research, MSEN Inc. 	emv@msen.com

"often those with the power to appoint will be on one side of a
controversial issue and find it convenient to use their opponent's
momentary stridency as a pretext to squelch them"

kline@ux1.cso.uiuc.edu (Charley Kline) (05/07/91)

emv@ox.com (Ed Vielmetti) writes:

> Stick an X.500 system at yourdomain.org with
> a big pile of user names in it, make it so finger@yourdomain.org does
> the right thing, and for large institutions like UIUC, UMich, MIT you
> have a larger problem solved than trying to chase pointers through a
> domain hierarchy.  Granted, the information is somewhat more stale and
> less likely to be exactly true; but I think it's arguable that
> zero-knowledge searches are looking more for a pointer to information
> than an exact match.  (e.g. finger vielmetti@umich.edu and you'll get
> something, but you might have to chase it down a bit to find out from
> a human that I've moved recently.)

Since you mentioned UIUC...

We're not an X.500 shop here, but we do have a campus-wide "white
pages" service which users can update themselves, and it has an
interface to sendmail as well as finger. You can find me with
"finger 'charley kline'@uiuc.edu", and you can send mail to people's
full names, as in  "mail stacy-forsythe@uiuc.edu". People who move
can change their own entries, so the information stays current for as
long as the person cares enough to maintain it.

I think the point is that organizational white-pages databases are
already in great supply. I wonder if the "Semantically Cognizant White
Pages Service" understands the semantics of the various ones in use. If
so, it would make the search for people that much less intensive.

________________________________________________________________________
Charley Kline, KB9FFK, PP-ASEL                          c-kline@uiuc.edu
University of Illinois Computing Services            Packet: kb9ffk@w9yh
1304 W. Springfield Ave, Urbana IL  61801                 (217) 333-3339

ddean@rain.andrew.cmu.edu (Drew Dean) (05/08/91)

	There are some interesting points here.  If the person you're trying
to find (a) has been on the net for a long time, (b) works for the military
or a military contractor, or (c) is the technical or administrative contact
for a domain, a whois query to nic.ddn.mil will usually get an answer.  Yet
even relatively well-known people such as Ed Vielmetti aren't in that
database.
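	The whois protocol itself is trivial, by the way: send a name, read
the reply until the server closes the connection.  A quick sketch in
Python (nic.ddn.mil may of course limit who can query it):

        # Minimal whois client: one query line on TCP port 43, reply
        # streamed back until the server closes the connection.
        import socket

        def whois(query, server="nic.ddn.mil"):
            with socket.create_connection((server, 43), timeout=10) as s:
                s.sendall((query + "\r\n").encode("ascii"))
                return s.makefile(encoding="ascii", errors="replace").read()

        print(whois("Vielmetti"))
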
	Stanford runs a whois server on stanford.edu that has a campus
database in it, so that's useful if you know a person is there; as Ed
points out, UMich, MIT, and UIUC offer similar services.  At CMU,
finger name@andrew.cmu.edu will do the same thing; if you know they're
in CS, finger name@cs.cmu.edu will also work.  However, most of the net
isn't set up like this, although I'd say it would probably be a good
thing.  If you know where a person is (and you're lucky :-)), a nice
note to postmaster is another reasonable approach.  If not, nslookup
and fingering main machines (i.e. not every workstation in a cluster,
just the fileservers & time-sharing machines) will usually work.  For
those who are SMTP literate, the VRFY command is also worth trying,
although certain SMTP servers don't support it.
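	For the SMTP-literate route, here is a sketch of the VRFY loop
(assuming the third-party dnspython package for the MX lookup; as
noted, plenty of servers refuse VRFY):

        # Try VRFY against a domain's mail exchangers, best preference
        # first.  Many SMTP servers refuse or disable VRFY.
        import smtplib
        import dns.resolver    # third-party "dnspython" package

        def vrfy_user(user, domain):
            mx_records = sorted(dns.resolver.resolve(domain, "MX"),
                                key=lambda r: r.preference)
            for mx in mx_records:
                try:
                    smtp = smtplib.SMTP(str(mx.exchange).rstrip("."),
                                        timeout=10)
                    code, reply = smtp.verify(user)
                    smtp.quit()
                except (OSError, smtplib.SMTPException):
                    continue
                if code // 100 == 2:    # 2xx: the server knows the user
                    return reply.decode("ascii", "replace")
            return None
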
	So if a person is in the NIC database, or you know where they are,
you can find them without too much work.  The big problem is when neither
of these cases applies.  Would someone like to donate a machine to run a
really big whois database?  Even so, you still have the aliasing problem;
the current whois database at nic.ddn.mil has 2 "Adams, Rick" entries, for
example.  (It gives email addresses and phone numbers for both, so if you
(think you) know where they are, it's easy to get the right one, but in
this networked age I might not know where they are -- if I can reach them
via the net, who cares?)  The NIC (& CMU) solution of 4-character
alphanumeric IDs seems a bit impersonal, at best, although I won't
complain because I don't have a better idea....:-)  This is the case that
NetFind may be good for; I haven't seen it, so I can't comment.  However,
if you don't have a good idea where to start, I don't see how it can avoid
traversing the country on (costly) backbones -- which is the problem if a
lot of people use it.  It seems we're no closer than when we started, but
with a machine generating the fingers and VRFYs rather than a person.
Sigh....

-- 
Drew Dean
Drew_Dean@rain.andrew.cmu.edu
[CMU provides my net connection; they don't necessarily agree with me.]

asp@UUNET.UU.NET (Andrew Partan) (05/12/91)

> From: csn!xcaret  (Xcaret Research)
> Subject: NetFind and its Internet load

> .... NetFind queries the Domain Name System (DNS) to locate authoritative name
> server hosts for each of these domains. ....
> .... Each of these machines
> is then queried using the Simple Mail Transfer Protocol ....
> .... located machines are then probed using the
> "finger" protocol ....

I assume that you have thought about domains that are not on the
Internet & only have MX records; about nameservers that do not run SMTP
servers; about hosts that do not run finger; about firewalls & gateways
that do not permit some protocols or hosts to be reached?

We run a nameserver for 1000+ domains that are not on the Internet;
said nameserver does not run SMTP or finger.  We are an MX forwarder for
about 900+ domains.  I don't want my mail servers hit with more load -
at times every cycle counts.

	--asp@uunet.uu.net (Andrew Partan)

schwartz@latour.colorado.edu (Mike Schwartz) (05/12/91)

In article <9105112005.AA04828@uunet.uu.net> asp@UUNET.UU.NET (Andrew Partan) writes:

> I assume that you have thought about domains that are not on the
> Internet (...)  I don't want my mail servers hit with more load - at
> times every cycle counts.

NetFind does not probe servers that are in different domains than the
institutions being searched.  So, if a site has mail forwarding through
uunet (for example), NetFind will tell the user the site isn't on the
Internet, and not probe that domain further.
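
The pruning rule is essentially a suffix check on the server names.  A
sketch (illustrative only; "example.com" is a hypothetical domain):

        # Only probe a server whose name lies within the domain being
        # searched; a site MXed through (say) uunet is left alone.
        def in_domain(host, domain):
            host = host.lower().rstrip(".")
            domain = domain.lower().rstrip(".")
            return host == domain or host.endswith("." + domain)

        assert in_domain("ns.colorado.edu", "colorado.edu")
        assert not in_domain("uunet.uu.net", "example.com")
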
 - Mike