[comp.ai.neural-nets] Pattern Recognition databases.....

bill@wayback.unm.edu (william horne) (07/26/90)

A few months ago somebody posted a list of databases for pattern recognition
available by ftp at Berkeley (I think).  There were quite a few, including
the IRIS databases and stuff about poisonous vs. nonpoisonous mushrooms,
etc.

Could somebody email me the appropriate ftp site and password?

Thanks,
Bill Horne
bill@wayback.unm.edu

reynolds@thalamus.bu.edu (John Reynolds) (07/26/90)

>>>>> On 25 Jul 90 23:22:18 GMT, bill@wayback.unm.edu (william horne) said:

In article <1990Jul25.232218.8203@ariel.unm.edu> bill@wayback.unm.edu (william horne) writes:

> A few months ago somebody posted a list of databases for pattern recognition
> available by ftp at Berkeley (I think).  There were quite a few, including
> the IRIS databases and stuff about poisonous vs. nonpoisonous mushrooms,
> etc.

> Could somebody email me the appropriate ftp site and password?

> Thanks,
>  Horne
> bill@wayback.unm.edu

===============================================================================
              This is the UCI Repository Of Machine Learning Databases
                                7 February 1990
            ics.uci.edu: /usr2/spool/ftp/pub/machine-learning-databases
                    Site Librarian: David W. Aha (aha@ics.uci.edu)
            47 databases (5884K plus 1 offline database of unknown size)
===============================================================================

Included in this directory are data sets that have been or can be used to
evaluate learning algorithms. Each data file (*.data) consists of
individual records described in terms of attribute-value pairs.  See the
corresponding *.names file for voluminous documentation.  (Some files
_generate_ databases; they do not have *.data files.)
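
As a rough illustration of the attribute-value layout described above,
here is a minimal Python sketch.  It assumes (the listing does not say so)
that each record is one comma-separated line and that the attribute names
are copied by hand from the corresponding *.names file.  The names and the
two sample records below are made up for illustration; they do not come
from any file in the repository.

```python
import csv
import io

def read_records(lines, attribute_names):
    """Turn comma-separated lines into a list of attribute-value dicts.

    Assumption: one record per line, fields in the same order as
    attribute_names.  Blank lines are skipped.
    """
    records = []
    for row in csv.reader(lines):
        if not row:
            continue
        if len(row) != len(attribute_names):
            raise ValueError(
                f"expected {len(attribute_names)} fields, got {len(row)}")
        values = (value.strip() for value in row)
        records.append(dict(zip(attribute_names, values)))
    return records

# Hypothetical attribute names and records, for illustration only.
NAMES = ["length", "width", "class"]
SAMPLE = io.StringIO("1.0,2.0,a\n3.5,0.5,b\n\n")
records = read_records(SAMPLE, NAMES)
print(records[0]["class"])
```

All values are kept as strings here; converting numeric attributes is left
to the caller, since the types differ from database to database.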

The contents of this repository can be remotely copied to other network
sites via ftp.  Both the userid and password are "anonymous".  As of
today, I've uncompressed the data files.  However, they are usually in a
compressed state: use the "binary" command to ftp in order to tell it that
the file being transferred has been compressed.  Otherwise, ftp will
assume that it is an ASCII file and will not transfer it properly.
Compressed files, whose filenames end in ".Z", can be uncompressed using
the "uncompress" and "uncompressdir" commands.
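
The transfer described above can also be scripted.  The following is a
minimal sketch using Python's standard ftplib, not a tested recipe for
this particular site.  The host, userid, and password are taken from the
text; the directory is an assumption about how the path appears to an
anonymous login, and the function and argument names are only illustrative.

```python
from ftplib import FTP

HOST = "ics.uci.edu"                          # from the listing above
DIRECTORY = "pub/machine-learning-databases"  # assumed anonymous-ftp path

def fetch_binary(ftp, remote_name):
    """Retrieve one file in binary (image) mode and return its bytes.

    retrbinary switches the transfer to binary, which is what the
    "binary" command does in an interactive ftp session.
    """
    chunks = []
    ftp.retrbinary("RETR " + remote_name, chunks.append)
    return b"".join(chunks)

def fetch_from_repository(remote_name, local_path):
    """Log in anonymously, fetch remote_name, and save it to local_path."""
    ftp = FTP(HOST)
    try:
        ftp.login("anonymous", "anonymous")   # userid and password per the text
        ftp.cwd(DIRECTORY)
        data = fetch_binary(ftp, remote_name)
    finally:
        ftp.close()
    with open(local_path, "wb") as out:
        out.write(data)
```

Nothing here touches the network until fetch_from_repository is called.
A file saved this way with a ".Z" suffix still has to be run through
"uncompress" afterwards.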

Notes:
 1. We're always looking for additional databases.  Please send yours, with
    documentation.  Thanks.  Current documentation requirements are located
    in file DOC-REQUIREMENTS. Complaints and suggestions for improvements 
    are welcome anytime.

 2. There is also the "undocumented" sub-directory which contains six
    databases that require attention before being incorporated into the
    repository.  You are welcome to access them.

 3. Ivan Bratko has asked me to restrict the access on the databases he
    donated from the Ljubljana Oncology Institute.  These databases, under
    the breast-cancer, lymphography, and primary-tumor directories, are
    unreadable to you.  However, we are allowed to share them with academic
    institutions upon request.  If used, these databases (like several
    others) require that proper citations be made in published articles
    that use them.  The citation requirements can be found in each database's
    corresponding documentation file.

 4. Finally, I'm maintaining a list of CORRESPONDENTS and TRANSACTIONS.
    Perhaps someone on your site is listed among the CORRESPONDENTS and
    can provide you with some of these databases and related information.
    (I have corresponded with over 75 people so far concerning these 
    databases.)  TRANSACTIONS is a log of my correspondence with others,
    which should enlighten you as to what problems we're having, etc.

David W. Aha
Repository Librarian
     
----------------------------------------------------------------------
Brief Overview of Databases:

Quick Listing:
 1. annealing
 2. audiology
 3. autos
 4. breast-cancer (restricted access)
 5-6. chess-end-games
 7. cpu-performance
 8. echocardiogram
 9. glass
 10. hayes-roth
 11-14. heart-disease
 15. hepatitis
 16. iris
 17. labor-negotiations
 18-19. led-display-creator
 20. lymphography (restricted access)
 21. mushroom
 22. primary-tumor (restricted access)
 23. shuttle-landing-control
 24-25. soybean
 26. spectrometer
 27-34. thyroid-disease
 35. university
 36. voting-records
 37-38. waveform domain
 39-46. Undocumented databases: sub-directory undocumented
   1. Bradshaw's flare data
   2. Pat Langley's data generator
   3. David Lewis's information retrieval (IR) data collection (offline)
   4. Mike Pazzani's economic sanctions database
   5. Ross Quinlan's latest version of the thyroid database
   6. Philippe Collard's database on cloud cover images
   7. Mary McLeish & Matt Cecile's database on horse colic
   8. Paul O'Rorke's database containing theorems from Principia Mathematica
 47. Nine small EBL domain theories and examples in sub-directory ebl

Quick Summaries of Each Database:
1. Annealing data (unknown source)
   -- Documentation: On everything except database statistics
   -- Background information on this database: unknown
   -- Many missing attribute values

2. Audiology data (Baylor College)
   -- Documentation: On everything except database statistics
   -- Non-standardized attributes (differs between instances)
   -- All attributes are nominally-valued

3. Automobile data (1985 Ward's Automotive Yearbook)
   -- Documentation: On everything except statistics and class distribution
   -- Good mix of numeric and nominal-valued attributes
   -- More than 1 attribute can be used as a class attribute in this database

4. Breast cancer database (Ljubljana Oncology Institute)
   -- Documentation: On everything except database statistics
   -- Well-used database
   -- 286 instances, 2 classes, 9 attributes + the class attribute

5-6. Chess endgames data creator 
     1. king-rook-vs-king-knight
        -- Documentation: limited (nothing on class distribution, statistics)
        -- This concerns king-knight versus king-rook end games
        -- The database creator is coded in Common Lisp
     2. king-rook-vs-king-pawn
        -- Documentation: sufficient
        -- This concerns king-rook versus king-pawn end games
        -- Originally described by Alen Shapiro 

7. Computer hardware described in terms of its cycle time, memory size, etc.
   and classified in terms of its relative performance capabilities (CACM
   4/87)   
   -- Documentation: complete
   -- Contains integer-valued concept labels
   -- All attributes are integer-valued

8. Echocardiogram database (Reed Institute, Miami)
   -- Documentation: sufficient
   -- 13 numeric-valued attributes
   -- Binary classification: patient either alive or dead after survival period

9. Glass Identification database (USA Forensic Science Service)
    -- Documentation: completed
    -- 6 types of glass 
    -- Defined in terms of their oxide content (e.g. Na, Fe, K)
    -- All attributes are numeric-valued 

10. Hayes-Roth and Hayes-Roth's database
    -- Described in their 1977 paper
    -- Topic: human subjects study

11-14. Heart Disease databases (Sources listed below)
      -- Documentation: extensive, but statistics and missing attribute
         information not yet furnished (perhaps later)
      -- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
      -- 13 of the 75 attributes were used for prediction in 2 separate 
         tests, each of which achieved approximately 75%-80% classification
         accuracy
      -- The chosen 13 attributes are all continuously valued

15. Hepatitis database (G.Gong: CMU)
    -- Documentation: incomplete
    -- 155 instances with 20 attributes each; 2 classes
    -- Mostly Boolean or numeric-valued attribute types
   
16. Iris Plant database (Fisher, 1936)
   -- Documentation: complete
   -- 3 classes, 4 numeric attributes, 150 instances 
   -- 1 class is linearly separable from the other 2, but the other 2 are
      not linearly separable from each other (simple database)

17. Labor relations database (Collective Bargaining Review)
    -- Documentation: no statistics
    -- Please see the labor directory for more information

18-19. LED display domains (Classification and Regression Trees book)
    -- Documentation: sufficient, but missing statistical information
    -- All attributes are Boolean-valued
    -- Two versions: 7 and 24 attributes
    -- Optimal Bayes rate known for the 10% probability of noise problem
    -- Several ML researchers have used this domain for testing noise tolerance
    -- We provide here 2 C programs for generating sample databases
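
To give a feel for what such a generator does, here is a minimal sketch
in Python (not the repository's C programs).  It assumes the usual
seven-segment digit patterns with segments ordered a to g, that each
segment is inverted independently with the given noise probability, and
that the 24-attribute version pads the 7 segments with 17 irrelevant
random bits.  None of these details is confirmed by the listing, and the
function name and parameters are illustrative.

```python
import random

# Seven-segment patterns, segments in order a..g (a = top, g = middle).
# The ordering is an assumption; the C programs may order them differently.
SEGMENTS = {
    0: "1111110", 1: "0110000", 2: "1101101", 3: "1111001", 4: "0110011",
    5: "1011011", 6: "1011111", 7: "1110000", 8: "1111111", 9: "1111011",
}

def led_instance(rng, noise=0.10, extra_attributes=0):
    """Return (attributes, digit) for one noisy LED display instance."""
    digit = rng.randrange(10)
    bits = [int(c) for c in SEGMENTS[digit]]
    # Invert each segment independently with probability `noise`.
    noisy = [bit ^ 1 if rng.random() < noise else bit for bit in bits]
    # Assumed: extra attributes are random bits unrelated to the digit.
    noisy += [rng.randrange(2) for _ in range(extra_attributes)]
    return noisy, digit

rng = random.Random(1990)
small, digit = led_instance(rng)                      # 7 attributes
large, _ = led_instance(rng, extra_attributes=17)     # 24 attributes
print(digit, small, len(large))
```

Passing a seeded random.Random keeps a generated sample reproducible,
which helps when comparing learning algorithms on the same data.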

20. Lymphography database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 148 instances; 19 attributes; 4 classes; no missing data values

21. Mushrooms in terms of their physical characteristics and classified
    as poisonous or edible (Audubon Society Field Guide)
    -- Documentation: complete, but missing statistical information
    -- All attributes are nominal-valued
    -- Large database: 8124 instances (2480 missing values for attribute #12)

22. Primary Tumor database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 339 instances; 18 attributes; 22 classes; lots of missing data values

23. Shuttle Landing Control database
    -- tiny, 15-instance database with 7 attributes per instance; 2 classes
    -- appears to be well-known in the decision-tree community

24-25. Soybean data (Michalski)
   -- Documentation: Only the statistics are missing
   -- (2 sizes)
   -- Michalski's famous soybean disease databases

26. Low resolution spectrometer data (IRAS data -- NASA Ames Research Center)
    -- Documentation: no statistics or class distribution given
    -- LARGE database...and this is only 531 of the instances
    -- 98 attributes per instance (all numeric)
    -- Contact NASA-Ames Research Center for more information

27-34. Thyroid patient records classified into disjoint disease classes 
       (Garavan Institute)
       -- Documentation: as given by Ross Quinlan
       -- 6 databases from the Garavan Institute in Sydney, Australia
       -- Approximately the following for each database:
          -- 2800 training (data) instances and 972 test instances
          -- plenty of missing data
          -- 29 or so attributes, either Boolean or continuously-valued
       -- 2 additional databases, also from Ross Quinlan, are also here
          -- hypothyroid.data and sick-euthyroid.data
          -- Quinlan believes that these databases have been corrupted
          -- Their format is highly similar to the other databases

35. University data (Lebowitz)
    -- Documentation: scant; we've left it in its original (LISP-readable) form
    -- 285 instances, including some duplicates
    -- At least one attribute, academic-emphasis, can have multiple values
       per instance
    -- The user is encouraged to pursue the Lebowitz reference for more 
       information on the database

36. Congressional voting records classified into Republican or Democrat (1984
    United States Congressional Voting Records)
    -- Documentation: completed
    -- All attributes are Boolean valued; plenty of missing values; 2 classes
    -- Also, there is a 2nd, undocumented database containing 1986 voting 
       records here. (will be)

37-38. Waveform data generator (Classification and Regression Trees book)
       -- Documentation: no statistics
       -- CART book's waveform domains
       -- 21 and 40 continuous attributes respectively
       -- difficult concepts to learn, but known Bayes optimal classification
          rate of 86% accuracy
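
For readers without the CART book at hand, here is a minimal Python
sketch of the waveform problem as it is commonly described, not the
repository's own generator.  The assumptions: three triangular base
waves over positions 1 to 21, each class a random convex combination of
two of them plus unit-variance Gaussian noise, and a 40-attribute version
that appends 19 pure-noise attributes.  Check these against the book or
the generator before relying on them; all names are illustrative.

```python
import random

def h1(i):
    """Triangular base wave peaking at position 11 (assumed definition)."""
    return max(6 - abs(i - 11), 0)

def base_waves():
    """Return the three base waves sampled at positions 1..21."""
    positions = range(1, 22)
    return ([h1(i) for i in positions],        # peak at 11
            [h1(i - 4) for i in positions],    # peak at 15
            [h1(i + 4) for i in positions])    # peak at 7

# Which two base waves are mixed for each class label.
PAIRS = {1: (0, 1), 2: (0, 2), 3: (1, 2)}

def waveform_instance(rng, noise_attributes=0):
    """Return (attributes, label) for one waveform instance."""
    label = rng.randint(1, 3)
    waves = base_waves()
    first, second = (waves[k] for k in PAIRS[label])
    u = rng.random()  # mixing weight, uniform on [0, 1)
    x = [u * p + (1 - u) * q + rng.gauss(0, 1)
         for p, q in zip(first, second)]
    # Assumed: the larger version appends attributes that are noise only.
    x += [rng.gauss(0, 1) for _ in range(noise_attributes)]
    return x, label

rng = random.Random(86)
x21, label = waveform_instance(rng)
x40, _ = waveform_instance(rng, noise_attributes=19)
print(label, len(x21), len(x40))
```

Because every class shares a base wave with each of the other two, the
classes overlap heavily, which fits the note above that these concepts
are difficult to learn.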

39-46. Undocumented databases: see the sub-directory named undocumented
   1. Bradshaw's flare data
   2. Pat Langley's data generator
   3. David Lewis's information retrieval (IR) data collection (offline)
   4. Mike Pazzani's economic sanctions database
   5. Ross Quinlan's latest version of the thyroid database
   6. Philippe Collard's database on cloud cover images
   7. Mary McLeish & Matt Cecile's database on horse colic
   8. Paul O'Rorke's database containing theorems from Principia Mathematica

47. Nine simple small EBL domain theories and examples in sub-directory ebl
   1. cup
   2. deductive.assumable (contains three domain theories)
   3. emotion
   4. ice
   5. pople
   6. safe-to-stack
   7. suicide