[comp.lang.perl] Novice Questions

inc@tc.fluke.COM (Gary Benson) (02/05/90)

Hello -- I am a novice not just to PERL, but to programming itself.
I work with PERL scripts written by someone else, and fortunately he
did a nice job of structuring and commenting his work, so I have been
(mostly) able to make any little modifications that have been needed.

However, I have a new job facing me that requires me to write a PERL script
totally from scratch, and I am finding that my ignorance of programming
concepts in general is hindering my understanding of the on-line PERL
manual. I hope it is not inappropriate to pose very rudimentary questions
here, but if it is, I'll refrain and look somewhere else for the information
I need. Following is a short list of questions I have right now. They are in
order of the urgency with which I think I should learn them.

1. How would I make the first n characters on a line into a variable?
   What if I needed to make the characters at positions 8-10 a variable?

2. Can PERL create two output files, one containing the "massaged"
   information from an input file, the other containing data collected while
   the first file was being created? Is it necessary to do that? It seems
   that the same thing could be accomplished in an array whose contents are
   written at the end of the "massaged data" file. Is that true? How would
   either of these things be done?

3. Could some kind soul give me short, usable, un-esoteric definitions of
   the following terms? Each of these appears somewhere in the PERL manual:

   data type
   scalars
   arrays of scalars
   associative arrays of scalars
   array element
   struct tm
   literals, pseudo-literals
   filehandle

Thanks in advance for any help. This newsgroup as a whole seems rather
advanced, so a mailed answer may be more appropriate than a posting. And if
my posting is inappropriate, please just say so and I'll refrain.



-- 
Gary Benson   -=[S M I L E R]=-   inc@fluke.tc.com

inc@tc.fluke.COM (Gary Benson) (02/06/90)

Last week I posted a request for the answers to a number of novice questions,
and asked that people reply by email rather than posting, as the level of my
questions was probably too low to be of general interest.

Well, I have been getting mail! Someone once said that the good thing about
unix is that you can ask 10,000 people the same question; and the bad thing
is that you then get 10,000 answers!

THANKS to everyone who has responded to my posting. One person recommended
that I synopsize/edit the responses and post that for the benefit of other
novices. This seemed like a good idea, so I will work on that and post it
in the next day or so.

Again, I appreciate the many detailed and clear answers I received, and it
even helped to be told "I don't understand this question", because it showed
me the many ways it is possible to be imprecise when speaking of things as
complex as PERL.

One person wrote that PERL was "created by Super Macho Hackers to keep their
Maniac Power Urges under control". I don't know about that, but I am certain
that this person's next statement is right on the mark: PERL is *not* a
trivial language.





-- 
Gary Benson   -=[S M I L E R]=-   inc@fluke.tc.com

inc@tc.fluke.COM (Gary Benson) (02/09/90)

This posting synopsises the many responses to my recent request for answers
to novice questions. I had intended to do this for my own use anyway, so I
hope that other beginners will benefit from these explanations. In every
case, I restate my original question (adding clarification where needed),
then put together an answer composed of all the best parts of the answers.
I have not simply listed everyone's answers; I agree with the person who
suggested I post this when he said that there is not much value in just
listing all the responses.


>1. How would I make the first n characters on a line into a variable?
>   What if I needed to make the characters at positions 8-10 a variable?

Several people pointed out that this is too ambiguous to be answered without
more information, so first I'll clarify the question.

I need a program that manipulates and extracts data from an input file made
up of very structured information, a database in fact. For example, one
"record" looks like this:

L52ALNA001  AS3514PM5712          3 DIGITAL MULTIMETER 000100001EA          01A
L52ALNA001  H00029446 057 12001   3                                         03A
L52ALNA001  1                               A9 B          403110040100      01C
L52ALNA001  A001                                                            01D

The first field in each line is always "L52ALNA" followed by a three digit
sequence code. One thing the program must do is to resequence these numbers.
To construct the new number, I thought it would be handy to have the
unvarying part called by a name (make L52ALNA into a variable), and to make
the varying part another variable. I see now that I do not need to do
anything with the unvarying part, because the "digits of interest" are the
three numbers in columns 7, 8, and 9 (counting the first column as "0"). I
only need to be concerned with the part of the number that changes.

To answer the question, many people made the assumption that this is what I
was getting at, and explained thus:

    Assuming that the line is in $_ (which is true if you use the "while
    (<>)" construct), then you can use the "substr" function to extract
    characters from it. For example, substr($_, 0, 7) returns a string with
    the first 7 characters of the line (if you've modified $[, the index
    base, you may need to change the second argument to $[). Assuming you
    count "positions" starting at one, and you have left $[ at zero, then
    extracting the characters at positions 8-10 is substr($_, 7, 3) (it
    might be more intuitive if you set $[ to one; then the correct
    expression is substr($_, 8, 3)).
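
Here is a minimal sketch of the substr approach applied to one line of my
sample record. It assumes $[ has been left at zero; the variable names
$prefix and $seq are my own, chosen only for illustration.

```perl
# One line of the sample record shown earlier (right-hand fields omitted).
$_ = "L52ALNA001  AS3514PM5712";

$prefix = substr($_, 0, 7);	# the first 7 characters
$seq    = substr($_, 7, 3);	# the characters at positions 8-10

print "prefix is '$prefix', sequence code is '$seq'\n";
```

Run on that line, this prints: prefix is 'L52ALNA', sequence code is '001'.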

This was the most frequently mentioned method, and apparently is the
"generic" solution. One person took a slightly different approach, one that
I think is perhaps simpler, but may not be as useful in terms of using
the information later in the program:

    To match characters at the beginning of a line, try:

	      /^(.{n})/	     or	       /^(....)/

    Put however many dots you need to match n characters. In each case the
    characters matched get assigned to a variable called $1.

    To match characters as in the second part of the question, use:

	      /^(.{7})(...)/
    
    This assigns the first 7 characters to $1 and the 8th through 10th
    characters to $2.
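
A minimal sketch of the pattern-matching approach on the same kind of line
follows. Note that the {7} has to sit inside the parentheses, as in
/^(.{7})(...)/, for $1 to hold all seven characters; the names $line,
$prefix, and $seq are only for illustration.

```perl
# One line of the sample record shown earlier (right-hand fields omitted).
$line = "L52ALNA001  AS3514PM5712";

if ($line =~ /^(.{7})(...)/) {
    $prefix = $1;	# all of the first 7 characters
    $seq    = $2;	# the 8th through 10th characters
    print "prefix is '$prefix', sequence code is '$seq'\n";
}
```

Copying $1 and $2 into named variables right away keeps the values available
after a later pattern match overwrites them.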

> 2. Can PERL create two output files, one containing the "massaged"
>    information from an input file, the other containing data collected while
>    the first file was being created? Is it necessary to do that? It seems
>    that the same thing could be accomplished in an array whose contents are
>    written at the end of the "massaged data" file. Is that true? How would
>    either of these things be done?

Again, my question was phrased in an ambiguous way. To me, "massaging" data
meant that I would locate a piece of information in an input file (for
example the 3 numbers in columns 7, 8, and 9), and when I wrote the output
file, I would correct the sequencing of these numbers so that they began
001, 002, 003, and so on, but so that each line of the record began with the
same numerical identifier. In other words, every line of record 001 carries
the record sequence number.
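
The resequencing itself might be sketched like this. It is only a sketch
under some assumptions: a record is taken to start at any line ending in
01A, the sample lines are made-up stand-ins that keep just the columns that
matter, and the lines are held in an array rather than read from a file so
the example is self-contained.

```perl
# Two made-up records of two lines each, with out-of-order sequence codes.
@lines = (
    "L52ALNA007  first line of a record        01A",
    "L52ALNA007  second line of that record    03A",
    "L52ALNA012  first line of next record     01A",
    "L52ALNA012  second line of that record    01C",
);

$seq = 0;
@out = ();
foreach $line (@lines) {
    # A line ending in 01A starts a new record, so bump the counter.
    $seq++ if $line =~ /01A$/;

    # Work on a copy, overwriting columns 7-9 with the padded number.
    $new = $line;
    substr($new, 7, 3) = sprintf("%03d", $seq);
    push(@out, $new);
}

print join("\n", @out), "\n";
```

Every line of the first record comes out carrying 001 and every line of the
second carries 002. If the input did not begin with a 01A line, the leading
lines would be numbered 000, which a real program would want to check for.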

The beginning of each record is identified by a line having the value 01A at
the end of the line (columns 78, 79, and 80). I wondered if I could use that
to identify where each record starts and increment the sequence number. My
meaning was that I wanted to create two output files: one was to look just
like the input file, but with some changes to the information, and the other
was to contain statistics about the first.

For example, say I needed to count how many of the records had a certain
value in the "RECORD TYPE" field. OUT_FILE_1 would have all the records,
while OUT_FILE_2 would contain the total number of each record type.

When I asked if that was necessary, I meant to be asking if writing two
files was the only way to accomplish this. Instead, could I keep the "RECORD
TYPE" totals in an array and write it into a single output file?
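
The single-file alternative might look like the sketch below. Since I have
not said where the "RECORD TYPE" field lives, the code simply treats the
last three columns as a stand-in for it; that choice, the made-up sample
lines, and the TOTAL label are all assumptions for illustration.

```perl
# Made-up sample lines; only the last three columns matter here.
@lines = (
    "L52ALNA001  first line of a record        01A",
    "L52ALNA001  second line of that record    03A",
    "L52ALNA002  first line of next record     01A",
);

foreach $line (@lines) {
    print $line, "\n";		# the "massaged" data would be written here

    # Stand-in for RECORD TYPE: the last three characters of the line.
    $type = substr($line, length($line) - 3, 3);
    $count{$type}++;		# keep a running total per type
}

# After the last record, append the totals to the same output.
foreach $type (sort keys %count) {
    print "TOTAL $type: $count{$type}\n";
}
```

The totals live in an associative array while the records are being
written, so a second output file is needed only if the statistics have to
be kept apart from the data.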

Perhaps the clearest answer to this question was this one:

    This program opens one input file and two output files. It prints to
    "outfile1" the input number incremented by 1. It prints to "outfile2"
    the letter in variable "$ch" which is incremented each time.

    Note: you can increment character strings, but as the manual says, if
    you ever use the variable for anything other than a character string, it
    won't work. I tried putting characters in the input file, but since
    sometimes there are numbers, it didn't work. I didn't try any other
    examples. The related portion of the manual is shown below:

    The autoincrement operator has a little extra built-in magic to it. If
    you increment a variable that is numeric, or that has ever been used in
    a numeric context, you get a normal increment. If, however, the variable
    has only been used in string contexts since it was set, and has a value
    that is not null and matches the pattern /^[a-zA-Z]*[0-9]*$/, the
    increment is done as a string, preserving each character within its
    range, with carry:

	  print	++($foo	= '99');   # prints '100'
	  print	++($foo	= 'a0');   # prints 'a1'
	  print	++($foo	= 'Az');   # prints 'Ba'
	  print	++($foo	= 'zz');   # prints 'aaa'

    The autodecrement is not magical.

    Now, you can read from and write to the same file if you want to open
    and close the file. Also, you can open a file as read/write, but look
    under "open" in the manual to see if it will do what you expect it to
    do.


#### PROGRAM ################################################################

#! /usr/public/perl
$debug = 0;

$FALSE = 0;
$TRUE = 1;


# open files
open(IN, "<infile") || die "Can't open infile: $!";
open(OUT1, ">outfile1") || die "Can't open outfile1: $!";
open(OUT2, ">outfile2") || die "Can't open outfile2: $!";

$ch = "a";

# read data while file named "infile" is not empty
while(<IN>)
{
    $inval = $_;

    # increment data from input file and print to output file  named "outfile1"
    printf OUT1 "%d\n",++$inval;

    # increment character and print to "outfile2"
    printf OUT2 "%s\n",++$ch;
}

# close files
close(IN);
close(OUT1);
close(OUT2);

#### infile ##################################################################
1
2
3
4

#### outfile1 ################################################################
2
3
4
5

#### outfile2 ################################################################
b
c
d
e
##############################################################################

    Here is another program that creates 2 output files. This program
    determines if the first character in the line is a letter or not and
    sends those lines beginning with letters to one file, and those that do
    not begin with a letter to another. This is exactly the kind of thing I
    was asking about.

### PROGRAM SPLIT_ALPHA ####################################################

    #!/usr/local/bin/perl

    open(alpha,">alphabetic") || die "Can't open alphabetic";
    open(nonalpha,">nonalpha") || die "Can't open nonalpha";
    while(<>) {
    	if (/^[a-zA-Z]/) {
    		print alpha $_;
    	} else {
    		print nonalpha $_;
    	}
    }
    close(alpha);
    close(nonalpha);
###############################################################################


My third question asked for definitions of terms that I had come across in
reading the PERL manual, but which none of my normal reference books defined
adequately.

> 3. Could some kind soul give me short, usable, un-esoteric definitions of
>    the following terms? Each of these appears somewhere in the PERL manual:

>   data type

    The type of data!

    Data is stored and interpreted by a language in different ways. In most
    programming languages, there are different kinds of variables for
    different kinds of data, just like you use 3x5 index cards for some
    things (which can be quite varied) and income tax forms for others
    (which are pretty specific - they're a bit stiff for toilet paper, but
    make good bird-cage liner :-}). For example, in C, you have different
    kinds of variables for characters, integers, floating-point values and
    more complicated combinations of these.

    PERL really only supports two data types: strings and numbers. Strings
    contain ASCII characters, and have an associated length. Numbers are
    numeric quantities (either floating point or integers), have a value,
    and may participate in numeric operations such as addition or
    multiplication. In PERL, conversion between strings and numbers is
    trivial and automatic; if you use a number in a context that requires a
    string, the number is converted to a string that represents that numeric
    value. Similarly, if a string is used in a context which requires a
    number, PERL will attempt to convert the string to a numeric value by
    looking for ASCII digits which might represent a numeric value. (If no
    digits are found, the conversion results in the numeric value 0).
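
A minimal sketch of these automatic conversions follows; the variable names
and values are made up for illustration.

```perl
$sum   = "42" + 1;		# a string used as a number
$count = 7;
$text  = $count . " dwarfs";	# a number used as a string
$mixed = "3 apples" + 2;	# only the leading digits are used
$none  = "apples" + 0;		# no digits at all, so the value is 0

print "$sum, $text, $mixed, $none\n";
```

This prints: 43, 7 dwarfs, 5, 0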

    PERL distinguishes the following ways of organizing data: as scalars,
    arrays of scalars, and associative arrays of scalars.

>  scalars

    A scalar is a simple data type, as opposed to a compound data type. A
    scalar holds a single value: either one number or one string (a string
    can be thought of as a one-dimensional array of characters). The
    term is generally used to refer to the basic types a language supports.
    All of the types mentioned for C are scalars. PERL just has one type
    "scalar", which covers both numbers and strings. Given a scalar
    variable, you can stick anything that looks like a number or a string
    into it.

>  arrays of scalars

    An array is a set of data, together with an index, usually written
    something like a[5]=20; In this case the array is called 'a', the index
    is 5, and the value of the array element is 20. You might think of
    houses, with their occupant. The street (array) might be called
    washington_row, and the houses (1, 2, 3) contain "Mr Smith", "Mr Jones",
    and "Mr Smith". (Culture gap - In England houses are numbered
    sequentially, in the US I understand they are numbered in yards from
    some fixed point or other, try and think english for this example). Note
    that each house has a unique address (index), but there is nothing to
    stop different houses having the same contents.

    As a way of organizing data, an array is just a numbered
    bunch of scalars, like a numbered list. $array[0] is one, $array[1] is
    another... as many as you like.

    For a simple example of the use of an associative array, you can record a
    class's marks by simply doing:

	 $mark{'John'} = 83;
	 $mark{'Michael'} = 79;
	 $mark{'Holly'} = 84;
	 etc.

    Here is a way to use an ordinary array in a similar situation.
    Suppose you're collecting statistics about a class, with an input file
    having lines like the following:

	 100 Johnny B Good
	 12 J. Random Loser

    You could gather statistics about the distribution of marks in a class
    with a program like the following:

	 while (<>) {
	     ($mark) = split;	# Break line into bits; $mark gets the first bit
	     $array[$mark]++;	# Increment the corresponding array entry
	 }
	 for ($i = 0; $i <= 100; $i++) {
	     print "$array[$i] students received mark $i.\n";
	 }

    Arrays start at array[0] and work upwards for as many entries as they
    have. For example, if you set array[1000] to something, then PERL will
    clear out array[0] through array[999] in anticipation of your needing
    them. Actually, arrays start at the index held in the variable $[, which is initially 0,
    but can be set to, say, 1 if you prefer.
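
A minimal sketch of an array growing on demand, assuming $[ has been left at
zero; the array name @list is made up for illustration.

```perl
$list[0] = 'first';
$list[5] = 'sixth';	# elements 1 through 4 now exist, but are empty

$last  = $#list;	# the index of the last element
$count = $#list + 1;	# so this is the number of elements

print "last index is $last, so the array holds $count elements\n";
```

This prints: last index is 5, so the array holds 6 elements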

>  associative arrays of scalars

    An array which is indexed by "association" with a string rather than by
    simple sequential numbers. A simple example might be to associate
    "name" with "gary" in an array called personal_data. Now when
    $personal_data{'name'} is referenced, "gary" is the value
    returned.

    An associative array is like a normal array except you don't have to use
    consecutive numbers. This makes things a lot slower, as a practical
    matter, but it's really handy when you need it.

    In the previous example about the houses on an English street making up
    an array, the houses might have names "Oaklands", "DunRoaming",
    "Fairview" etc., which may often be more useful than using numbers to
    identify them. Or consider that you want to build a file that lets you go
    from employee name to employee number. To look for Fred's number, you
    could start at one, see if the name of employee number 1 is Fred, if so
    print 1 and stop, else see if employee number 2's name is Fred, if so
    print 2, and so on. What would be nicer would be to say print
    $employee_number{'Fred'}; and have the answer come straight out. Few
    languages allow you to do this, but PERL, Awk, (and perhaps Snobol) do.
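
The two ways of finding Fred's number can be sketched side by side. The
names and numbers here are hypothetical, as is the array @name that maps
employee numbers to names.

```perl
# Hypothetical names; the array index is the employee number (slot 0 unused).
@name = ('', 'Anne', 'Fred', 'Mary');

# The slow way: try each employee number in turn until the name matches.
for ($n = 1; $n <= $#name; $n++) {
    if ($name[$n] eq 'Fred') {
        $found = $n;
        last;
    }
}

# The nicer way: build the associative array once, then just ask.
for ($n = 1; $n <= $#name; $n++) {
    $employee_number{$name[$n]} = $n;
}

print "search found $found, lookup gives $employee_number{'Fred'}\n";
```

Both routes give 2 here. The difference shows up when there are many
lookups: the search repeats its loop every time, while the associative
array is built once.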

    If you need to keep a running count of something happening in your
    program, you will probably end up using an associative array. For
    example, if you are trying to keep track of how many 'foo', 'bar', and
    'bletch' occurrences are in a file, you can do something like:

	 while (<>) {
	 	next unless /(foo|bar|bletch)/;
	 	$found = $1; # $found is now 'foo', 'bar', or 'bletch'
	 	$count{$found}++; # increment $count{'foo'}, or whatever
	 }
	 
	 for $key (sort keys %count) {
	 	print "$key was found $count{$key} times\n";
	 }


    See the keys(), values(), and each() functions for more things you can
    do with associative arrays.
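
As a minimal sketch of those three functions, here they are applied to the
class marks from the earlier example. The totalling and "best mark" logic
is my own illustration, not something from the manual.

```perl
%mark = ('John', 83, 'Michael', 79, 'Holly', 84);

@names = sort keys %mark;	# every name, sorted alphabetically

$total = 0;
foreach $score (values %mark) {	# every mark, in no particular order
    $total += $score;
}

$highest = 0;
while (($name, $score) = each %mark) {	# one name/mark pair per pass
    if ($score > $highest) {
        $highest = $score;
        $best = $name;
    }
}

print "names: @names\n";
print "total: $total, best: $best ($highest)\n";
```

The order in which keys, values, and each hand back their items is not the
order in which they were stored, which is why the names are sorted before
printing.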

>  array element

    An array is a means of organizing data such that an "index" allows
    access to each individual data item. A simple array of 4 items
    (elements) called blue consists of four data storage locations blue[0],
    blue[1], blue[2], and blue[3]. The number that distinguishes between
    the elements is the index.

    For another example, say that @foo is an array, or %bar is an
    associative array, then $foo[0], $foo[1], $foo[2], etc. are its
    elements. Each has a value, e.g. $foo[0] = 'John'; $foo[1] = 'Michael';
    $foo[2] = 'Holly'; which are also referred to as the array's elements.
    The same holds for $bar{'John'}, etc.

>  struct tm

    A structure, used in Unix to hold 'time'. A struct statement defines a
    data structure. A data structure is a way of grouping a set of related
    data under a common name. PERL does not support data structures.

    However, in the discussions of gmtime EXPR and localtime EXPR, the
    on-line PERL manual page states:

	 "Converts a time as returned by the time function to a 9-element
	 array ... [....details omitted....] ... All array elements are
	 numeric, and come straight out of a struct tm."

    The person who cleared up the meaning of this put it this way:

    You don't really need to know. In C, you can lump together existing
    data types into bunches called structures, structs for short. A struct
    tm is declared like this:

          struct tm {
                  int tm_sec;     /* 0-59  seconds */
                  int tm_min;     /* 0-59  minutes */
                  int tm_hour;    /* 0-23  hour */
                  int tm_mday;    /* 1-31  day of month */
                  int tm_mon;     /* 0-11  month */
                  int tm_year;    /* 0-    year - 1900 */
                  int tm_wday;    /* 0-6   day of week (Sunday = 0) */
                  int tm_yday;    /* 0-365 day of year */
                  int tm_isdst;   /* flag: daylight savings time in effect */
                  char **tm_zone; /* abbreviation of time zone name */
                  long tm_gmtoff; /* offset from GMT in seconds */
          };

    Where "int" means an integer, "long" is a different kind of integer
    (PERL doesn't distinguish), and char ** is an indirect reference to a
    string like "EST", "PDT", etc.
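
In practice this means the nine values come back in the same order as the
first nine fields of the struct. A minimal sketch, with variable names
chosen to echo the field names and an output format that is only my own
choice:

```perl
($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
    localtime(time);

# tm_mon counts from 0 and tm_year from 1900, so adjust both for printing.
printf("%04d-%02d-%02d %02d:%02d:%02d\n",
       $year + 1900, $mon + 1, $mday, $hour, $min, $sec);
```

The last two fields of the struct shown above (tm_zone and tm_gmtoff) are
not among the nine elements returned.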

>  literals, pseudo-literals

    A literal refers to a data value "literally", as a constant value. 1, 2,
    and 3 are literal numbers. "Hello world" is a literal string. You
    frequently find literals used in relational expressions, for example:

	 if (num_chars == 5) then do xyz;

    This means that if the variable equals "literally" 5, then do whatever.

    Put another way, a literal is a scalar that appears literally in the
    code, like 15, 3.14159, or 'Hi Mom'. The following are NOT literals:
    $foo, $foo[15], or $foo{'The first digit of pi is 3'}.

    A pseudo-literal is something that looks like a literal in some way.
    Pseudo-literals (in the PERL sense) are created using the ` character,
    (backtick). You can say:

	 $fred = `date`;

    This will run the date command on your machine, and put the output into
    $fred. Contrast this with:

	 $fred = 'date';

    which will put the string of letters 'd','a','t', and 'e' into $fred.
    Now `date` looks like a string, but every time it's evaluated (every
    time the program reaches it, like in $fred = `date`;), the date program
    executes and the result (something like "Mon Feb  5 17:39:29 EST 1990")
    is used. In the example, this is what $fred gets. <FILE> is similar;
    every time you get to that in a program, the next line from the file is
    returned. It's a bit like a literal, but not constant.

>  filehandle

    A filehandle is a unique descriptor associated with a specific file.
    When several files are opened at the same time, each has its own
    filehandle (remember "handle" from CB jargon?). When you want to access
    a particular file you refer to it by its filehandle. 

    PERL gets input and output from files, and as it's going through a file,
    it keeps track of things like where it is in the file. A filehandle (or
    file handle, both forms are used) is used to keep track of this. When you
    print something, it goes to a default filehandle, but you can create
    others (with open()) and use them to read from and print to.
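
A minimal sketch of creating and using filehandles follows. The scratch
file name handle_demo.tmp is hypothetical, and the file is written to the
current directory and removed again at the end.

```perl
$filename = "handle_demo.tmp";	# hypothetical scratch file name

# Writing: the handle OUT stands for the file from open() until close().
open(OUT, ">$filename") || die "Can't open $filename for writing: $!";
print OUT "first line\n";
print OUT "second line\n";
close(OUT);

# Reading: each <IN> picks up where the previous one left off.
open(IN, "<$filename") || die "Can't open $filename for reading: $!";
$first  = <IN>;
$second = <IN>;
close(IN);

unlink($filename);		# remove the scratch file again
print "read: $first", "read: $second";
```

The name of the file appears only in the open() calls; after that,
everything goes through the handle, which is what remembers how far into
the file the program has got.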

    So how does a filehandle differ from a plain ordinary fileNAME?

    On Unix, PERL provides 3 handles for you, called STDIN, STDOUT, and
    STDERR (they were lower case in PERL 2). STDIN is normally attached to
    your terminal, unless you redirect it using the '<' character on the
    Unix command line, or with a pipe. If the "SPLIT_ALPHA"  program above
    were in a file called splitalpha, then the command

	 perl splitalpha

    would read from STDIN, which would be your terminal.

    But the command:

	 perl splitalpha < /etc/passwd

    would still read from STDIN, which would now be the password file.

    And the command:

	 grep 1 /etc/passwd | perl splitalpha

    would again read from STDIN, but that would be the pipe (created with
    the '|' character), which would be the output of the 'grep' command.

    In a similar manner, STDOUT is where PERL would send its output if you
    didn't give the print statement a filehandle, which would be your
    terminal, unless you use '> icarus' to send it to a file named icarus,
    or '| grep THANKS_A_MILLION' which would send it to the program grep
    (which would then print any lines with 'THANKS_A_MILLION' in them).

    STDERR is where error messages go, which will again be your terminal,
    unless you make special arrangements.

The responses to my posting helped an awful lot, and I want to thank
everyone who took the time and effort to provide such clear answers to my
questions. Even if I did not quote you here, I still appreciated every
reply. In no particular order, the people represented in one way or another
in this posting are:

    Colin Plumb		<uiucuxc!lion.waterloo.edu!ccplumb>
    Craig Johnson	vince@intrepid, tc.fluke.COM
    Mike Kirita		mike@rf1, tc.fluke.COM
    John P. Nelson	<decwrl!decvax!genrad.com!jpn>
    Icarus Sparry	<uiucuxc!gdr.bath.ac.uk!I.Sparry>
    Raymond Chen	<uiucuxc!bosco.Berkeley.EDU!raymond>
    Randal Schwartz	<uiucuxc!iwarp.intel.com!merlyn>

In this posting, I tried to construct my summary in a way that made use of
the most compact, and/or clearest, and/or elegant and/or humorous
explanations I received. I wanted to include the answers that came closest
to answering the questions I was actually asking, without too much extra
detail. Most of the answers here include parts of at least three responses.

Again, thank you all. My next novice question: Randal Schwartz has indicated
that his book will not be particularly targeted to a novice audience. Are
there any suggestions for entry-level study until I am able to get his book?
The pun is definitely intentional -- I'll probably be able to "get" the book
about the same time it is available to be "gotten"!




-- 
Gary Benson   -=[S M I L E R]=-   inc@fluke.tc.com

Everybody lies; but it doesn't matter, since nobody listens. -Anonymous