inc@tc.fluke.COM (Gary Benson) (02/05/90)
Hello -- I am a novice not just to PERL, but to programming itself. I work with PERL scripts written by someone else, and fortunately he did a nice job of structuring and commenting his work, so I have been (mostly) able to make any little modifications that have been needed. However, I have a new job facing me that requires me to write a PERL script totally from scratch, and I am finding that my ignorance of programming concepts in general is hindering my understanding of the on-line PERL manual.

I hope it is not inappropriate to pose very rudimentary questions here, but if it is, I'll refrain and look somewhere else for the information I need.

Following is a short list of questions I have right now. They are in order of the urgency with which I think I should learn them.

1. How would I make the first n characters on a line into a variable? What if I needed to make the characters at positions 8-10 a variable?

2. Can PERL create two output files, one containing the "massaged" information from an input file, the other containing data collected while the first file was being created? Is it necessary to do that? It seems that the same thing could be accomplished in an array whose contents are written at the end of the "massaged data" file. Is that true? How would either of these things be done?

3. Could some kind soul give me short, usable, un-esoteric definitions of the following terms? Each of these appears somewhere in the PERL manual:

    data type
    scalars
    arrays of scalars
    associative arrays of scalars
    array element
    struct tm
    literals, pseudo-literals
    filehandle

Thanks in advance for any help. This newsgroup as a whole seems rather advanced, so a mailed answer may be more appropriate than a posting. And if my posting is inappropriate, please just say so and I'll refrain.

--
Gary Benson -=[S M I L E R]=- inc@fluke.tc.com
inc@tc.fluke.COM (Gary Benson) (02/06/90)
Last week I posted a request for the answers to a number of novice questions, and asked that people reply by email rather than posting, as the level of my questions was probably too low to be of general interest.

Well, I have been getting mail! Someone once said that the good thing about unix is that you can ask 10,000 people the same question; and the bad thing is that you then get 10,000 answers!

THANKS to everyone who has responded to my posting. One person recommended that I synopsize/edit the responses and post that for the benefit of other novices. This seemed like a good idea, so I will work on that and post it in the next day or so.

Again, I appreciate the many detailed and clear answers I received, and it even helped to be told "I don't understand this question", because it showed me the many ways it is possible to be imprecise when speaking of things as complex as PERL.

One person wrote that PERL was "created by Super Macho Hackers to keep their Maniac Power Urges under control". I don't know about that, but I am certain that this person's next statement is right on the mark: PERL is *not* a trivial language.

--
Gary Benson -=[S M I L E R]=- inc@fluke.tc.com
inc@tc.fluke.COM (Gary Benson) (02/09/90)
This posting synopsises the many responses to my recent request for answers to novice questions. I had intended to do this for my own use anyway, so I hope that other beginners will benefit from these explanations.

In every case, I restate my original question (adding clarification where needed), then put together an answer composed of the best parts of the answers I received. I have not simply listed everyone's answers; I agree with the person who suggested I post this when he said that there is not much value in just listing all the responses.

>1. How would I make the first n characters on a line into a variable?
>   What if I needed to make the characters at positions 8-10 a variable?

Several people pointed out that this is too ambiguous to be answered without more information, so first I'll clarify the question. I need a program that manipulates and extracts data from an input file made up of very structured information, a database in fact. For example, one "record" looks like this:

    L52ALNA001 AS3514PM5712 3 DIGITAL MULTIMETER 000100001EA 01A
    L52ALNA001 H00029446 057 12001 3 03A
    L52ALNA001 1 A9 B 403110040100 01C
    L52ALNA001 A001 01D

The first field in each line is always "L52ALNA" followed by a three digit sequence code. One thing the program must do is to resequence these numbers. To construct the new number, I thought it would be handy to have the unvarying part called by a name (make L52ALNA into a variable), and to make the varying part another variable. I see now that I do not need to do anything with the unvarying part, because the "digits of interest" are the three numbers in columns 7, 8, and 9 (counting the first column as "0"). I only need to be concerned with the part of the number that changes.
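For the record layout described above, here is a minimal sketch of pulling the two parts of the first field apart and writing a new sequence number back in. It uses Perl's built-in substr and sprintf functions and is written for a current Perl 5 (the "my" declarations and "use strict" did not exist in the Perl of 1990). The helper names split_record_id and resequence are made up for illustration, and the sketch assumes every line starts with a 7-character prefix followed by exactly three digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a record line into its fixed 7-character
# prefix and its 3-digit sequence code (zero-based columns 7, 8, and 9).
sub split_record_id {
    my ($line) = @_;
    my $prefix = substr($line, 0, 7);   # the unvarying part, "L52ALNA"
    my $seq    = substr($line, 7, 3);   # the varying part, e.g. "001"
    return ($prefix, $seq);
}

# Hypothetical helper: return a copy of the line with a new sequence
# number, zero-padded to three digits, in place of the old one.
sub resequence {
    my ($line, $new_seq) = @_;
    return substr($line, 0, 7) . sprintf("%03d", $new_seq) . substr($line, 10);
}

my $line = "L52ALNA001 AS3514PM5712 3 DIGITAL MULTIMETER 000100001EA 01A";
my ($prefix, $seq) = split_record_id($line);
print "$prefix $seq\n";              # prints "L52ALNA 001"
print resequence($line, 42), "\n";   # the line now starts with "L52ALNA042"
```

Keeping the prefix and the digits as separate pieces means only the digits need to be rebuilt when the records are renumbered.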
To answer the question, many people made the assumption that this is what I was getting at, and explained thus:

Assuming that the line is in $_ (which is true if you use the "while (<>)" construct), then you can use the "substr" function to extract characters from it. For example,

    substr($_, 0, 7)

returns a string with the first 7 characters of the line (if you've modified $[, the index base, you may need to change the second argument to $[). Assuming you count "positions" starting at one, and you have left $[ at zero, then extracting the characters at positions 8-10 is

    substr($_, 7, 3)

(it might be more intuitive if you set $[ to one; then the correct expression is substr($_, 8, 3)).

This was the most frequently mentioned method, and apparently is the "generic" solution. One person took a slightly different approach, one that I think is perhaps simpler, but may not be as useful in terms of using the information later in the program:

To match characters at the beginning of a line, try:

    /^(.{n})/    (with the actual number in place of n)

or

    /^(....)/

Put however many dots you need to match n characters. In each case the characters matched get assigned to a variable called $1. To match characters as in the second part of the question, use:

    /^(.{7})(...)/

This assigns the first 7 characters to $1 and the 8th through 10th characters to $2. (The repeat count has to go inside the parentheses; written as (.){7}, the parentheses would capture only one character.)

> 2. Can PERL create two output files, one containing the "massaged"
> information from an input file, the other containing data collected while
> the first file was being created? Is it necessary to do that? It seems
> that the same thing could be accomplished in an array whose contents are
> written at the end of the "massaged data" file. Is that true? How would
> either of these things be done?

Again, my question was phrased in an ambiguous way.
To me, "massaging" data meant that I would locate a piece of information in an input file (for example the 3 numbers in columns 7, 8, and 9), and when I wrote the output file, I would correct the sequencing of these numbers so that they began 001, 002, 003, and so on, but so that each line of the record began with the same numerical identifier. In other words, every line of record 001 carries the record sequence number. The beginning of records is identified by a line having the value 01A at the end of the line (columns 78, 79, and 80). I wondered if I could use that to identify the end of each record and increment the sequence number.

My meaning was that I wanted to create two output files: one was to look just like the input file, but with some changes to the information, and the other was to contain statistics about the first. For example, say I needed to count how many of the records had a certain value in the "RECORD TYPE" field. OUT_FILE_1 would have all the records, while OUT_FILE_2 would contain the total number of each record type.

When I asked if that was necessary, I meant to be asking if writing two files was the only way to accomplish this. Instead, could I keep the "RECORD TYPE" totals in an array and write it into a single output file?

Perhaps the clearest answer to this question was this one:

This program opens one input file and two output files. It prints to "outfile1" the input number incremented by 1. It prints to "outfile2" the letter in variable "$ch" which is incremented each time.

Note: you can increment character strings, but as the manual says, if you ever use the variable for anything other than a character string, it won't work. I tried putting characters in the input file, but since sometimes there are numbers, it didn't work. I didn't try any other examples. The related portion of the manual is shown below:

The autoincrement operator has a little extra built-in magic to it.
If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has only been used in string contexts since it was set, and has a value that is not null and matches the pattern /^[a-zA-Z]*[0-9]*$/, the increment is done as a string, preserving each character within its range, with carry:

    print ++($foo = '99'); # prints '100'
    print ++($foo = 'a0'); # prints 'a1'
    print ++($foo = 'Az'); # prints 'Ba'
    print ++($foo = 'zz'); # prints 'aaa'

The autodecrement is not magical.

Now, you can read from and write to the same file if you want to open and close the file. Also, you can open a file as read/write, but look under "open" in the manual to see if it will do what you expect it to do.

#### PROGRAM ################################################################
#! /usr/public/perl

$debug = 0;
$FALSE = 0;
$TRUE = 1;

# open files
open(IN, "<infile");
open(OUT1,">outfile1");
open(OUT2,">outfile2");

$ch = "a";

# read data while file named "infile" is not empty
while(<IN>)
{
    $inval = $_;

    # increment data from input file and print to output file named "outfile1"
    printf OUT1 "%d\n",++$inval;

    # increment character and print to "outfile2"
    printf OUT2 "%s\n",++$ch;
}

# close files (using the same filehandle names that were opened above)
close(IN);
close(OUT1);
close(OUT2);
#### infile ##################################################################
1
2
3
4
#### outfile1 ################################################################
2
3
4
5
#### outfile2 ################################################################
b
c
d
e
##############################################################################

Here is another program that creates 2 output files. This program determines if the first character in the line is a letter or not and sends those lines beginning with letters to one file, and those that do not begin with a letter to another. This is exactly the kind of thing I was asking about.
### PROGRAM SPLIT_ALPHA ####################################################
#!/usr/local/bin/perl

open(alpha,">alphabetic") || die "Can't open alphabetic";
open(nonalpha,">nonalpha") || die "Can't open nonalpha";

while(<>)
{
    if (/^[a-zA-Z]/)
    {
        # line begins with a letter
        print alpha $_;
    }
    else
    {
        # line begins with anything else
        print nonalpha $_;
    }
}

close(alpha);
close(nonalpha);
###############################################################################

My third question asked for definitions of terms that I had come across in reading the PERL manual, but which none of my normal reference books defined adequately.

> 3. Could some kind soul give me short, usable, un-esoteric definitions of
> the following terms? Each of these appears somewhere in the PERL manual:

> data type

The type of data! Data is stored and interpreted by a language in different ways. In most programming languages, there are different kinds of variables for different kinds of data, just like you use 3x5 index cards for some things (which can be quite varied) and income tax forms for others (which are pretty specific - they're a bit stiff for toilet paper, but make good bird-cage liner :-}). For example, in C, you have different kinds of variables for characters, integers, floating-point values and more complicated combinations of these.

PERL really only supports two data types: strings and numbers. Strings contain ASCII characters, and have an associated length. Numbers are numeric quantities (either floating point or integers), have a value, and may participate in numeric operations such as addition or multiplication.

In PERL, conversion between strings and numbers is trivial and automatic; if you use a number in a context that requires a string, the number is converted to a string that represents that numeric value. Similarly, if a string is used in a context which requires a number, PERL will attempt to convert the string to a numeric value by looking for ASCII digits which might represent a numeric value.
(If no digits are found, the conversion results in the numeric value 0).

PERL distinguishes the following ways of organizing data: as scalars, arrays of scalars, and associative arrays of scalars.

> scalars

A scalar is a simple data type, as opposed to a compound data type. A scalar holds a single value: either one number or one string (a string being a one-dimensional array of characters). The term is generally used to refer to the basic types a language supports. All of the types mentioned for C are scalars. PERL just has one type "scalar", which covers both numbers and strings. Given a scalar variable, you can stick anything that looks like a number or a string into it.

> arrays of scalars

An array is a set of data, together with an index, usually written something like

    a[5]=20;

In this case the array is called 'a', the index is 5, and the value of the array element is 20.

You might think of houses, with their occupants. The street (array) might be called washington_row, and the houses (1, 2, 3) contain "Mr Smith", "Mr Jones", and "Mr Smith". (Culture gap - In England houses are numbered sequentially, in the US I understand they are numbered in yards from some fixed point or other, try to think English for this example). Note that each house has a unique address (index), but there is nothing to stop different houses having the same contents.

As a way of organizing data, an array is just a numbered bunch of scalars, like a numbered list. $array[0] is one, $array[1] is another... as many as you like.

For a simple example of the use of an associative array, you can record a class's marks by simply doing:

    $mark{'John'} = 83;
    $mark{'Michael'} = 79;
    $mark{'Holly'} = 84;

etc. Here is another way to handle the same situation, this time with an ordinary array. Suppose you're collecting statistics about a class, with an input file having lines like the following:

    100 Johnny B Good
    12 J. Random Loser

You could gather statistics about the distribution of marks in a class with a program like the following:

    while (<>) {
        @fields = split;        # Break line into bits
        $mark = $fields[0];     # Scalar "$mark" is set to the first bit
        $array[$mark]++;        # Increment the corresponding array entry
    }
    for ($i = 0; $i <= 100; $i++) {
        print "$array[$i] students received mark $i.\n";
    }

Arrays start at $array[0] and work upwards for as many entries as they have. For example, if you set $array[1000] to something, then PERL will clear out $array[0] through $array[999] in anticipation of your needing them. Actually, arrays start at the index held in the variable $[, which is initially 0, but can be set to, say, 1 if you prefer.

> associative arrays of scalars

An array of scalars which is indexed by "association" with a string rather than by simple sequential numbers. A simple example might be to associate "name" with "gary" in an array called personal_data. Now when $personal_data{'name'} is referenced, "gary" is the value returned.

An associative array is like a normal array except you don't have to use consecutive numbers. This makes things a lot slower, as a practical matter, but it's really handy when you need it.

In the previous example about the houses on an English street making up an array, the houses might have names "Oaklands", "DunRoaming", "Fairview" etc., which may often be more useful than using numbers to identify them.

Or consider you want to build a file that lets you go from employee name to employee number. To look for Fred's number, you could start at one, see if the name of employee number 1 is fred, if so print 1 and stop, else see if employee number 2's name is fred, if so print 2 etc etc, but what would be nicer would be to say

    print $employee_number{'Fred'};

and have the answer come straight out. Few languages allow you to do this, but PERL, Awk, (and perhaps Snobol) do.
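The employee lookup just described can be sketched as follows. This is only an illustration under stated assumptions: the names and numbers are made-up sample data, employee_number_for is a hypothetical helper, and the code is written for a current Perl 5 rather than the Perl of 1990.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data: employee names mapped to employee numbers.
my %employee_number = (
    'Fred'   => 17,
    'Wilma'  => 4,
    'Barney' => 23,
);

# Hypothetical helper: look a name up directly by its string key.
# Returns undef when the name is not in the table.
sub employee_number_for {
    my ($name) = @_;
    return exists $employee_number{$name} ? $employee_number{$name} : undef;
}

print "Fred is employee number ", employee_number_for('Fred'), "\n";

# keys() returns the stored names in no particular order,
# so sort them to get predictable output.
for my $name (sort keys %employee_number) {
    print "$name => $employee_number{$name}\n";
}
```

The point of the design is that no loop over employee numbers is needed to find Fred; the string 'Fred' is itself the index.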
If you need to keep a running count of something happening in your program, you will probably end up using an associative array. For example, if you are trying to keep track of how many 'foo', 'bar', and 'bletch' occurrences are in a file, you can do something like:

    while (<>) {
        next unless /(foo|bar|bletch)/;
        $found = $1;        # $found is now 'foo', 'bar', or 'bletch'
        $count{$found}++;   # increment $count{'foo'}, or whatever
    }
    for $key (sort keys %count) {
        print "$key was found $count{$key} times\n";
    }

See the keys(), values(), and each() functions for more things you can do with associative arrays.

> array element

An array is a means of organizing data such that an "index" allows access to each individual data item. A simple array of 4 items (elements) called blue consists of four data storage locations blue[0], blue[1], blue[2], and blue[3]. The number that distinguishes between the elements is the index.

For another example, say that @foo is an array and %bar is an associative array; then $foo[0], $foo[1], $foo[2], etc. are the elements of @foo. Each has a value, e.g.

    $foo[0] = 'John';
    $foo[1] = 'Michael';
    $foo[2] = 'Holly';

which are also referred to as the array's elements. The same holds for $bar{'John'}, etc.

> struct tm

A structure, used in Unix to hold 'time'. A struct statement defines a data structure. A data structure is a way of grouping a set of related data under a common name. PERL does not support data structures. However, in the discussions of gmtime EXPR and localtime EXPR, the on-line PERL manual page states:

"Converts a time as returned by the time function to a 9-element array ... [....details omitted....] ... All array elements are numeric, and come straight out of a struct tm."

The person who cleared up the meaning of this put it this way:

You don't really need to know. In C, you can lump together existing data types into bunches called structures, structs for short.
A struct tm is declared like this:

    struct tm {
        int tm_sec;      /* 0-59 seconds */
        int tm_min;      /* 0-59 minutes */
        int tm_hour;     /* 0-23 hour */
        int tm_mday;     /* 1-31 day of month */
        int tm_mon;      /* 0-11 month */
        int tm_year;     /* 0- year - 1900 */
        int tm_wday;     /* 0-6 day of week (Sunday = 0) */
        int tm_yday;     /* 0-365 day of year */
        int tm_isdst;    /* flag: daylight savings time in effect */
        char **tm_zone;  /* abbreviation of time zone name */
        long tm_gmtoff;  /* offset from GMT in seconds */
    };

Where "int" means an integer, "long" is a different kind of integer (PERL doesn't distinguish), and char ** is an indirect reference to a string like "EST", "PDT", etc.

> literals, pseudo-literals

A literal refers to a data value "literally", as a constant value. 1, 2, and 3 are literal numbers. "Hello world" is a literal string. You frequently find literals used in relational expressions, for example:

    if (num_chars == 5) then do xyz;

This means that if the variable equals "literally" 5, then do whatever.

Put another way, a literal is a scalar that appears literally in the code, like 15, 3.14159, or 'Hi Mom'. The following are NOT literals: $foo, $foo[15], or $foo{'The first digit of pi is 3'}.

A pseudo-literal is something that looks like a literal in some way. Pseudo-literals (in the PERL sense) are created using the ` character (backtick). You can say:

    $fred = `date`;

This will run the date command on your machine, and put the output into $fred. Contrast this with:

    $fred = 'date';

which will put the string of letters 'd','a','t', and 'e' into $fred.

Now `date` looks like a string, but every time it's evaluated (every time the program reaches it, like in $foo = `date`;), the date program executes and the result (something like "Mon Feb 5 17:39:29 EST 1990") is used. In the example, this is what $foo gets. <FILE> is similar; every time you get to that in a program, the next line from the file is returned. It's a bit like a literal, but not constant.
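To make the contrast between a quoted literal and a backtick pseudo-literal concrete, here is a small sketch. It swaps in the echo command for date so that the output is predictable, and it assumes a Unix-like system where echo is available; it is written for a current Perl 5.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A literal: the four characters d, a, t, e. Nothing is executed.
my $literal = 'date';

# A backtick "pseudo-literal": the command between the backticks is run
# each time this statement executes, and its output becomes the value.
# Assumes a Unix-like system where the echo command is available.
my $output = `echo hello`;
chomp($output);    # remove the trailing newline that the command printed

print "literal: $literal\n";   # prints "literal: date"
print "command: $output\n";    # prints "command: hello"
```

Replacing echo hello with date inside the backticks would give a different value on every run, which is exactly why such an expression is only "pseudo" literal.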
> filehandle

A filehandle is a unique descriptor associated with a specific file. When several files are opened at the same time, each has its own filehandle (remember "handle" from CB jargon?). When you want to access a particular file you refer to it by its filehandle.

PERL gets input and output from files, and as it's going through a file, it keeps track of things like where it is in the file. A filehandle (or file handle, both forms are used) is used to keep track of this. When you print something, it goes to a default filehandle, but you can create others (with open()) and use them to read from and print to.

So how does a filehandle differ from a plain ordinary fileNAME? On Unix, PERL provides 3 handles for you, called STDIN, STDOUT, and STDERR (they were lower case in PERL 2). STDIN is normally attached to your terminal, unless you redirect it using the '<' character on the Unix command line, or with a pipe. If the "SPLIT_ALPHA" program above were in a file called splitalpha, then the command

    perl splitalpha

would read from STDIN, which would be your terminal. But the command:

    perl splitalpha < /etc/passwd

would still read from STDIN, which would now be the password file. And the command:

    grep 1 /etc/passwd | perl splitalpha

would again read from STDIN, but that would be the pipe (created with the '|' character), which would be the output of the 'grep' command.

In a similar manner, STDOUT is where PERL would send its output if you didn't give the print statement a filehandle, which would be your terminal, unless you use '> icarus' to send it to a file named icarus, or '| grep THANKS_A_MILLION' which would send it to the program grep (which would then print any lines with 'THANKS_A_MILLION' in them). STDERR is where error messages go, which will again be your terminal, unless you make special arrangements.

The responses to my posting helped an awful lot, and I want to thank everyone who took the time and effort to provide such clear answers to my questions.
Even if I did not quote you here, I still appreciated every reply. In no particular order, the people represented in one way or another in this posting are:

    Colin Plumb <uiucuxc!lion.waterloo.edu!ccplumb>
    Craig Johnson vince@intrepid, tc.fluke.COM
    Mike Kirita mike@rf1, tc.fluke.COM
    John P. Nelson <decwrl!decvax!genrad.com!jpn>
    Icarus Sparry <uiucuxc!gdr.bath.ac.uk!I.Sparry>
    Raymond Chen <uiucuxc!bosco.Berkeley.EDU!raymond>
    Randal Schwartz <uiucuxc!iwarp.intel.com!merlyn>

In this posting, I tried to construct my summary in a way that made use of the most compact, and/or clearest, and/or elegant and/or humorous explanations I received. I wanted to include the answers that came closest to answering the questions I was actually asking, without too much extra detail. Most of the answers here include parts of at least three responses. Again, thank you all.

My next novice question: Randal Schwartz has indicated that his book will not be particularly targeted to a novice audience. Are there any suggestions for entry-level study until I am able to get his book? The pun is definitely intentional -- I'll probably be able to "get" the book about the same time it is available to be "gotten"!

--
Gary Benson -=[S M I L E R]=- inc@fluke.tc.com
Everybody lies; but it doesn't matter, since nobody listens. -Anonymous