[comp.lang.perl] Help with split, replace and number formats.

jmc@eagle.inesc.pt (Miguel Casteleiro) (01/29/91)

Hi there!

I have some problems that I need to solve so I can finish some
perl scripts.

1)  I need to split the following line:

    "This is a line ( test,with ugly typing."

into the array:

    ('"This','is','a','line','(','test,','with','ugly','typing."')

Please note the punctuation characters.
Is there a split pattern to do this?


2)  I need to replace the word: 'teste'
                   by the word: 'aebtecd'

In this replace operation the character 't' gets replaced by 'a'
and 'c' is appended to the word, and the character 's' is replaced
by 'b' and 'd' is appended to the word.
I need this strange replace to use a 7-bit sort to sort 8-bit
(ISO-8859-1) text.
What I need is to translate some characters by some others, and
for each character translated I need to append a character to the
string (something like: tr/ts/ab/cd/ :-).
Is there an easy way to do this?


3)  Finally, is there an easy way to print numbers in the form:

    12345678.12  ->  12,345,678$12


Thanks for any help!
--
                                                                      __
 Miguel Casteleiro at                                            __  ///
 INESC, Lisboa, Portugal.        "News: so many articles,        \\\/// Only
 Email: jmc@eagle.inesc.pt        so little time..."              \XX/ Amiga

tchrist@convex.COM (Tom Christiansen) (01/30/91)

From the keyboard of jmc@eagle.inesc.pt (Miguel Casteleiro):
:Hi there!

Bom Dia!  Voce^s conhecem Perl no Portugal????

:I have some problems that I need to solve so I can finish some
:perl scripts.
:
:1)  I need to split the following line:
:
:    "This is a line ( test,with ugly typing."
:
:into the array:
:
:    ('"This','is','a','line','(','test,','with','ugly','typing."')
:
:Please note the punctuation characters.
:Is there a split pattern to do this?

Well.......  I can think of several ways off the top of my head:

0)  You can split on /([\s,]+)/ and retain the delimiters and then
    run back through the array and merge the ones that are commas
    and toss those that aren't.  Ug.

1)  You could first simply split on white space, and then run back 
    through the array looking for \S,\S and splitting those, but 
    retaining the comma.  Kinda ug.

2)  You can munge the data first to fix the ugly typing:
	s/,(\S)/, $1/g;
    and now split on white space as usual.  This seems best to me
    of these three approaches.
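
A minimal sketch of approach 2 applied to the sample line from the
question.  The variable names ($line, $fixed, @words) are illustrative
only and do not come from the original posts.

```perl
use strict;
use warnings;

# Sample line from the question (variable names are illustrative).
my $line = '"This is a line ( test,with ugly typing."';

# Insert a space after any comma that is directly followed by a non-space.
(my $fixed = $line) =~ s/,(\S)/, $1/g;

# Split on runs of white space.
my @words = split ' ', $fixed;

print join('|', @words), "\n";
# prints: "This|is|a|line|(|test,|with|ugly|typing."
```

Working on a copy ($fixed) leaves the original line untouched, which
may or may not matter for the script in question.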


:2)  I need to replace the word: 'teste'
:                   by the word: 'aebtecd'
:
:In this replace operation the character 't' gets replaced by 'a'
:and 'c' is appended to the word, and the character 's' is replaced
:by 'b' and 'd' is appended to the word.
:I need this strange replace to use a 7-bit sort to sort 8-bit
:(ISO-8859-1) text.
:What I need is to translate some characters by some others, and
:for each character translated I need to append a character to the
:string (something like: tr/ts/ab/cd/ :-).
:Is there an easy way to do this?

I ask myself which 7-bit ascii sort you're using, and why the existing
sorts don't work for 8-bits.  It's the collating sequence, right?

For your example, I did this:

    $_ = 'teste';
    $_ .= 'c' x s/t/a/g;
    $_ .= 'd' x s/s/b/g;
    print "result is $_\n";

or with a different variable:

    $foo = 'teste';
    $foo .= 'c' x ($foo =~ s/t/a/g);
    $foo .= 'd' x ($foo =~ s/s/b/g);
    print "result is $foo\n";


But that yields 'aebaeccd', not what you said you wanted.  Did you
not want all the t's translated?  If you only want the first
one, the /g should be removed.


:3)  Finally, is there an easy way to print numbers in the form:
:
:    12345678.12  ->  12,345,678$12

    $_ = '12345678.12';  # note quotes!!!

    s/\./\$/;
    1 while s/(.*\d)(\d{3})/$1,$2/;

    # result -> "12,345,678$12"

or in euronotation: 

    $_ = '12345678.12';

    s/\./\,/;
    1 while s/(.*\d)(\d{3})/$1.$2/;

    # result -> "12.345.678,12"

The quotes are there so we don't run into problems with floating-point
notation.  Running this first would also help:

    $_ = sprintf("%10.2f", $_);  # discard boring bits
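
The two variants above differ only in the separators, so they could be
folded into one routine.  The following is a sketch, not a definitive
implementation: the name format_number and its parameters are
hypothetical, and it assumes "%.2f" (no field width) so that short
numbers do not pick up leading spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: format a number with a thousands separator and a
# custom decimal mark, using the same substitution loop as above.
sub format_number {
    my ($number, $thousands, $decimal) = @_;
    my $text = sprintf("%.2f", $number);    # two decimals, no padding
    $text =~ s/\./$decimal/;                # swap the decimal point
    # Repeatedly split off the last group of three digits that still
    # has a digit in front of it.
    1 while $text =~ s/(.*\d)(\d{3})/$1$thousands$2/;
    return $text;
}

print format_number('12345678.12', ',', '$'), "\n";   # 12,345,678$12
print format_number('12345678.12', '.', ','), "\n";   # 12.345.678,12
```

Because the fraction is fixed at two digits, the loop can never find
four consecutive digits after the decimal mark, so only the integer
part is grouped.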


--tom
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

jmc@eagle.inesc.pt (Miguel Casteleiro) (01/31/91)

In article <1991Jan29.171228.17738@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of jmc@eagle.inesc.pt (Miguel Casteleiro):
>:Hi there!
>
>Bom Dia!  Voce^s conhecem Perl no Portugal????
                                ^^em
Sim, e mais algumas coisitas!

>:1)  I need to split the following line:
>:
>:    "This is a line ( test,with ugly typing."
>:
>:into the array:
>:
>:    ('"This','is','a','line','(','test,','with','ugly','typing."')
>:
>  [some ways to do it deleted]
>
>2)  You can munge the data first to fix the ugly typing:
>	s/,(\S)/, $1/g;
>    and now split on white space as usual.  This seems best to me
>    of these three approaches.

I'll use this approach.  It seems to work fine.  Thanks!

>:2)  I need to replace the word: 'teste'
>:                   by the word: 'aebtecd'
>:
>:In this replace operation the character 't' gets replaced by 'a'
>:and 'c' is appended to the word, and the character 's' is replaced
>:by 'b' and 'd' is appended to the word.
>:I need this strange replace to use a 7-bit sort to sort 8-bit
>:(ISO-8859-1) text.
>:What I need is to translate some characters by some others, and
>:for each character translated I need to append a character to the
>:string (something like: tr/ts/ab/cd/ :-).
>:Is there an easy way to do this?
>
>I ask myself which 7-bit ascii sort you're using, and why the existing
>sorts don't work for 8-bits.  It's the collating sequence, right?

I'm using 7-bit and 8-bit sorts and none of them do what I want!
The test I gave was incorrect, sorry :-(  What I want is to replace
the word 'teste' by the word
         'aebaecdc'.

I'll explain better what I want.  Let's say that "A" is an "a" with
a grave accent and "B" is an "a" with an acute accent.  So, the
sorting order will be (at least for Portuguese):

a A B b c d e f ...

So, I will have the following sorted words:

Aac
abc
Abc
acbd

Please note that "a" = "A" = "B" only if the words are different
(not counting the characters "a", "A" and "B").  If they are equal
then "a" < "A" < "B".

To accomplish this, the best way I can think of, is to replace:

"a" by "a" and append "a"
"A" by "a" and append "b"
"B" by "a" and append "c"

So, a 7-bit sort will see the previous words as:

aacba
abca
abcb
acbda

and will sort properly.

The code I'm using to do this is:

$word = "Aac";

$_ = $word;
tr/aAB/aaa/;
$sword = $_;

$_ = $word;
tr/aAB//cd;     # delete everything except a, A and B (needs the d)
tr/aAB/abc/;    # a -> a, A -> b, B -> c

$sword .= $_;

and $sword will be 'aacba'.

If there is an easy way to do this, please let me know.  Also, if
someone can think of a better way to sort 8-bit text, please let
me know.
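
A sketch of how the key described above might be used with Perl's own
sort.  The sub name sort_key is hypothetical, and "A" and "B" stand in
for the accented characters exactly as in the explanation above.  The
deleting step uses tr's c and d modifiers together.

```perl
use strict;
use warnings;

# Build the sort key: the word with accents folded away, followed by
# one code letter for each occurrence of "a", "A" or "B".
sub sort_key {
    my ($word) = @_;
    (my $primary = $word) =~ tr/aAB/aaa/;      # fold A and B to "a"
    (my $secondary = $word) =~ tr/aAB//cd;     # keep only a, A and B
    $secondary =~ tr/aAB/abc/;                 # rank them a < A < B
    return $primary . $secondary;
}

my @words  = ('acbd', 'Abc', 'Aac', 'abc');
my @sorted = sort { sort_key($a) cmp sort_key($b) } @words;

print join(' ', @sorted), "\n";   # prints: Aac abc Abc acbd
```

Recomputing the key inside the comparison is wasteful for long lists;
computing each key once up front would be the obvious refinement.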

> [ A solution for the 'teste' example deleted ]
>
>:3)  Finally, is there an easy way to print numbers in the form:
>:
>:    12345678.12  ->  12,345,678$12
>
>    $_ = '12345678.12';  # note quotes!!!
>
>    s/\./\$/;
>    1 while s/(.*\d)(\d{3})/$1,$2/;
>
>    result -> "12,345,678$12"

I'll use this code, Thanks!

> [ A solution for the euronotation deleted ]

raymond@math.berkeley.edu (Raymond Chen) (01/31/91)

In article <1991Jan30.181924.47@eagle.inesc.pt>, jmc@eagle (Miguel Casteleiro) writes:
>[T]he sorting order will be (at least for the portuguese):
>
>a A B b c d e f ...
>
>If there is an easy way to do this, please let me know.  

# This is a standard trick.

# You only need to do this part once.
$portuguese_order = "aABbcdef";
$ascii_order = 
   pack("c" . length($portuguese_order), 1 .. length($portuguese_order));
eval 'sub port2sort 
      { foreach (@_) {  tr/'.$portuguese_order.'/'.$ascii_order.'/; } }';
eval 'sub sort2port
      { foreach (@_) {  tr/'.$ascii_order.'/'.$portuguese_order.'/; } }';

# and here's how you use it:

@words = ("a", "A", "B", "c", "b");

&port2sort(@words);		# convert to intermediate format
@sorted_words = sort @words;	# sort the intermediate format
&sort2port(@sorted_words);	# convert back

print join(":", @sorted_words);

# Observe that a similar trick can be used to perform other types of sorting;
# for example, if you want digits to sort *after* letters, or if you want
# the letter "p" to alphabetize before the letter "h", like this:

for(sort("herl ","Just ","packer,","anotper ")){y/Jahp/Japh/;print;}

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/02/91)

So you start out with a key string, with some magic letters.
You want to build a primary key with the magic letters
replaced by other letters and the non-magic ones retained,
and a secondary key with the magic letters replaced by others
with the non-magic letters deleted.  This is easy, with tr/.../.../:

$primary = $secondary = $_;
$primary =~ tr/<magic>/<replace>/;

$secondary =~ tr/<non-magic>//d;
$secondary =~ tr/<magic>/<replace>/;

$key = $primary . $secondary;

Will that do what you want?

Note: it might be a good idea to ensure the first character of the
secondary key sorts lower than any possible letter in the primary,
using an explicit delimiter (space works well, as does null...)
if necessary.  Without one (if uppercase letters are magic and
their non-magic replacements are lower-case) you get

aBcda -> abcdab
aBcd  -> abcdb

These sort in the order abcdab, abcdb, which is equivalent to aBcda, aBcd,
which isn't what you eventually want if I understand the problem correctly.
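
The effect of the delimiter can be checked with a small sketch.  The
helpers make_key and sort_with are hypothetical, and the sketch assumes
(as in the example above) that uppercase ASCII letters are the magic
ones and map to their lowercase forms in both keys.

```perl
use strict;
use warnings;

# Hypothetical helper: primary key, then $delimiter, then secondary key.
sub make_key {
    my ($word, $delimiter) = @_;
    (my $primary = $word) =~ tr/A-Z/a-z/;      # magic letters replaced
    (my $secondary = $word) =~ tr/A-Z//cd;     # non-magic letters deleted
    $secondary =~ tr/A-Z/a-z/;                 # magic letters replaced
    return $primary . $delimiter . $secondary;
}

# Hypothetical helper: sort words by their keys for a given delimiter.
sub sort_with {
    my ($delimiter, @words) = @_;
    return sort { make_key($a, $delimiter) cmp make_key($b, $delimiter) } @words;
}

print join(' ', sort_with('',  'aBcda', 'aBcd')), "\n";   # aBcda aBcd
print join(' ', sort_with(' ', 'aBcda', 'aBcd')), "\n";   # aBcd aBcda
```

With a space as delimiter the shorter word sorts first, because the
space compares lower than any letter of the longer primary key.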
-- 
	-Colin

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/16/91)

jmc@eagle.inesc.pt (Miguel Casteleiro) wrote:
>Hi there!
>
>I have some problems that I need to solve so I can finish some
>perl scripts.
>
>1)  I need to split the following line:
>
>    "This is a line ( test,with ugly typing."
>
>into the array:
>
>    ('"This','is','a','line','(','test,','with','ugly','typing."')
>
>Please note the punctuation characters.
>Is there a split pattern to do this?

Well, it depends on details, but defining a word as \w+,
whitespace as \s+, and punctuation as [^\w\s]+, and
assuming you want to split after punctuation and before
words, even if there is no explicit space, then just add
one and split as usual:

s/([^\w\s]+)(\w)/$1 $2/g;
@words = split;
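
A sketch of that substitution run against the sample line, to show
which details it depends on.  The variable names are illustrative
only.

```perl
use strict;
use warnings;

my $line = '"This is a line ( test,with ugly typing."';

# Add a space wherever punctuation is directly followed by a word character.
(my $spaced = $line) =~ s/([^\w\s]+)(\w)/$1 $2/g;

my @words = split ' ', $spaced;

print join('|', @words), "\n";
# prints: "|This|is|a|line|(|test,|with|ugly|typing."
```

With these definitions the opening quote is punctuation followed by a
word character, so it becomes an element of its own rather than
staying attached to This as in the requested array.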

>2)  I need to replace the word: 'teste'
>                   by the word: 'aebtecd'
>
>In this replace operation the character 't' gets replaced by 'a'
>and 'c' is appended to the word, and the character 's' is replaced
>by 'b' and 'd' is appended to the word.
>I need this strange replace to use a 7-bit sort to sort 8-bit
>(ISO-8859-1) text.
>What I need is to translate some characters by some others, and
>for each character translated I need to append a character to the
>string (something like: tr/ts/ab/cd/ :-).
>Is there an easy way to do this?

Yes.
$suffix = $_;
$suffix =~ tr/ts//cd;	# delete anything other than t and s
#                 ^^ This 'cd' has *nothing* to do with the one below!
$suffix =~ tr/ts/cd/;	# map t and s to c and d
tr/ts/ab/;
$_ .= $suffix;
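
The same steps wrapped in a sub, as a sketch that can be run on the
'teste' example.  The name translate_and_append is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the steps above.
sub translate_and_append {
    my ($word) = @_;
    (my $suffix = $word) =~ tr/ts//cd;   # keep only the t's and s's
    $suffix =~ tr/ts/cd/;                # t -> c, s -> d
    $word   =~ tr/ts/ab/;                # t -> a, s -> b
    return $word . $suffix;
}

print translate_and_append('teste'), "\n";   # prints: aebaecdc
```

The appended letters come out in the order the translated characters
appear in the word, which gives the corrected target 'aebaecdc'.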
-- 
	-Colin