Hallo, Some people asked me last week to summerize the answers given to my columns question. I promissed to to do so, so here it is. First, this was my question: >Hallo, > >I have a (text) file with the following format: > >123420010110010010101 (and so on this way => ) >234531111101100101010 >678940000010010111010 >340151010101110101010 >and so on... > >Now let's say the first four characters are a PIN number, >the following two characters the subject's age, and so on. >How can I read such a file split in (fixed) columns ? >I would like to insert tabs at the end of each field, creating >a tab-delimited text file. To all who replied, of course (as someone said yesterday) 'Thanks for the lightbulb over the head' Here are the answers given (in random order) ============================================================= I saw most of the answers and I believe that the fastest way to do it is: while (<>) { $pin = substr ($_, 0, 6); $age = substr ($_, 6, 2); etc... } ============================================================= This is my solution: I assume you are reading the text file through a filehandle: <F> while (<F>) { $_ =~ s/(\d{4})(\d{2})(\d{15})/$1\t$2\t$3/; $fHolder .= $_; } The regular expression groups each line into three sets of 4, 2 and 15 digits respectively and then reconstructs the line with the tab character between grouped sets. Each line thus processed is appended to the variable $fHolder for further processing (or for writing out to disk). ============================================================= Try this (untested) code: ##---------------------------- start sample open(INPUT, "filename") || die; open(OUTPUT, ">filewithtabs") || die; while (<INPUT>) { ## Try to replace 4 digits and 2 digits at the beginning of a line ## with the matched 4 digits, tab, the matched 2 digits, tab. if (s/^(\d{4})(\d\d)/$1\t$2\t/) { ## line 6 (see below) print OUTPUT; } else { ## The match failed, there were not 6 digits at the ## beginning of the input line print "warning: line of unexpected format ($_)\n"; } } close(INPUT); close(OUTPUT); ##---------------------------- end sample If you are certain your input data are correct, more minimal code could replace the "if" code at line 6 with: s/^(.{4})(..)/$1\t$2\t/; print; If you actually want to get values out to manipulate in your program, change the "if" code starting at line 6 to: if ( ($pin, $age, $the_rest) = /^(\d{4})(\d\d)(.*)/ ) { ## Now you can work with $age, $pin, or $the_rest ## Example: $max_age = $age if ($age > $max_age); } else { ## The match failed. print "warning: line of unexpected format ($_)\n"; ## If you prefer, print bad lines to some error file ## instead of the screen--good for large data sets. } ============================================================= Try something along these lines: $template = "a4 a2 a2"; ($pin, $age, $so-on) = unpack($template, $_); The "a4 a2 ..." is defining alpha fields for the unpack instruction. So it'll take the first four chars and make it a field, the next 2 and make it a field, etc. It's quite handy, and with Perl's loose type checking, quite flexible. templates can do a whole lot more, if you want, and are discusses in the programming perl books. ============================================================= while(<>) { print join("\t",unpack("a4a2...",$_)); } where the unpack format describes the fixed format fields... ============================================================= See the Llama book Ch. 17 under Fixed-length Random Access Databases. Try something like this: my ($pin, $age) = unpack ("A4 A2", <>); print ($pin, "\t", $age, "\n"); If you just want to insert tabs (and not process the data) then try: print (join ("\t", unpack ("A4 A2", <>)), "\n"); Don't use regexps because this is faster and easier to understand. ============================================================= Somthing like print "$1\t$2" if /^(\d{4})(\d\d)/; ============================================================= $s = '123420010110010010101'; $s =~ /^(\d{4})(\d{2})(\d)(\d{14})/; #read the regular expressions $pin = $1; #man pages for more info $age = $2; $oth = $3; $misc = $4; print "$pin\t$age\t$oth\t$misc"; #"\t" = tab Or, on the whole file: open(FILE,$filename) || die "$!\n"; open(NEWFILE,">$newfilename") || die "$!\n"; while($s = <FILE>) { $s = '123420010110010010101'; $s =~ /^(\d{4})(\d{2})(\d)(\d{14})/; $pin = $1; $age = $2; $oth = $3; $misc = $4; print NEWFILE "$pin\t$age\t$oth\t$misc"; } close(NEWFILE); close(FILE); ============================================================= You probaly want to look at unpack() for reading the fixed width fields in a while(<FILEHANDLE>) loop and some variant of 'print join()' to write them back out. ============================================================= If I had to do this, I'd use a regular expression. Reading in the file line by line, I'd do something like ($pin,$age,....) = /(\d{4})(\d{2})..../; and if I wanted to tab delimit the results and print them out, simply print "$pin\t$age\t....\n"; ============================================================= The alternative to using a 'regex' is to use 'substr'. It is worth try: $pin = substr($_, 0, 4); $age = substr($_, 4, 2); You could of course do the thing in one hit: $new = substr($_, 0, 4)."\t". substr($_, 4, 2)."\t". substr($_, 6, 2)."\t". etc ============================================================= open(INFILE, "/in/file/name") or die("no in: $!, stopped"); open(OUTFILE, ">/out/file/name") or die("no out: $!, stopped"); while (<INFILE>) { # the numbers following the "a"s indicate field width my($pin, $age, $etc) = unpack("a4a2a99", $_); print OUTFILE "$pin\t$age\t$etc\t\n"; } According to the Camel, 2nd ed., p. 66, using unpack() is more efficient than patterns like /(...)(..)(.....)/ . ============================================================= You'll probably get lots of responses telling you to use "PACK", which is certainly more efficient. Here's how to do it if you want to understand what you're doing :*) Assuming the input file is open for read on INFILE and the output file is open for write on OUTFILE: $INFILE="TheFileToRead"; $OUTFILE=">TheFileToWrite"; #the ">" says we're allowed to write to the file open INFILE | die "Can't open $INFILE to read.\n"; open OUTFILE | die "Can't open $OUTFILE to write.\n"; while (<INFILE>) { #everything inside this WHILE will automatically use each #each successive line of INFILE unless we tell it otherwise #...which we won't @TheData = split(""); #split the line into single character parts $ThePIN=join("",@TheData[1..4]); join the first four into a string $TheAge=join("",@TheData[5..6]); $TheBlah=join("",@TheData[7.blah]); etc etc #finally, print the outfile. print OUTFILE "$ThePIN\t$TheAge\t$TheBlah\tetc\n"; } close OUTFILE; close INFILE; ============================================================= If I were doing this, I'd read in the line of text, and use substr() a few times to pick off the individual items and push them into an array. Like this: while(<>) { @array = (); push (@array, substr($_, 0, 4)); push (@array, substr($_, 4, 4)); [...etc...] print join("\t", @array), "\n"; } ============================================================= # Nico Rozendaal, # # University of Maastricht, # # Dept. of Psychiatry & Neuropsychology, # # Maastricht, The Netherlands. # # e-mail: nico.rozendaal@np.unimaas.nl #