diff options
Diffstat (limited to 'doc/text.html')
-rw-r--r-- | doc/text.html | 564 |
1 files changed, 564 insertions, 0 deletions
diff --git a/doc/text.html b/doc/text.html new file mode 100644 index 0000000..f4fdac8 --- /dev/null +++ b/doc/text.html @@ -0,0 +1,564 @@ +<!-- =defdoc funtext funtext n --> +<HTML> +<HEAD> +<TITLE>Column-based Text Files</TITLE> +</HEAD> +<BODY> + +<!-- =section funtext NAME --> +<H2><A NAME="funtext">Funtext: Support for Column-based Text Files</A></H2> + +<!-- =section funtext SYNOPSIS --> +<H2>Summary</H2> +<P> +This document contains a summary of the options for processing column-based +text files. + +<!-- =section funtext DESCRIPTION --> +<H2>Description</H2> + +<P> +Funtools will automatically sense and process "standard" +column-based text files as if they were FITS binary tables without any +change in Funtools syntax. In particular, you can filter text files +using the same syntax as FITS binary tables: +<pre> + fundisp foo.txt'[cir 512 512 .1]' + fundisp -T foo.txt > foo.rdb + funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits +</pre> + +<P> +The first example displays a filtered selection of a text file. The +second example converts a text file to an RDB file. The third example +converts a filtered selection of a text file to a FITS binary table. + +<P> +Text files can also be used in Funtools image programs. In this case, +you must provide binning parameters (as with raw event files), using +the bincols keyword specifier: + +<pre> + bincols=([xname[:tlmin[:tlmax:[binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]] +</pre> + +For example: +<pre> + funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10" +</pre> + +<H2>Standard Text Files</H2> + +<P> +Standard text files have the following characteristics: + +<ul> +<li> Optional comment lines start with # +<li> Optional blank lines are considered comments +<li> An optional table header consists of the following (in order): +<ul> + <li> a single line of alpha-numeric column names + <li> an optional line of unit strings containing the same number of cols + <li> an optional line of dashes containing the same number of cols +</ul> +<li> Data lines follow the optional header and (for the present) consist of + the same number of columns as the header. +<li> Standard delimiters such as space, tab, comma, semi-colon, and bar. +</ul> + +<P> +Examples: + +<pre> + # rdb file + foo1 foo2 foo3 foos + ---- ---- ---- ---- + 1 2.2 3 xxxx + 10 20.2 30 yyyy + + # multiple consecutive whitespace and dashes + foo1 foo2 foo3 foos + --- ---- ---- ---- + 1 2.2 3 xxxx + 10 20.2 30 yyyy + + # comma delims and blank lines + foo1,foo2,foo3,foos + + 1,2.2,3,xxxx + 10,20.2,30,yyyy + + # bar delims with null values + foo1|foo2|foo3|foos + 1||3|xxxx + 10|20.2||yyyy + + # header-less data + 1 2.2 3 xxxx + 10 20.2 30 yyyy +</pre> + +<P> +The default set of token delimiters consists of spaces, tabs, commas, +semi-colons, and vertical bars. Several parsers are used +simultaneously to analyze a line of text in different ways. One way +of analyzing a line is to allow a combination of spaces, tabs, and +commas to be squashed into a single delimiter (no null values between +consecutive delimiters). Another way is to allow tab, semi-colon, and +vertical bar delimiters to support null values, i.e. two consecutive +delimiters implies a null value (e.g. RDB file). A successful parser +is one which returns a consistent number of columns for all rows, with +each column having a consistent data type. More than one parser can +be successful. For now, it is assumed that successful parsers all +return the same tokens for a given line. (Theoretically, there are +pathological cases, which will be taken care of as needed). Bad parsers +are discarded on the fly. + +<P> +If the header does not exist, then names "col1", "col2", etc. are +assigned to the columns to allow filtering. Furthermore, data types +for each column are determined by the data types found in the columns +of the first data line, and can be one of the following: string, int, +and double. Thus, all of the above examples return the following +display: +<pre> + fundisp foo'[foo1>5]' + FOO1 FOO2 FOO3 FOOS + ---------- --------------------- ---------- ------------ + 10 20.20000000 30 yyyy +</pre> + +<H2>Comments Convert to Header Params</H2> + +<P> +Comments which precede data rows are converted into header parameters and +will be written out as such using funimage or funhead. Two styles of comments +are recognized: + +<P> +1. FITS-style comments have an equal sign "=" between the keyword and +value and an optional slash "/" to signify a comment. The strict FITS +rules on column positions are not enforced. In addition, strings only +need to be quoted if they contain whitespace. For example, the following +are valid FITS-style comments: + +<pre> + # fits0 = 100 + # fits1 = /usr/local/bin + # fits2 = "/usr/local/bin /opt/local/bin" + # fits3c = /usr/local/bin /opt/local/bin /usr/bin + # fits4c = "/usr/local/bin /opt/local/bin" / path dir +</pre> + +Note that the fits3c comment is not quoted and therefore its value is the +single token "/usr/local/bin" and the comment is "opt/local/bin /usr/bin". +This is different from the quoted comment in fits4c. + +<P> +2. Free-form comments can have an optional colon separator between the +keyword and value. In the absence of quote, all tokens after the +keyword are part of the value, i.e. no comment is allowed. If a string +is quoted, then slash "/" after the string will signify a comment. +For example: + +<pre> + # com1 /usr/local/bin + # com2 "/usr/local/bin /opt/local/bin" + # com3 /usr/local/bin /opt/local/bin /usr/bin + # com4c "/usr/local/bin /opt/local/bin" / path dir + + # com11: /usr/local/bin + # com12: "/usr/local/bin /opt/local/bin" + # com13: /usr/local/bin /opt/local/bin /usr/bin + # com14c: "/usr/local/bin /opt/local/bin" / path dir +</pre> + +<P> +Note that com3 and com13 are not quoted, so the whole string is part of +the value, while comz4c and com14c are quoted and have comments following +the values. + +<P> +Some text files have column name and data type information in the header. +You can specify the format of column information contained in the +header using the "hcolfmt=" specification. See below for a detailed +description. + +<H2>Multiple Tables in a Single File</H2> + +<P> +Multiple tables are supported in a single file. If an RDB-style file +is sensed, then a ^L (vertical tab) will signify end of +table. Otherwise, an end of table is sensed when a new header (i.e., +all alphanumeric columns) is found. (Note that this heuristic does not +work for single column tables where the column type is ASCII and the +table that follows also has only one column.) You also can specify +characters that signal an end of table condition using the <b>eot=</b> +keyword. See below for details. + +<P> +You can access the nth table (starting from 1) in a multi-table file +by enclosing the table number in brackets, as with a FITS extension: + +<pre> + fundisp foo'[2]' +</pre> +The above example will display the second table in the file. +(Index values start at 1 in oder to maintain logical compatibility +with FITS files, where extension numbers also start at 1). + + +<H2>TEXT() Specifier</H2> + +<P> +As with ARRAY() and EVENTS() specifiers for raw image arrays and raw +event lists respectively, you can use TEXT() on text files to pass +key=value options to the parsers. An empty set of keywords is +equivalent to not having TEXT() at all, that is: + +<pre> + fundisp foo + fundisp foo'[TEXT()]' +</pre> + +are equivalent. A multi-table index number is placed before the TEXT() +specifier as the first token, when indexing into a multi-table: + + fundisp foo'[2,TEXT(...)]' + +<P> +The filter specification is placed after the TEXT() specifier, separated +by a comma, or in an entirely separate bracket: + +<pre> + fundisp foo'[TEXT(...),circle 512 512 .1]' + fundisp foo'[2,TEXT(...)][circle 512 512 .1]' +</pre> + +<H2>Text() Keyword Options</H2> + +<P> +The following is a list of keywords that can be used within the TEXT() +specifier (the first three are the most important): + +<DL> + +<P> +<DT> delims="[delims]" +<DD>Specify token delimiters for this file. Only a single parser having these +delimiters will be used to process the file. +<pre> + fundisp foo.fits'[TEXT(delims="!")]' + fundisp foo.fits'[TEXT(delims="\t%")]' +</pre> + +<P> +<DT> comchars="[comchars]" +<DD> Specify comment characters. You must include "\n" to allow blank lines. +These comment characters will be used for all standard parsers (unless delims +are also specified). +<pre> + fundisp foo.fits'[TEXT(comchars="!\n")]' +</pre> + +<P> +<DT> cols="[name1:type1 ...]" +<DD> Specify names and data type of columns. This overrides header +names and/or data types in the first data row or default names and +data types for header-less tables. +<pre> + fundisp foo.fits'[TEXT(cols="x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e")]' +</pre> +<P> +If the column specifier is the only keyword, then the cols= is not +required (in analogy with EVENTS()): +<pre> + fundisp foo.fits'[TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]' +</pre> +Of course, an index is allowed in this case: +<pre> + fundisp foo.fits'[2,TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]' +</pre> + +<P> +<DT> eot="[eot delim]" +<DD> Specify end of table string specifier for multi-table files. RDB +files support ^L. The end of table specifier is a string and the whole +string must be found alone on a line to signify EOT. For example: +<pre> + fundisp foo.fits'[TEXT(eot="END")]' +</pre> +will end the table when a line contains "END" is found. Multiple lines +are supported, so that: +<pre> + fundisp foo.fits'[TEXT(eot="END\nGAME")]' +</pre> +will end the table when a line contains "END" followed by a line +containing "GAME". +<P> +In the absence of an EOT delimiter, a new table will be sensed when a new +header (all alphanumeric columns) is found. + +<P> +<DT> null1="[datatype]" +<DD> Specify data type of a single null value in row 1. +Since column data types are determined by the first row, a null value +in that row will result in an error and a request to specify names and +data types using cols=. If you only have a one null in row 1, you don't +need to specify all names and columns. Instead, use null1="type" to +specify its data type. + +<P> +<DT> alen=[n] +<DD>Specify size in bytes for ASCII type columns. +FITS binary tables only support fixed length ASCII columns, so a +size value must be specified. The default is 16 bytes. + +<P> +<DT> nullvalues=["true"|"false"] +<DD>Specify whether to expect null values. +Give the parsers a hint as to whether null values should be allowed. The +default is to try to determine this from the data. + +<P> +<DT> whitespace=["true"|"false"] +<DD> Specify whether surrounding white space should be kept as part of +string tokens. By default surrounding white space is removed from +tokens. + +<P> +<DT> header=["true"|"false"] +<DD>Specify whether to require a header. This is needed by tables +containing all string columns (and with no row containing dashes), in +order to be able to tell whether the first row is a header or part of +the data. The default is false, meaning that the first row will be +data. If a row dashes are present, the previous row is considered the +column name row. + +<P> +<DT> units=["true"|"false"] +<DD>Specify whether to require a units line. +Give the parsers a hint as to whether a row specifying units should be +allowed. The default is to try to determine this from the data. + +<P> +<DT> i2f=["true"|"false"] +<DD>Specify whether to allow int to float conversions. +If a column in row 1 contains an integer value, the data type for that +column will be set to int. If a subsequent row contains a float in +that same column, an error will be signaled. This flag specifies that, +instead of an error, the float should be silently truncated to +int. Usually, you will want an error to be signaled, so that you can +specify the data type using cols= (or by changing the value of +the column in row 1). + +<P> +<DT> comeot=["true"|"false"|0|1|2] +<DD>Specify whether comment signifies end of table. +If comeot is 0 or false, then comments do not signify end of table and +can be interspersed with data rows. If the value is true or 1 (the +default for standard parsers), then non-blank lines (e.g. lines +beginning with '#') signify end of table but blanks are allowed +between rows. If the value is 2, then all comments, including blank +lines, signify end of table. + +<P> +<DT> lazyeot=["true"|"false"] +<DD>Specify whether "lazy" end of table should be permitted (default is +true for standard formats, except rdb format where explicit ^L is required +between tables). A lazy EOT can occur when a new table starts directly +after an old one, with no special EOT delimiter. A check for this EOT +condition is begun when a given row contains all string tokens. If, in +addition, there is a mismatch between the number of tokens in the +previous row and this row, or a mismatch between the number of string +tokens in the prev row and this row, a new table is assumed to have +been started. For example: +<PRE> + ival1 sval3 + ----- ----- + 1 two + 3 four + + jval1 jval2 tval3 + ----- ----- ------ + 10 20 thirty + 40 50 sixty +</PRE> +Here the line "jval1 ..." contains all string tokens. In addition, +the number of tokens in this line (3) differs from the number of +tokens in the previous line (2). Therefore a new table is assumed +to have started. Similarly: +<PRE> + ival1 ival2 sval3 + ----- ----- ----- + 1 2 three + 4 5 six + + jval1 jval2 tval3 + ----- ----- ------ + 10 20 thirty + 40 50 sixty +</PRE> +Again, the line "jval1 ..." contains all string tokens. The number of +string tokens in the previous row (1) differs from the number of +tokens in the current row(3). We therefore assume a new table as been +started. This lazy EOT test is not performed if lazyeot is explicitly +set to false. + +<P> +<DT> hcolfmt=[header column format] +<DD> Some text files have column name and data type information in the header. +For example, VizieR catalogs have headers containing both column names +and data types: +<PRE> + #Column e_Kmag (F6.3) ?(k_msigcom) K total magnitude uncertainty (4) [ucd=ERROR] + #Column Rflg (A3) (rd_flg) Source of JHK default mag (6) [ucd=REFER_CODE] + #Column Xflg (I1) [0,2] (gal_contam) Extended source contamination (10) [ucd=CODE_MISC] +</PRE> + +while Sextractor files have headers containing column names alone: + +<PRE> + # 1 X_IMAGE Object position along x [pixel] + # 2 Y_IMAGE Object position along y [pixel] + # 3 ALPHA_J2000 Right ascension of barycenter (J2000) [deg] + # 4 DELTA_J2000 Declination of barycenter (J2000) [deg] +</PRE> +The hcolfmt specification allows you to describe which header lines +contain column name and data type information. It consists of a string +defining the format of the column line, using "$col" (or "$name") to +specify placement of the column name, "$fmt" to specify placement of the +data format, and "$skip" to specify tokens to ignore. You also can +specify tokens explicitly (or, for those users familiar with how +sscanf works, you can specify scanf skip specifiers using "%*"). +For example, the VizieR hcolfmt above might be specified in several ways: +<PRE> + Column $col ($fmt) # explicit specification of "Column" string + $skip $col ($fmt) # skip one token + %*s $col ($fmt) # skip one string (using scanf format) +</PRE> +while the Sextractor format might be specified using: +<PRE> + $skip $col # skip one token + %*d $col # skip one int (using scanf format) +</PRE> +You must ensure that the hcolfmt statement only senses actual column +definitions, with no false positives or negatives. For example, the +first Sextractor specification, "$skip $col", will consider any header +line containing two tokens to be a column name specifier, while the +second one, "%*d $col", requires an integer to be the first token. In +general, it is preferable to specify formats as explicitly as +possible. + +<P> +Note that the VizieR-style header info is sensed automatically by the +funtools standard VizieR-like parser, using the hcolfmt "Column $col +($fmt)". There is no need for explicit use of hcolfmt in this case. + +<P> +<DT> debug=["true"|"false"] +<DD>Display debugging information during parsing. + +</DL> + +<H2>Environment Variables</H2> + +<P> +Environment variables are defined to allow many of these TEXT() values to be +set without having to include them in TEXT() every time a file is processed: + +<pre> + keyword environment variable + ------- -------------------- + delims TEXT_DELIMS + comchars TEXT_COMCHARS + cols TEXT_COLUMNS + eot TEXT_EOT + null1 TEXT_NULL1 + alen TEXT_ALEN + bincols TEXT_BINCOLS + hcolfmt TEXT_HCOLFMT +</pre> + +<H2>Restrictions and Problems</H2> + +<P> +As with raw event files, the '+' (copy extensions) specifier is not +supported for programs such as funtable. + +<P> +String to int and int to string data conversions are allowed by the +text parsers. This is done more by force of circumstance than by +conviction: these transitions often happens with VizieR catalogs, +which we want to support fully. One consequence of allowing these +transitions is that the text parsers can get confused by columns which +contain a valid integer in the first row and then switch to a +string. Consider the following table: +<PRE> + xxx yyy zzz + ---- ---- ---- + 111 aaa bbb + ccc 222 ddd +</PRE> +The xxx column has an integer value in row one a string in row two, +while the yyy column has the reverse. The parser will erroneously +treat the first column as having data type int: +<PRE> + fundisp foo.tab + XXX YYY ZZZ + ---------- ------------ ------------ + 111 'aaa' 'bbb' + 1667457792 '222' 'ddd' +</PRE> +while the second column is processed correctly. This situation can be avoided +in any number of ways, all of which force the data type of the first column +to be a string. For example, you can edit the file and explicitly quote the +first row of the column: +<PRE> + xxx yyy zzz + ---- ---- ---- + "111" aaa bbb + ccc 222 ddd + + [sh] fundisp foo.tab + XXX YYY ZZZ + ------------ ------------ ------------ + '111' 'aaa' 'bbb' + 'ccc' '222' 'ddd' +</PRE> +You can edit the file and explicitly set the data type of the first column: +<PRE> + xxx:3A yyy zzz + ------ ---- ---- + 111 aaa bbb + ccc 222 ddd + + [sh] fundisp foo.tab + XXX YYY ZZZ + ------------ ------------ ------------ + '111' 'aaa' 'bbb' + 'ccc' '222' 'ddd' +</PRE> +You also can explicitly set the column names and data types of all columns, +without editing the file: +<PRE> + [sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]' + XXX YYY ZZZ + ------------ ------------ ------------ + '111' 'aaa' 'bbb' + 'ccc' '222' 'ddd' +</PRE> +The issue of data type transitions (which to allow and which to disallow) +is still under discussion. + +<!-- =section funtext SEE ALSO --> +<!-- =text See funtools(n) for a list of Funtools help pages --> +<!-- =stop --> + +<P> +<A HREF="./help.html">Go to Funtools Help Index</A> + +<H5>Last updated: August 3, 2007</H5> + +</BODY> +</HTML> |