summaryrefslogtreecommitdiffstats
path: root/funtools/doc/idx.html
diff options
context:
space:
mode:
Diffstat (limited to 'funtools/doc/idx.html')
-rw-r--r--funtools/doc/idx.html216
1 files changed, 216 insertions, 0 deletions
diff --git a/funtools/doc/idx.html b/funtools/doc/idx.html
new file mode 100644
index 0000000..a08b10b
--- /dev/null
+++ b/funtools/doc/idx.html
@@ -0,0 +1,216 @@
+<!-- =defdoc funidx funidx n -->
+<HTML>
+<HEAD>
+<TITLE>Table Filtering with Indexes</TITLE>
+</HEAD>
+<BODY>
+
+<!-- =section funidx NAME -->
+<H2><A NAME="funidx">Funidx: Using Indexes to Filter Rows in a Table</A></H2>
+
+<!-- =section funidx SYNOPSIS -->
+<H2>Summary</H2>
+<P>
+This document contains a summary of the user interface for
+filtering rows in binary tables with indexes.
+
+<!-- =section funidx DESCRIPTION -->
+<H2>Description</H2>
+<P>
+Funtools <A HREF="./filters.html">Table Filtering</A> allows rows in a
+table to be selected based on the values of one or more columns in the
+row. Because the actual filter code is compiled on the fly, it is very
+efficient. However, for very large files (hundreds of Mb or larger),
+evaluating the filter expression on each row can take a long time. Therefore,
+funtools supports index files for columns, which are used automatically during
+filtering to reduce dramatically the number of row evaluations performed.
+The speed increase for indexed filtering can be an order of magnitude or
+more, depending on the size of the file.
+
+<P>
+The <A HREF="./programs.html#funindex">funindex</A> program creates an
+index on one or more columns in a binary table. For example, to create an index
+for the column pi in the file huge.fits, use:
+<PRE>
+ funindex huge.fits pi
+</PRE>
+This will create an index named huge_pi.idx.
+
+<P>
+When a filter expression is initialized for row evaluation, funtools
+looks for an index file for each column in the filter expression. If
+found, and if the file modification date of the index file is later
+than that of the data file, then the index will be used to reduce the
+number of rows that are evaluated in the filter. When
+<A HREF="./regions.html">Spatial Region Filtering</A> is part of the
+expression, the columns associated with the region are checked for index
+files.
+
+<P>
+If an index file is not available for a given column, then in general,
+all rows must be checked when that column is part of a filter
+expression. This is not true, however, when a non-indexed column is
+part of an AND expression. In this case, only the rows that pass the
+other part of the AND expression need to be checked. Thus, in some cases,
+filtering speed can increase significantly even if all columns are not
+indexed.
+
+<P>
+Also note that certain types of filter expression syntax cannot make
+use of indices. For example, calling functions with column names as
+arguments implies that all rows must be checked against the function
+value. Once again, however, if this function is part of an AND
+expression, then a significant improvement in speed still is possible
+if the other part of the AND expression is indexed.
+
+<P>
+For example, note below the dramatic speedup in searching a 1 Gb
+file using an AND filter, even when one of the columns (pha) has no
+index:
+
+<PRE>
+ time fundisp \
+ huge.fits'[idx_activate=0,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
+ "x y pha"
+ x y pha
+ ---------- ----------- ----------
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 42.36u 13.07s 6:42.89 13.7%
+
+ time fundisp \
+ huge.fits'[idx_activate=1,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
+ "x y pha"
+ x y pha
+ ---------- ----------- ----------
+ idxeq: [INDEF]
+ idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
+ idxand(1): INDEF [IDX_OR_SORT]
+ idxall(1): [IDX_OR_SORT]
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 3999.48 4000.47 2348
+ 1.55u 0.37s 1:19.80 2.4%
+</PRE>
+
+When all columns are indexed, the increase in speed can be even more dramatic:
+<PRE>
+ time fundisp \
+ huge.fits'[idx_activate=0,idx_debug=1,pi=770&&cir 4000 4000 1]' \
+ "x y pi"
+ x y pi
+ ---------- ----------- ----------
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 42.60u 12.63s 7:28.63 12.3%
+
+ time fundisp \
+ huge.fits'[idx_activate=1,idx_debug=1,pi=770&&cir 4000 4000 1]' \
+ "x y pi"
+ x y pi
+ ---------- ----------- ----------
+ idxeq: pi start=9473025,stop=9492240 => pi[ROW 9473025:9492240]
+ idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
+ idxor sort/merge: pi[ROW 9473025:9492240] [IDX_OR_SORT]
+ idxmerge(5): [IDX_OR_SORT] pi[ROW]
+ idxall(1): [IDX_OR_SORT]
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 3999.48 4000.47 770
+ 1.67u 0.30s 0:24.76 7.9%
+</PRE>
+
+<P>
+The miracle of indexed filtering (and indeed, of any indexing) is the
+speed of the binary search on the index, which is of order log2(n)
+instead of n. (The funtools binary search method is taken from
+http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary, to whom
+grateful acknowledgement is made.) This means that the larger the
+file, the better the performance. Conversely, it also means that for
+small files, using an index (and the overhead involved) can slow
+filtering down somewhat. Our tests indicate that on a file containing
+a few tens of thousands of rows, indexed filtering can be 10 to 20
+percent slower than non-indexed filtering. Of course, your mileage
+will vary with conditions (disk access speed, amount of available
+memory, process load, etc.)
+
+<p>
+Any problem encountered during index processing will result in
+indexing being turned off, and replaced by filtering all rows. You can turn
+filtering off manually by setting the idx_activate variable to 0 (in a filter
+expression) or the FILTER_IDX_ACTIVATE environment variable to 0 (in the global
+environment). Debugging output showing how the indexes are being processed can
+be displayed to stderr by setting the idx_debug variable to 1 (in a filter
+expression) or the FILTER_IDX_DEBUG environment variable to 1 (in the global
+environment).
+
+<p>
+Currently, indexed filtering only works with FITS binary tables and raw
+event files. It does not work with text files. This restriction might be
+removed in a future release.
+
+<!-- =section funidx SEE ALSO -->
+<!-- =text See funtools(n) for a list of Funtools help pages -->
+<!-- =stop -->
+
+<P>
+<A HREF="./help.html">Go to Funtools Help Index</A>
+
+<H5>Last updated: August 3, 2007</H5>
+
+</BODY>
+</HTML>