Large relational data are the primary sources of data in the real world. The analysis, exploration, and walk-through of large relational data have immediate applications in many areas. By large, we mean a data set that is too large to be loaded into main memory in its entirety. Large relational data present technical challenges to both data visualization in the front-end and database support in the back-end. Correspondingly, our research on visual data exploration has two parts:
Data aggregated at multiple resolutions should be organized into a hierarchical structure for the purpose of efficient organization and data access. A high dimensional tree index offers an excellent way to piggyback the aggregated data. Such a high dimensional tree index must meet all of the following requirements: (1) it provides a point access method (PAM) instead of a spatial access method (SAM) (because data records are assumed as points in high dimenisonal space); (2) regions represented by sibling nodes are disjoint with each other (no data point is counted more than once); (3) the region represented by a node is totally covered by union of the regions represented by its child nodes (no data point is missing). In short, the high dimensional tree index we choose to piggyback multiresolution data aggregation must be a partition-based PAM. Such a tree index has two functions: (1) to organize multiresolution data aggregation; (2) to accelerate high dimensional range queries. We call such a tree index a data aggregation tree.
We have chosen kdB-tree as the tree index. Each page in a kdB-tree represents a hyperrectangle and contains a binary tree that partitions the hyperrectangle into smaller hyperrectangles. Each leaf node of the binary tree keeps aggregate values of data records in the region represented by the node. Leaf node also contains a PageID pointer, which points to a child page. Binary tree nodes are stored in the page in a pre-order sequence so that the binary tree can be exactly reconstructed. Major database operations (data aggregation, browsing and zooming, and range queries) in visual data exploration can be properly implemented by index-only queries on such an indexing structure.
For data insertion, the original kdB-tree suffers from a cascading split problem. We have modified the page splitting strategy of kdB-tree for data insertion: when a page is full, it splits by following the first partition of the region represented by the page. This strategy avoids cascading splits at the cost of creating an unbalanced tree. In addition, if a leaf page is full and has to split, it splits along the longest dimension of the hyperrectangle it represents. This way of splitting keeps hyperrectangles as hypercubic as possible which helps maintain consistent resolutions in all dimensions.
Figure: A kdB-tree index on disk provides piggyback ride of data aggregation information, where each internal page keeps data aggregate values. Data can be visualized by accessing the aggregate values instead of a large number of individual data records. Access permission can be granted to each user till a specific depth level of the tree and therefore preserves the privacy of high resolution data and individual data records.
As an intermediate representation of large relational data between visualization client and database, data aggregation tree makes visualization tools scalable to a large number of data records. In addition, data aggregation tree supports privacy-preservation of individual data records. Permission can be given to each user according to data resolutions. The tree enables restriction to each user to access the data down to a specific resolution and, therefore, preserves the privacy of individual data records.
Data aggregation tree facilitates both range query and multiresolution data aggregation. It has potential applications in query processing, OLAP, and new data mining algorithms using density data as input. These issues are currently under investigation.
Multiresolution data aggregation represents a new input data format for visualization. We have chosen two example visualization techniques, 3D footprint splatting and parallel coordinates, to support this new input data format.
Figures: Footprint splatting of aggregated data (left) versus density-based parallel coordinates (right). Data hyperrectangles are retrieved by visiting internal nodes of data aggregation tree. In each figure, opacity of the visualization of a data hyperrectangle is a function of number of data points within the data hyperrectangle.
3D footprint splatting is implemented through the integration of several multidimensional data visualization techniques. The basic idea is to combine direct volume rendering with 3D data projection to deal with high dimensional aggregated data. A 3D projection orthogonally transforms n-D data into 3D. A user may choose interesting projections through projection pursuit. A sequence of projections can be rendered with smooth transitions by using grand tour. We have developed multidimensional footprint splatting to render high dimensional data hyperrectangles using a two-step bootstrapping technique. The first step generates a footprint by sampling the high dimensional reconstruction kernel and by applying the splatting algorithm to the samples. The second step visualizes the data by applying the same splatting algorithm to the footprints of data hypercubes. Opacity of each footprint is assigned as a function of the number of data points in the corresponding data hyperrectangle. We have found that 3D texturing of footprints greatly accelerates rendering speed and improves interactivity.
Parallel coordinates provide an excellent metaphor to visualize high dimensional data by horizontally arranging vertical coordinates, one for each dimension. Data hyperrectangles represented by internal nodes of a data aggregation tree can be perfectly displayed as horizontal bands. This is similar to the hierarchical parallel coordinates in XmdvTool. The location and extend of a band across each coordinate depend on the low and high values of the corresponding data hyperrectangle in that dimension. The opacity of each band depends on the density of data records in its corresponding data hyperrectangle. Bands of all hyperrectangles are blended to produce the final image. Zooming and drill-down are supported by accessing aggregated data at deeper levels of the tree index. Parallel coordinates allow a user to specify a horizontal data selection band (the gray band in the above figure). The data selection band represents a query region in data space. This allows the user to visualize a subset of data in various levels of detail.
3D footprint splatting and density-based parallel coordinates are two techniques that can be easily extended to work with multiresolution data aggregation. In general, multiresolution data aggregation calls for new algorithms and techniques in data visualization and user interaction. For example, 2D elevation maps and 3D isosurface rendering would help a user to quantitatively explore high dimensional data. In addition, data mining techniques could be used to guide visual exploration of data. Visual exploration would also benefit from techniques for data pre-fetching and buffer management. These issues are under investigation.
Figures: Overview-and-drill-down of the 1% PUMS (Public Use Microdata Sample) housing data of US Census 2000: (a) an overview of the 1,251,533 housing unit records; (b) outliers and missing values removed; (c) old housing units in the state of New York; (d) utility usage of the selected 3-bedroom housing units. Each of the first three screen snapshots shows a gray data selection band. Each selection band specifies a query region which is then used to retrieve a data subset that are visualized in the next screen snapshot.
Updated April 7, 2007 by Li Yang