This is the home page of the project "Projection and Interactive Exploration of Large Relational Data" in the Department of Computer Science at Western Michigan University. This project is supported by NSF Grant IIS-0414857.
Large relational data are the primary sources of data in the real world. The analysis, exploration, and walk-through of large relational data have important applications. Our research on visual data exploration has two major parts:
- The first part is on database support, which focuses on how to efficiently organize large relation data on the secondary storage to facilitate typical database operations incurred by interactive data visualization and exploration. These operations include browsing/panning, overview-and-drill-down, and brushing/picking. The efficient execution of these non-traditional database operations requires the data be aggregated (to provide fast overview of data), be organized into multiple resolutions (to facilitate zooming and drill-down), and be spatially indexed in data space (to support range queries for data selection and panning). These operations call for a new common representation of data between databases and visualization clients. Our solution to this problem is to have the data aggregated in multiple resolutions and to have the aggregated data piggybacked on internal nodes of a high dimensional tree index.
- The second part is on extending existing visualization techniques and/or creating new ones that may take data aggregation tree as input. Data aggregation tree provides a multiresolution compact representation of large relational data. We have extended and combined two primary visualization techniques: (1) 3D projection using footprint splatting guided by grand tour, (2) density-based parallel coordinates. Both work by visualizing data aggregation information at multiple resolutions.
A client-server software tool is developed as part of this project. The server runs a data management system KDBMS, which manages data aggregation trees and provides multi-resolution data to clients. The client visualization tool supports scatter plot, footprint splatting, and parallel coordinates. Multiple visualization clients can get data from a server using TCP/IP connections. The client tool supports several graphical user interactions, including overview-and-drill-down by allowing users to interactively specify subsets of data for further visualization. The visualization client can dynamically load various libraries including visualization libraries and data connection libraries. New visualization techniques using multiresolution data from KDBMS can be easily integrated into the client.
The software tool
Data aggregated at multiple resolutions should be organized into a hierarchical structure for efficient data organization and data access. A high dimensional tree index we choose to piggyback multiresolution data aggregation must provide a partition-based point access method. Such a tree index has two functions: (1) to organize multiresolution data aggregation; (2) to accelerate high dimensional range queries. We have chosen kdB-tree, which is a harddisk extension of kd-tree, as the tree index for its simplicity. Each node in a kd-tree represents a hyperrectangle and contains a binary tree that partitions the hyperrectangle into smaller hyperrectangles. Each node keeps aggregate values of data records in the region represented by the node. Binary tree nodes are packed into a disk page in a pre-order sequence so that the binary tree can be exactly reconstructed. Major database operations (data aggregation, browsing and zooming, and range queries) in visual data exploration can be properly implemented by index-only queries on such an indexing structure.
For data insertion, the original kdB-tree suffers from a cascading split problem. We have modified the page splitting strategy of the kdB-tree for data insertion: when a page is full, it splits by following the first partition of the region represented by the page. This strategy avoids cascading splits at the cost of creating an unbalanced tree. In addition, if a leaf page is full and has to split, it splits along the longest dimension of the hyperrectangle it represents. This way of splitting helps to keep hyperrectangles as hypercubic as possible.
A kdB-tree index on disk provides piggyback ride of data aggregation information, where each internal page keeps data aggregate values. Data can be visualized by accessing the aggregate values instead of a large number of individual records.
As an intermediate representation of large relational data between visualization client and database, data aggregation tree makes visualization tools scalable to a large number of data records. In addition, data aggregation tree supports privacy-preservation of individual data records. Access permission can be granted to each user till a specific depth level of the tree and therefore preserves the privacy of high resolution data and individual data records.
Data aggregation tree facilitates both range query and multiresolution data aggregation. It has potential applications to accelerate data aggregation queries and OLAP queries. It can be used as data input for efficient mining of large data sets. These issues are parts of our future work.
This research project consists of three technical components: data visualization, database support, and data embedding. Please refer to the Publications section for related references.
In the data visualization component, we have been developing a visualization tool that takes multi-resolution aggregated data as data input. Two interactive visualization techniques, density-based parallel coordinates and footprint splatting with grand tour, have been extended to support the rendering of data aggregated at multiple resolutions. The tool supports overview-and-drill-down of large relational data and allows users to interactively specify subsets of data for further visualization, possibly at more detailed resolutions. The visualization tool, with further development, can be used by industry and other agencies for scalable interactive data visualization and exploration. We have also developed techniques for pruning and visualizing frequent itemsets and many-to-many association rules. Future work in this component includes usability study and better GUI design.
In the database component, we have studied multi-resolution data aggregation as a common representation of data between database and visualization tools. Data aggregated at multiple resolutions are piggybacked onto internal nodes of a k-d-B tree. The k-d-B tree structure is extended to improve query performance and node fan-outs while keeping data aggregation information. We have conducted experiments on both synthetic and real world data sets. Performances of data access and index maintenance have been tested. Future work includes study of better indexing mechanism and data mining techniques using multi-resolution data aggregation as input.
In the data embedding component, a set of algorithms and methods are developed for building connected neighborhood graphs and for locally isometric data embedding. Incremental methods are developed to project large data sets and data streams.
This project is funded in part by the National Science Foundation. Li Yang leads the project as the principal investigator at Western Michigan University. Ph.D. students worked on the project include Mustafa Sanver, Dongfang Zhao, and Danyang Hua. Mustafa Sanver has been working on the data visualization and the database components. He has ported C codes into C++, implemented the dynamic visualization framework, and created the KDBMS database management system over the kdB-tree indexed structure. Dongfang Zhao worked on the data embedding component and was supported in the 2005-06 academic year. Danyang Hua is working on the database component and has been supported since the beginning of 2007.
- Better GUI and query interface design; documentation for end users and developers,
- Conducting user study,
- Performance experiments on large data sets up to a few terabytes,
- Data clustering and other data mining algorithms using multi-resolution data aggregation as data input,
- Ongoing data embedding research.
go to top