Funded by the National Science Foundation
PI: Vassilis J. Tsotras
Award Number: IIS-0910859
Duration: 08/15/2009 through 07/31/2013
This is a collaborative project with:
IIS-0910989, PI: Mike Carey, co-PI: Chen Li, University of California, Irvine
and
IIS-0910820, PI: Alin Deutsch, co-PI: Yannis Papakonstantinou, University of California, San Diego
AsterixDB Web Page:
https://asterixdb.apache.org/
Students:
Mariam Salloum
Md. Mahbub Hasan
Jarod Wen
Eldon Carman
Michael Rice
Project Summary:
Over the past 10-15 years, the evolution of the human
side of the Web (powered by HTML and HTTP) has revolutionized the way that most
of us find things, buy things, and interact with our friends and colleagues,
both within and across organizations. Behind the scenes, semistructured
data formats and Web services are having a similar impact on the machine side
of the Web. In semistructured data formats, of which
XML is the de facto standard, information normally contained in a database
schema or type definition is contained within the data, making it
self-describing. XML is enriching the information on the Web and our ability to
find it and interchange it meaningfully, as are RDF and JSON. Many industry
verticals have created XML-based standards to support inter-organization data
exchange and processes, and XML-based backbones such as enterprise service
buses (ESBs) have gained significant adoption in industry in support of
Service-Oriented Architecture (SOA) initiatives. XML is increasingly being used
for document markup as well, which was its original purpose, and the
Web-service-driven Software as a Service (SaaS) trend is changing the way that
many organizations will access and use large software applications in the
future. As a result, current indications are that the IT world will soon be
awash in a sea of semistructured data – much of it
XML data – and that semistructured data and services
will likely play an increasingly prominent role in the IT landscape for many
years to come.
In anticipation of the semistructured
information explosion, this proposal targets the problems of ingesting,
storing, indexing, processing, managing, and monitoring vast quantities of semistructured data, with the emphasis on vastness,
i.e., scale. The project involves challenges related to parallel databases, semistructured data management, and data-intensive
computing. To that end, the proposal brings together a team of five
researchers, drawn from three UC campuses, with expertise spanning structured, semistructured, and unstructured data.
Research Activities – Year 1
As part of our task to design the ASTERIX pub/sub
system, we examined parallelism during filtering. Publish/subscribe systems
represent the state of the art in information dissemination to multiple users.
Such systems have evolved from simple topic-based systems to the current
XML-based ones, which give users more flexibility by allowing complex queries
on both the content and the structure of the streaming messages. Messages that
match a given user query are forwarded to the user. As the amount of published
content continues to grow, current software-based systems will not scale.
Pub/sub systems can exploit parallelism to improve their performance. Instead
of the typical multicore parallelism, we considered filtering XML streams using
novel hardware architectures, like FPGAs and GPUs. FPGAs provide very high
throughput for sequential tasks by exploiting on-chip parallelism, and
filtering an XML stream falls in this category. This work led to two papers:
[1] addresses the problem of supporting simple XPath profiles in such a
filtering environment, while [2] handles the case where the user profiles are
complex twigs.
We also worked on a unified approach to three basic
problems in structural query processing, namely XML filtering, XML stream
processing (tuple extraction), and XML query processing. Previous approaches
were shown to be efficient for one or two of these problems, but were either
inefficient or unsuitable for the third. We instead proposed a unified approach
that yields efficient algorithms for all three problems: we represent queries
and XML documents using a sequential encoding, referred to as Node Encoded
Tree Sequences (NETS), and provide algorithms that address all three problems
efficiently over these sequences. This work has led to one paper under
submission [3].
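To make the idea of sequence-based matching concrete, below is a minimal Python sketch assuming a simple preorder (label, depth) encoding; the actual NETS encoding and algorithms in [3] are more general (they also handle twigs and tuple extraction), and all names here are illustrative. Per-step stacks track the depths at which each query step has matched so far, in the spirit of PathStack-style processing.

def encode(node, depth=0, seq=None):
    """Flatten a tree given as (label, [children]) into a preorder sequence."""
    if seq is None:
        seq = []
    label, children = node
    seq.append((label, depth))
    for child in children:
        encode(child, depth + 1, seq)
    return seq

def matches(seq, steps):
    """Match a linear path of (axis, label) steps, axis in {'/', '//'}."""
    stacks = [[] for _ in steps]           # per-step depths matched so far
    for label, depth in seq:
        for st in stacks:                  # discard matches from subtrees
            while st and st[-1] >= depth:  # that we have already left
                st.pop()
        for i in range(len(steps) - 1, -1, -1):
            axis, want = steps[i]
            if label != want:
                continue
            if i == 0:
                ok = depth == 0 if axis == '/' else True
            else:                          # check axis against prior step
                ok = any(depth == d + 1 if axis == '/' else depth > d
                         for d in stacks[i - 1])
            if ok:
                if i == len(steps) - 1:
                    return True            # last step matched: document hit
                stacks[i].append(depth)
    return False

doc = ('a', [('x', [('b', [('c', [])])]), ('b', [])])
print(matches(encode(doc), [('/', 'a'), ('//', 'b'), ('/', 'c')]))  # True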
Research Activities – Year 2
During the second year of the ASTERIX effort, we
concentrated on the following research activities:
(i) Query Result Diversification. Many database and information
retrieval applications have recently started to incorporate capabilities to
rank elements with respect to both relevance to the query and diversity,
i.e., the retrieved elements should be as relevant as possible to the
query, and, at the same time, the result set should be as diverse as possible.
This is very useful when a query returns many possible results, most of which
may be variations of the same result. Diversity enables the user to quickly
view a representative, diverse sample of a large result set. While addressing
relevance is relatively simple and has been heavily studied,
diversity is a harder problem to solve. We first presented a common framework,
where we adapted, implemented and evaluated several existing methods for
diversifying query results. Using this framework we presented the first
thorough experimental evaluation of the various diversification techniques [4].
We also proposed two new approaches, namely the Greedy with Marginal
Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE)
methods. Our experimental results show that, although the proposed methods have
higher running times, they achieve precision very close to the optimal, while
also providing the best result quality. While GMC is deterministic, the
randomized approach (GNE) can achieve better result quality if the user is
willing to trade off running time.
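As a concrete illustration of the greedy, marginal-contribution style of selection that GMC and GNE build on, here is a minimal Python sketch; the exact scoring in [4] and GNE's randomized neighborhood expansion are more involved, and rel, dist and the trade-off parameter lam below are illustrative.

def diversify(candidates, rel, dist, k, lam=0.5):
    """Pick k items, each time adding the candidate whose marginal gain
    (relevance vs. distance to the already-chosen set) is largest."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def gain(x):
            div = min((dist(x, y) for y in chosen), default=1.0)
            return lam * rel(x) + (1 - lam) * div
        best = max(pool, key=gain)
        pool.remove(best)
        chosen.append(best)
    return chosen

# toy usage: 1-D points; relevance decays with the value itself
pts = [0.1, 0.15, 0.9, 0.5, 0.55]
print(diversify(pts, rel=lambda x: 1 - x * x,
                dist=lambda x, y: abs(x - y), k=3))  # [0.1, 0.5, 0.15]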
We then examined how diversification can be applied on
queries over semistructured data [5]; the problem is
more difficult since the result contains documents and the diversity must now
be computed on the document structures as well. The tree edit distance is the
standard choice to measure the distance between two hierarchical structures,
but it is too expensive for large result sets. Moreover, the generalized tree
edit distance ignores the context of the query and also the document content,
resulting in poor diversification. We proposed a novel algorithm for meaningful
diversification that considers both the structural context of the query and the
content of the matched results while computing edit distance. Our algorithm is
an order of magnitude faster than the tree edit distance, with an elegant
worst-case guarantee. We also presented a novel algorithm to find the top-k
diverse subset of matches; it skips unnecessary distance computations and runs
in time linear in the size of the result set.
(ii) Parallel
Filtering of semistructured data using specialized
hardware. We continued our work on filtering XML data. Despite their very
high throughput, FPGAs require long update (reconfiguration) times, and their
on-chip resources are limited. We have thus also considered exploiting
the parallelism found in XPath filtering systems
using instead GPUs, which are favorable platforms due to the massive
parallelism found in their hardware architecture, alongside the flexibility and
programmability of software. By utilizing properties of the GPU memory
hierarchy we can match thousands of user profiles at high throughput, requiring
minimal update time. An extensive experimental evaluation showed an average speedup of 10x (up to 2.5 orders of magnitude) versus the
state-of-the-art software approaches [6].
(iii) Querying large semistructured data repositories. When querying a large XML data
source, a query may need to examine many documents for possible matches; the
user on the other hand prefers to get the results as soon as possible. We thus
consider two optimization problems, namely: optimal 'ordering' and 'selection'
of candidate documents for query answering. The first problem deals with
finding a sequence of documents within the data source which minimizes the time
to the first k matches (for some constant k which is less than the total number
of matches). The second problem deals with finding a subset of documents that
maximizes the expected number of document matches for a given upper bound on
total processing time.
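For illustration, assuming each candidate document has an estimated match probability p and processing time t, simple greedy heuristics for the two problems look as follows; these are natural baselines for exposition, not the exact algorithms we developed.

def order_for_early_matches(docs):
    """Process documents in decreasing p/t order, the classic scheduling
    rule for reaching the first matches quickly."""
    return sorted(docs, key=lambda d: d['p'] / d['t'], reverse=True)

def select_under_budget(docs, budget):
    """Greedily keep documents with the best expected matches per unit
    of processing time until the time budget is exhausted."""
    chosen, spent = [], 0.0
    for d in order_for_early_matches(docs):
        if spent + d['t'] <= budget:
            chosen.append(d)
            spent += d['t']
    return chosen

docs = [{'id': 1, 'p': 0.9, 't': 5.0},
        {'id': 2, 'p': 0.3, 't': 1.0},
        {'id': 3, 'p': 0.8, 't': 2.0}]
print([d['id'] for d in order_for_early_matches(docs)])   # [3, 2, 1]
print([d['id'] for d in select_under_budget(docs, 3.0)])  # [3, 2]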
In collaboration with AT&T and Google researchers, we
developed OASIS [7], an online query answering system for a set of independent
but overlapping sources, which is composed of four components/problems: source
selection, source overlap estimation, source ordering, and online source
probing. The source selection component chooses a subset of data sources that
are relevant to the user. The overlap estimation component utilizes partial
quantitative information in the form of probabilistic knowledge, such as information
about overlap between data sources and coverage of data sources, and uses a
maximum entropy (MaxEnt) framework. The source
ordering component orders the set of sources such that answers are delivered as
fast as possible while minimizing the cost metric. The source probing component
dynamically chooses the set of additional probabilistic constraints that can be
computed to improve the source ordering.
Research Activities – Year 3
During the third year of the ASTERIX effort [8], we
concentrated on the following research activities:
(i) Query Result Diversification. We continued our work on
diversifying query results. In particular, we concentrated on two problems: (a)
adaptive diversification, and (b) computing diverse results over a distributed
environment. The problem of query result diversification is modeled, in its
most general form, as a bi-criteria optimization problem, which uses a
trade-off parameter to tune the relative effect of relevance and diversity
factors during ranking. The use of a trade-off parameter is helpful in
tailoring the effect of diversity to specific query scenarios. For example, the
impact of the diversity factor can be increased for highly ambiguous queries so
as to include more diverse elements in the result set whereas for very specific
(non-ambiguous) queries, this factor can be decreased to prevent inclusion of
results of lesser relevance. Previous works on computing diverse results
balance relevance and diversity in a predefined, fixed way. However, this is
suboptimal, since for different search tasks there is a different ideal balance
of relevance and diversity. Further, this balance changes at each query step
(e.g., refine the query or view next page of results). We thus proposed a
principled method for adaptive diversification
[9] of query results that minimizes the user effort to find the desired
results, by dynamically balancing the relevance and diversity at each query
step. We introduced a navigation cost model and proved that estimating the
ideal amount of diversification is NP-hard. We further proposed efficient approximate
algorithms to select a near-optimal subset of the query results to output at
each step, to minimize the expected user effort. We experimentally evaluated
our algorithms and showed that they outperform state-of-the-art ranking
methods.
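For concreteness, one common bi-criteria objective from the diversification literature, written below in LaTeX, makes the role of the trade-off parameter explicit; the approach in [9] is built on a navigation cost model rather than this exact formula:

\[
f_\lambda(S) \;=\; (1-\lambda)\sum_{x \in S} \mathrm{rel}(x, q) \;+\; \lambda \sum_{\{x, y\} \subseteq S} d(x, y)
\]

Here S is a size-k result set for query q, rel measures relevance, d measures pairwise distance, and \lambda \in [0, 1] tunes the balance: \lambda = 0 ranks purely by relevance and \lambda = 1 purely by diversity. Adaptive diversification amounts to re-estimating \lambda at each query step instead of fixing it in advance.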
Previous works on diversification have concentrated on the uniprocessor case.
With the advance of MapReduce-based environments, it is highly probable that
the data (over which we want to compute diversity) reside on different nodes.
It is thus important to be able to compute diversification in a distributed
environment. The difficulty of the problem emanates from the fact that optimal
top-k diversification requires identifying the k most diverse results (with
respect to a score function); this is an NP-hard problem since one has to
compare the scores of all result pairs. Using the MapReduce
framework we considered two distinct approaches to solve the distributed
diversification problem, one that focuses on optimizing disk I/O and one that
optimizes for network I/O [10]. Our approaches are iterative in nature,
allowing the user to continue refining the diversification process if more time
is available. We also developed a cost model to predict the run-time for both
approaches based on the network and disk characteristics. We implemented our
approaches on a cluster of 40 cores and showed that they are scalable and
produce the same quality results as the state-of-the-art uniprocessor
algorithms.
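A minimal sketch of the composable partition-then-merge pattern that makes diversification amenable to MapReduce follows; the actual algorithms in [10] additionally optimize disk or network I/O and iterate to refine the answer, and the greedy max-min selection below is a common baseline, not our exact method.

def greedy_max_min(items, dist, k):
    """k-center style greedy: start anywhere, then repeatedly add the
    item farthest from the current selection."""
    sel = [items[0]]
    while len(sel) < min(k, len(items)):
        sel.append(max((x for x in items if x not in sel),
                       key=lambda x: min(dist(x, y) for y in sel)))
    return sel

def distributed_diversify(partitions, dist, k):
    # "map": each node diversifies its own partition down to k candidates
    candidates = [x for p in partitions for x in greedy_max_min(p, dist, k)]
    # "reduce": merge the candidates and diversify once more
    return greedy_max_min(candidates, dist, k)

parts = [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(9, 0)]]
d = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(distributed_diversify(parts, d, 3))  # [(0, 0), (9, 0), (5, 6)]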
(ii) Implementing Temporal Support for ASTERIX. During this year we
implemented various features for temporal data management. Both instant types
(Time, Date and Datetime, with time-zone support) and range types (Interval
and Duration) are built in. Temporal data can be created either through
insertion/update queries or by importing a data set containing temporal types.
We also implemented useful temporal features, such as temporal arithmetic
operations and all of Allen's interval relations, so that users can write
native temporal queries using user-friendly temporal constructs instead of
error-prone type casting and complex condition checks.
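To make the interval support concrete, the sketch below classifies two closed intervals (s, e) with s < e into one of Allen's thirteen relations; the function and interval representation are illustrative, not the AsterixDB API.

def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return 'before'
    if e2 < s1:  return 'after'
    if e1 == s2: return 'meets'
    if e2 == s1: return 'met-by'
    if s1 == s2 and e1 == e2: return 'equals'
    if s1 == s2: return 'starts' if e1 < e2 else 'started-by'
    if e1 == e2: return 'finishes' if s1 > s2 else 'finished-by'
    if s2 < s1 and e1 < e2: return 'during'
    if s1 < s2 and e2 < e1: return 'contains'
    # remaining case: proper overlap on one side or the other
    return 'overlaps' if s1 < s2 else 'overlapped-by'

print(allen_relation((1, 5), (5, 9)))  # meets
print(allen_relation((1, 9), (3, 5)))  # contains
print(allen_relation((1, 6), (4, 9)))  # overlaps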
Furthermore, we examined two research problems with
respect to temporal support, namely: (a) computing top-k keyword searches over
versioned data, and (b) temporal querying over branched versions. Most web
search engines will answer a query by ranking all known documents at the
(current) time the query is posed. There are applications however (for example
customer behavior analysis, crime investigation, etc.) that would need to
efficiently query these sources as of some past time, that is, retrieve the
results as if the user were posing the query at a past time instant, thus
accessing only data known as of that time. Top-k ranking and searching over
versioned documents considers not only keyword constraints but also the time
dimension, most commonly, a time point or time range of interest. We proposed
[11] a novel data organization and two indexing solutions: the first one
partitions data along ranking positions, while the other maintains the full
ranking order through the use of a multiversion
ordered list. We presented an experimental comparison for both time point and
time-interval constraints. For time-interval constraints, different query
definitions, such as aggregation functions and consistent top-k queries, were
evaluated. Experimental evaluations on large real-world datasets demonstrated
the advantages of the newly proposed data organization and indexing approaches.
Transaction-time databases have been proposed for
storing and querying the history of a database. While past work concentrated on
managing the data evolution assuming a static schema, recent research has
considered data changes under a linearly evolving schema. An ordered sequence
of schema versions is maintained and the database can restore/query its data
under the appropriate past schema. There are however many applications leading
to a branched schema evolution where data can evolve in parallel, under
different concurrent schemas. In [12] we considered the issues involved in
managing the history of a database that follows a branched schema evolution. To
maintain easy access to any past schema, we used an XML-based approach with an
optimized sharing strategy. As for accessing the data, we explored branched
temporal indexing techniques and presented efficient algorithms for evaluating
two important queries made possible by our novel branching environment: the
vertical historical query and the horizontal historical query. Moreover, we
showed that our methods can support branched schema evolution that allows
version merging. Experimental evaluations showed the efficiency of our storing,
indexing, and query processing methodologies.
(iii) Other Research. We also worked on identifying
spatiotemporal bursts over document streams. Currently, thousands of documents
are made available to users via the web or microblogs on a daily basis.
Given a term t, a burst is generally exhibited when an unusually high frequency
is observed for term t. The problem of burst identification has been
studied either in the temporal domain or in the spatial domain. In [13] we
presented the first work to simultaneously track and measure spatiotemporal
term burstiness. We proposed two alternative
approaches for mining spatiotemporal burstiness
patterns, STComb and STLocal.
The two approaches are complementary, providing valuable insight on
spatiotemporal burstiness from different
perspectives. In addition, we used the mined burstiness
information toward an efficient document-search engine: given a user’s query of
terms, our engine returns a ranked list of documents discussing influential
events with a strong spatiotemporal impact. We demonstrated the efficiency of
our methods with an extensive experimental evaluation on real and synthetic
datasets.
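To illustrate the underlying notion of burstiness, the toy sketch below flags (region, time-window) cells where a term's observed frequency far exceeds its expected rate; the STComb and STLocal algorithms of [13] are considerably more sophisticated, and the names and threshold here are illustrative.

from collections import defaultdict

def bursty_cells(events, expected_rate, threshold=3.0):
    """events: iterable of (region, window) occurrences of one term;
    expected_rate: expected count per cell. Returns cells whose
    observed/expected ratio meets the threshold."""
    counts = defaultdict(int)
    for region, window in events:
        counts[(region, window)] += 1
    return [(cell, n / expected_rate)
            for cell, n in counts.items() if n / expected_rate >= threshold]

events = [('NYC', 1)] * 9 + [('NYC', 2)] * 2 + [('LA', 1)] * 3
print(bursty_cells(events, expected_rate=2.0))  # [(('NYC', 1), 4.5)]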
Research Activities – Year 4
During the fourth (no-cost extension) year of the
ASTERIX effort, the work at UCR focused on the following major activities:
(i) completing the temporal support for AsterixDB, (ii) providing index
support and parallel execution for VXQuery on top of Hyracks, and
(iii) implementing group-by aggregation at both the local and the global level.
(i) Implementing Temporal Support for ASTERIX. This year we further extended
the feature set on top of the existing temporal implementations. Since all
temporal types are natively supported in AsterixDB, we extended our indexing
framework to enable index structures over comparable temporal types (date,
time, datetime, yearMonthDuration and dayTimeDuration). We also added
time-window support, so that users can create tumbling temporal windows and
apply group-by operations over the items within each window, as well as a
flexible temporal data parser, so that non-standard temporal data can be
parsed and imported into AsterixDB using the native support.
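The sketch below illustrates the semantics of tumbling-window group-by; the AsterixDB surface syntax differs, and this only shows the windowing logic.

from collections import defaultdict
from datetime import datetime, timedelta

def tumbling_groupby(records, key, start, width):
    """Assign each record to the tumbling window
    [start + i*width, start + (i+1)*width) containing its timestamp."""
    groups = defaultdict(list)
    for rec in records:
        i = int((key(rec) - start) / width)
        groups[start + i * width].append(rec)
    return dict(groups)

recs = [{'t': datetime(2013, 7, 1, 10, 5), 'v': 1},
        {'t': datetime(2013, 7, 1, 10, 20), 'v': 2},
        {'t': datetime(2013, 7, 1, 10, 50), 'v': 3}]
wins = tumbling_groupby(recs, key=lambda r: r['t'],
                        start=datetime(2013, 7, 1, 10, 0),
                        width=timedelta(minutes=30))
print({w.time(): [r['v'] for r in rs] for w, rs in wins.items()})
# {datetime.time(10, 0): [1, 2], datetime.time(10, 30): [3]}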
(ii) VXQuery implementation on top of Asterix's Hyracks engine. We developed (through the Apache
Foundation's VXQuery project as part of Google's
Summer of Code) an XQuery engine that runs in a distributed environment by
building it on top of the ASTERIX stack. First, the XQuery engine was developed
so as to process queries in the new stack. Many XQuery functions were needed to
link the distributed layer with the XQuery interface. The VXQuery
software now supports basic queries in a single process. Further, we
incorporated Apache Lucene indexing on top of VXQuery. Steven Jacobs (PhD student at UCR) created an
alternate version of the Asterix collection function
that uses the Lucene store. Experimental evaluations showed execution times as
low as 10% of those of a straightforward approach that simply adds metadata to
each XML file with path information in a Dewey-decimal manner. We also
provided an efficient parallelization
for VXQuery. Instead of starting from scratch, Eldon
Carman (PhD student at UCR) worked on establishing various rewrite rules that
enable Hyracks (through Algebricks)
to optimize a VXQuery query and thus take advantage of parallelization. In
particular, we worked on rewrite rules for
various basic Algebricks operators (like AGGREGATE,
ASSIGN, DATASCAN, NEST, UNNEST and SUBPLAN).
(iii) Local and Global
Group-By Aggregation. Aggregation has been an important operation since the early days of
relational databases. Today's Big Data applications bring further challenges
when processing aggregation queries, demanding adaptive aggregation algorithms
that can process large volumes of data relative to a potentially limited memory
budget (especially in multiuser settings). Despite its importance, the design
and evaluation of aggregation algorithms has not received the same attention
that other basic operators, such as joins, have received in the literature. As
a result, when considering which aggregation algorithm(s) to implement in AsterixDB, we faced a lack of "off-the-shelf" answers
that we could simply read about and then implement based on prior performance
studies. We thus first revisited the engineering of efficient local aggregation
algorithms for use in Big Data platforms. Efficient implementation of the
aggregation operator for a Big Data platform is non-trivial; many
factors, including memory usage, spilling strategy, and I/O and CPU costs,
should be considered. Further, we introduced precise cost models that can help
in choosing an appropriate algorithm based on input parameters including memory
budget, grouping key cardinality, and data skew. We proposed [14] two new
algorithm variants, the Hash-Sort algorithm and the Pre-Partitioning
algorithm, which are now included in the latest release of AsterixDB,
together with the classic sort-merge algorithm (due to its good performance
when aggregating sorted data). Further, we extended [15] the local aggregation
work to a clustered environment (global aggregation), where additional
factors, such as per-machine workload balancing and network costs, are
considered.
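The core memory-bounded idea behind these algorithms can be sketched as follows: aggregate into an in-memory hash table and spill partial results whenever the memory budget is exceeded. This is only the shared skeleton; the actual Hash-Sort and Pre-Partitioning algorithms in [14] differ in how spilled runs are organized and merged.

def hash_aggregate(records, max_groups, spill):
    """Aggregate (key, value) pairs in an in-memory hash table; when the
    table exceeds the group budget, flush it via spill() and start fresh.
    Spilled partial aggregates for the same key must be merged later."""
    table = {}
    for key, value in records:
        if key not in table and len(table) >= max_groups:
            spill(table)          # memory full: flush partial aggregates
            table = {}
        table[key] = table.get(key, 0) + value
    spill(table)                  # flush the final partial aggregates

# toy run: a 2-group memory budget forces one spill
runs = []
hash_aggregate([('a', 1), ('b', 2), ('c', 3), ('a', 4)],
               max_groups=2, spill=runs.append)
print(runs)  # [{'a': 1, 'b': 2}, {'c': 3, 'a': 4}]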
(iv) Other Research. We created a demo of STEM (Spatio-Temporal Miner), a system for finding spatiotemporal
burstiness patterns in a collection of spatially
distributed frequency streams [16]. STEM implements the full functionality
required to mine spatiotemporal burstiness patterns
from virtually any collection of geostamped streams.
Examples of such collections include document streams (e.g., online
newspapers) and geo-aware microblogging platforms (e.g., Twitter). We
continued our work on
filtering using specialized hardware. We first considered complex pattern
trajectory queries (described as regular expressions over a spatial alphabet
that can be implicitly or explicitly anchored to the time domain). In addition,
our pattern queries may contain variables (which substantially enhance their
flexibility and expressive power). In [17] (best paper award) we showed how to perform
such filtering using an FPGA architecture. Our
experimental results showed that the FPGA approach outperformed the
state-of-the-art CPU-based approach by over three orders of magnitude. Next, in
[18] we showed how to perform holistic (no post-processing) evaluation of
thousands of complex twig-style XPath queries in a
streaming (single-pass) fashion, using GPUs. Our XML filtering results using
specialized hardware are summarized in [19].
Publications:
[1] Roger Moussalli, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Accelerating XML Query Matching through Custom Stack Generation on FPGAs. HiPEAC 2010: 141-155.
[2] Roger Moussalli, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Massively Parallel XML Twig Filtering Using Dynamic Programming on FPGAs. ICDE 2011: 948-959.
[3] Mariam Salloum, Vassilis J. Tsotras: A Unified Approach for Structural XML Processing. Under submission.
[4] Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On Query Result Diversification. ICDE 2011: 1163-1174.
[5] Mahbub Hasan, Abdullah Mueen, Vassilis J. Tsotras, Eamonn J. Keogh: Diversifying Query Results on Semi-Structured Data. CIKM 2012: 2099-2103.
[6] Roger Moussalli, Robert Halstead, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Efficient XML Path Filtering Using GPUs. ADMS@VLDB 2011: 9-18.
[7] Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras: Online Ordering of Overlapping Data Sources. PVLDB 7(3): 133-144 (2013).
[8] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras: ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models. Distributed and Parallel Databases 29(3): 185-216 (2011).
[9] M. Hasan, A. Kashyap, V. Hristidis, V. J. Tsotras: Adaptive Diversification of Query Results. Submitted for publication (2013).
[10] Mahbub Hasan, Abdullah Mueen, Vassilis J. Tsotras: Distributed Diversification of Large Datasets. IEEE International Conference on Cloud Engineering (IC2E 2014), Boston, MA, March 2014, to appear.
[11] Wenyu Huo, Vassilis J. Tsotras: A Comparison of Top-k Temporal Keyword Querying over Versioned Text Collections. DEXA (2) 2012: 360-374.
[12] Wenyu Huo, Vassilis J. Tsotras: Querying Transaction-Time Databases under Branched Schema Evolution. DEXA (1) 2012: 265-280.
[13] Theodoros Lappas, Marcos R. Vieira, Dimitrios Gunopulos, Vassilis J. Tsotras: On the Spatiotemporal Burstiness of Terms. PVLDB 5(9): 836-847 (2012).
[14] Jian Wen, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras: Revisiting Aggregation for Data Intensive Applications: A Performance Study. Submitted for publication (2013).
[15] Jian Wen: Revisiting Aggregation Techniques for Data Intensive Applications. PhD Thesis, UC Riverside, 2013.
[16] Theodoros Lappas, Marcos R. Vieira, Dimitrios Gunopulos, Vassilis J. Tsotras: STEM: A Spatio-Temporal Miner for Bursty Activity. SIGMOD Conference 2013: 1021-1024.
[17] Roger Moussalli, Marcos R. Vieira, Walid A. Najjar, Vassilis J. Tsotras: Stream-Mode FPGA Acceleration of Complex Pattern Trajectory Querying. SSTD 2013: 201-222.
[18] Ildar Absalyamov, Roger Moussalli, Vassilis J. Tsotras, Walid A. Najjar: High-Performance XML Twig Filtering Using GPUs. ADMS@VLDB 2013: 13-24.
[19] Roger Moussalli, Mariam Salloum, Robert Halstead, Walid A. Najjar, Vassilis J. Tsotras: A Study on Parallelizing XML Path Filtering Using Accelerators. ACM Transactions on Embedded Computing Systems (TECS), accepted for publication (2013).