Funded by the National Science Foundation
PI: Vassilis J. Tsotras
Award Number: IIS-0910859
Duration: 08/15/2009 through 07/31/2013
This is a collaborative project with:
IIS-0910989, PI: Mike Carey, co-PI: Chen Li, University of California, Irvine
and
IIS-0910820, PI: Alin Deutsch, co-PI: Yannis Papakonstantinou, University of California, San Diego
AsterixDB Web Page:
https://asterixdb.apache.org/
Students:
Mariam Salloum
Md. Mahbub Hasan
Jarod Wen
Eldon Carman
Michael Rice
Project Summary:
Over the past 10-15 years, the evolution of the human
side of the Web (powered by HTML and HTTP) has revolutionized the way that most
of us find things, buy things, and interact with our friends and colleagues,
both within and across organizations. Behind the scenes, semistructured
data formats and Web services are having a similar impact on the machine side
of the Web. In semistructured data formats, of which
XML is the de facto standard, information normally contained in a database
schema or type definition is contained within the data, making it
self-describing. XML is enriching the information on the Web and our ability to
find it and interchange it meaningfully, as are RDF and JSON. Many industry
verticals have created XML-based standards to support inter-organization data
exchange and processes, and XML-based backbones such as enterprise service
buses (ESBs) have gained significant adoption in industry in support of
Service-Oriented Architecture (SOA) initiatives. XML is increasingly being used
for document markup as well, which was its original purpose, and the
Web-service-driven Software as a Service (SaaS) trend is changing the way that
many organizations will access and use large software applications in the
future. As a result, current indications are that the IT world will soon be
awash in a sea of semistructured data – much of it
XML data – and that semistructured data and services
will likely play an increasingly prominent role in the IT landscape for many
years to come.
In anticipation of the semistructured
information explosion, this proposal targets the problems of ingesting,
storing, indexing, processing, managing, and monitoring vast quantities of semistructured data, with the emphasis on vastness,
i.e., scale. The project involves challenges related to parallel databases, semistructured data management, and data-intensive
computing. To that end, the proposal brings together a team of five
researchers, drawn from three UC campuses, with expertise spanning structured, semistructured, and unstructured data.
Research Activities – Year 1
As part of our task to design the ASTERIX pub/sub
system, we examined parallelism during filtering. Publish/subscribe systems
represent the state of the art in information dissemination to multiple users.
Such systems have evolved from simple topic-based systems to the current
XML-based ones, which give users more flexibility by allowing complex queries
on both the content and the structure of the streaming messages. Messages that
match a given user query are forwarded to the user. As the amount of published
content continues to grow, current software-based systems will not scale.
Pub/sub systems can exploit parallelism to improve their performance. Instead
of the typical multicore parallelism, we considered filtering XML streams using
novel hardware architectures, like FPGAs and GPUs. FPGAs provide very high
throughput for sequential tasks by exploiting on-chip parallelism, and
filtering an XML stream falls in this category. This work led to two papers:
[1] addresses the problem of supporting simple XPath profiles in such a
filtering environment, while [2] handles the case where the user profiles are
complex twigs.
We also worked on a unified approach to three basic
problems in structural query processing, namely XML filtering, XML stream
processing (tuple extraction), and XML query processing. Previous approaches
were shown to be efficient for one or two of these problems, but were either
inefficient or unsuitable for the third. We instead proposed a unified approach
that yields efficient algorithms for all three problems: we represent queries
and XML documents using a sequential encoding, referred to as Node Encoded
Tree Sequences (NETS), and provide algorithms that address all three problems
efficiently over these sequences. This work has led to one paper under
submission [3].
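To make the idea of sequence-based matching concrete, below is a minimal Python sketch assuming a simple preorder (label, depth) encoding; the actual NETS encoding and algorithms in [3] are more general (they also handle twigs and tuple extraction), and all names here are illustrative. Per-step stacks track the depths at which each query step has matched so far, in the spirit of PathStack-style processing.

def encode(node, depth=0, seq=None):
    """Flatten a tree given as (label, [children]) into a preorder sequence."""
    if seq is None:
        seq = []
    label, children = node
    seq.append((label, depth))
    for child in children:
        encode(child, depth + 1, seq)
    return seq

def matches(seq, steps):
    """Match a linear path of (axis, label) steps, axis in {'/', '//'}."""
    stacks = [[] for _ in steps]           # per-step depths matched so far
    for label, depth in seq:
        for st in stacks:                  # discard matches from subtrees
            while st and st[-1] >= depth:  # that we have already left
                st.pop()
        for i in range(len(steps) - 1, -1, -1):
            axis, want = steps[i]
            if label != want:
                continue
            if i == 0:
                ok = depth == 0 if axis == '/' else True
            else:                          # check axis against prior step
                ok = any(depth == d + 1 if axis == '/' else depth > d
                         for d in stacks[i - 1])
            if ok:
                if i == len(steps) - 1:
                    return True            # last step matched: document hit
                stacks[i].append(depth)
    return False

doc = ('a', [('x', [('b', [('c', [])])]), ('b', [])])
print(matches(encode(doc), [('/', 'a'), ('//', 'b'), ('/', 'c')]))  # True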
Research Activities – Year 2
During the second year of the ASTERIX effort, we
concentrated on the following research activities:
(i) Query Result Diversification. Many database and information
retrieval applications have recently started to incorporate capabilities to
rank elements with respect to both relevance to the query and diversity,
i.e., the retrieved elements should be as relevant as possible to the
query, and, at the same time, the result set should be as diverse as possible.
This is very useful when a query returns many possible results, most of which
may be variations of the same result. Diversity enables the user to quickly
view a representative, diverse sample of a large result set. While addressing
relevance is relatively simple and has been heavily studied,
diversity is a harder problem to solve. We first presented a common framework,
where we adapted, implemented and evaluated several existing methods for
diversifying query results. Using this framework we presented the first
thorough experimental evaluation of the various diversification techniques [4].
We also proposed two new approaches, namely the Greedy with Marginal
Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE)
methods. Our experimental results show that, although the proposed methods have
higher running times, they achieve precision very close to the optimal, while
also providing the best result quality. While GMC is deterministic, the
randomized approach (GNE) can achieve better result quality if the user is
willing to trade off running time.
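As a concrete illustration of the greedy, marginal-contribution style of selection that GMC and GNE build on, here is a minimal Python sketch; the exact scoring in [4] and GNE's randomized neighborhood expansion are more involved, and rel, dist and the trade-off parameter lam below are illustrative.

def diversify(candidates, rel, dist, k, lam=0.5):
    """Pick k items, each time adding the candidate whose marginal gain
    (relevance vs. distance to the already-chosen set) is largest."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def gain(x):
            div = min((dist(x, y) for y in chosen), default=1.0)
            return lam * rel(x) + (1 - lam) * div
        best = max(pool, key=gain)
        pool.remove(best)
        chosen.append(best)
    return chosen

# toy usage: 1-D points; relevance decays with the value itself
pts = [0.1, 0.15, 0.9, 0.5, 0.55]
print(diversify(pts, rel=lambda x: 1 - x * x,
                dist=lambda x, y: abs(x - y), k=3))  # [0.1, 0.5, 0.15]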
We then examined how diversification can be applied on
queries over semistructured data [5]; the problem is
more difficult since the result contains documents and the diversity must now
be computed on the document structures as well. The tree edit distance is the
standard choice to measure the distance between two hierarchical structures,
but it is too expensive for large result sets. Moreover, the generalized tree
edit distance ignores the context of the query and also the document content,
resulting in poor diversification. We proposed a novel algorithm for meaningful
diversification that considers both the structural context of the query and the
content of the matched results while computing edit distance. Our algorithm is
an order of magnitude faster than the tree edit distance, with an elegant
worst-case guarantee. We also presented a novel algorithm to find the top-k
diverse subset of matches; it skips unnecessary distance computations and runs
in time linear in the size of the result set.
(ii) Parallel
Filtering of semistructured data using specialized
hardware. We continued our work on filtering XML data. Despite their very
high throughput, FPGAs require long update (reconfiguration) times, and their
on-chip resources are limited. We have thus also considered exploiting
the parallelism found in XPath filtering systems
using instead GPUs, which are favorable platforms due to the massive
parallelism found in their hardware architecture, alongside the flexibility and
programmability of software. By utilizing properties of the GPU memory
hierarchy we can match thousands of user profiles at high throughput, requiring
minimal update time. An extensive experimental evaluation showed an average speedup of 10x (up to 2.5 orders of magnitude) versus the
state-of-the-art software approaches [6].
(iii) Querying large semistructured data repositories. When querying a large XML data
source, a query may need to examine many documents for possible matches; the
user on the other hand prefers to get the results as soon as possible. We thus
consider two optimization problems, namely: optimal 'ordering' and 'selection'
of candidate documents for query answering. The first problem deals with
finding a sequence of documents within the data source which minimizes the time
to the first k matches (for some constant k which is less than the total number
of matches). The second problem deals with finding a subset of documents that
maximizes the expected number of document matches for a given upper bound on
total processing time.
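For illustration, assuming each candidate document has an estimated match probability p and processing time t, simple greedy heuristics for the two problems look as follows; these are natural baselines for exposition, not the exact algorithms we developed.

def order_for_early_matches(docs):
    """Process documents in decreasing p/t order, the classic scheduling
    rule for reaching the first matches quickly."""
    return sorted(docs, key=lambda d: d['p'] / d['t'], reverse=True)

def select_under_budget(docs, budget):
    """Greedily keep documents with the best expected matches per unit
    of processing time until the time budget is exhausted."""
    chosen, spent = [], 0.0
    for d in order_for_early_matches(docs):
        if spent + d['t'] <= budget:
            chosen.append(d)
            spent += d['t']
    return chosen

docs = [{'id': 1, 'p': 0.9, 't': 5.0},
        {'id': 2, 'p': 0.3, 't': 1.0},
        {'id': 3, 'p': 0.8, 't': 2.0}]
print([d['id'] for d in order_for_early_matches(docs)])   # [3, 2, 1]
print([d['id'] for d in select_under_budget(docs, 3.0)])  # [3, 2]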
In collaboration with AT&T and Google researchers, we
developed OASIS [7], an online query answering system for a set of independent
but overlapping sources, which is composed of four components/problems: source
selection, source overlap estimation, source ordering, and online source
probing. The source selection component chooses a subset of data sources that
are relevant to the user. The overlap estimation component utilizes partial
quantitative information in the form of probabilistic knowledge, such as information
about overlap between data sources and coverage of data sources, and uses a
maximum entropy (MaxEnt) framework. The source
ordering component orders the set of sources such that answers are delivered as
fast as possible while minimizing the cost metric. The source probing component
dynamically chooses the set of additional probabilistic constraints that can be
computed to improve the source ordering.
Research Activities – Year 3
During the third year of the ASTERIX effort [8], we
concentrated on the following research activities:
(i) Query Result Diversification. We continued our work on
diversifying query results. In particular, we concentrated on two problems: (a)
adaptive diversification, and (b) computing diverse results over a distributed
environment. The problem of query result diversification is modeled, in its
most general form, as a bi-criteria optimization problem, which uses a
trade-off parameter to tune the relative effect of relevance and diversity
factors during ranking. The use of a trade-off parameter is helpful in
tailoring the effect of diversity to specific query scenarios. For example, the
impact of the diversity factor can be increased for highly ambiguous queries so
as to include more diverse elements in the result set whereas for very specific
(non-ambiguous) queries, this factor can be decreased to prevent inclusion of
results of lesser relevance. Previous works on computing diverse results
balance relevance and diversity in a predefined, fixed way. However, this is
suboptimal, since for different search tasks there is a different ideal balance
of relevance and diversity. Further, this balance changes at each query step
(e.g., refine the query or view next page of results). We thus proposed a
principled method for adaptive diversification
[9] of query results that minimizes the user effort to find the desired
results, by dynamically balancing the relevance and diversity at each query
step. We introduced a navigation cost model and proved that estimating the
ideal amount of diversification is NP-hard. We further proposed efficient approximate
algorithms to select a near-optimal subset of the query results to output at
each step, to minimize the expected user effort. We experimentally evaluated
our algorithms and showed that they outperform state-of-the-art ranking
methods.
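For concreteness, one common bi-criteria objective from the diversification literature, written below in LaTeX, makes the role of the trade-off parameter explicit; the approach in [9] is built on a navigation cost model rather than this exact formula:

\[
f_\lambda(S) \;=\; (1-\lambda)\sum_{x \in S} \mathrm{rel}(x, q) \;+\; \lambda \sum_{\{x, y\} \subseteq S} d(x, y)
\]

Here S is a size-k result set for query q, rel measures relevance, d measures pairwise distance, and \lambda \in [0, 1] tunes the balance: \lambda = 0 ranks purely by relevance and \lambda = 1 purely by diversity. Adaptive diversification amounts to re-estimating \lambda at each query step instead of fixing it in advance.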
Previous works on diversification have concentrated on the uniprocessor case.
With the advance of MapReduce-based environments, it is highly probable that
the data (over which we want to compute diversity) reside on different nodes.
It is thus important to be able to compute diversification in a distributed
environment. The difficulty of the problem emanates from the fact that optimal
top-k diversification requires identifying the k most diverse results (with
respect to a score function); this is an NP-hard problem since one has to
compare the scores of all result pairs. Using the MapReduce
framework we considered two distinct approaches to solve the distributed
diversification problem, one that focuses on optimizing disk I/O and one that
optimizes for network I/O [10]. Our approaches are iterative in nature,
allowing the user to continue refining the diversification process if more time
is available. We also developed a cost model to predict the run-time for both
approaches based on the network and disk characteristics. We implemented our
approaches on a cluster of 40 cores and showed that they are scalable and
produce the same quality results as the state-of-the-art uniprocessor
algorithms.
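A minimal sketch of the composable partition-then-merge pattern that makes diversification amenable to MapReduce follows; the actual algorithms in [10] additionally optimize disk or network I/O and iterate to refine the answer, and the greedy max-min selection below is a common baseline, not our exact method.

def greedy_max_min(items, dist, k):
    """k-center style greedy: start anywhere, then repeatedly add the
    item farthest from the current selection."""
    sel = [items[0]]
    while len(sel) < min(k, len(items)):
        sel.append(max((x for x in items if x not in sel),
                       key=lambda x: min(dist(x, y) for y in sel)))
    return sel

def distributed_diversify(partitions, dist, k):
    # "map": each node diversifies its own partition down to k candidates
    candidates = [x for p in partitions for x in greedy_max_min(p, dist, k)]
    # "reduce": merge the candidates and diversify once more
    return greedy_max_min(candidates, dist, k)

parts = [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(9, 0)]]
d = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(distributed_diversify(parts, d, 3))  # [(0, 0), (9, 0), (5, 6)]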
(ii) Implementing Temporal Support for ASTERIX. During this year we
implemented various features for temporal data management. Both instant types
(Time, Date and Datetime, with time-zone support) and range types (Interval
and Duration) are built in. Temporal data can be created either through
insertion/update queries or by importing a data set containing temporal types.
We also implemented useful temporal features, such as temporal arithmetic
operations and all of Allen's interval relations, so that users can write
native temporal queries using user-friendly temporal constructs instead of
error-prone type casting and complex condition checks.
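To make the interval support concrete, the sketch below classifies two closed intervals (s, e) with s < e into one of Allen's thirteen relations; the function and interval representation are illustrative, not the AsterixDB API.

def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return 'before'
    if e2 < s1:  return 'after'
    if e1 == s2: return 'meets'
    if e2 == s1: return 'met-by'
    if s1 == s2 and e1 == e2: return 'equals'
    if s1 == s2: return 'starts' if e1 < e2 else 'started-by'
    if e1 == e2: return 'finishes' if s1 > s2 else 'finished-by'
    if s2 < s1 and e1 < e2: return 'during'
    if s1 < s2 and e2 < e1: return 'contains'
    # remaining case: proper overlap on one side or the other
    return 'overlaps' if s1 < s2 else 'overlapped-by'

print(allen_relation((1, 5), (5, 9)))  # meets
print(allen_relation((1, 9), (3, 5)))  # contains
print(allen_relation((1, 6), (4, 9)))  # overlaps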
Furthermore, we examined two research problems with
respect to temporal support, namely: (a) computing top-k keyword searches over
versioned data, and (b) temporal querying over branched versions. Most web
search engines will answer a query by ranking all known documents at the
(current) time the query is posed. There are applications however (for example
customer behavior analysis, crime investigation, etc.) that would need to
efficiently query these sources as of some past time, that is, retrieve the
results as if the user were posing the query at a past time instant, thus
accessing only data known as of that time. Top-k ranking and searching over
versioned documents considers not only keyword constraints but also the time
dimension, most commonly, a time point or time range of interest. We proposed
[11] a novel data organization and two indexing solutions: the first one
partitions data along ranking positions, while the other maintains the full
ranking order through the use of a multiversion
ordered list. We presented an experimental comparison for both time point and
time-interval constraints. For time-interval constraints, different query
definitions, such as aggregation functions and consistent top-k queries, were
evaluated. Experimental evaluations on large real-world datasets demonstrated
the advantages of the newly proposed data organization and indexing approaches.
Transaction-time databases have been proposed for
storing and querying the history of a database. While past work concentrated on
managing the data evolution assuming a static schema, recent research has
considered data changes under a linearly evolving schema. An ordered sequence
of schema versions is maintained and the database can restore/query its data
under the appropriate past schema. There are however many applications leading
to a branched schema evolution where data can evolve in parallel, under
different concurrent schemas. In [12] we considered the issues involved in
managing the history of a database that follows a branched schema evolution. To
maintain easy access to any past schema, we used an XML-based approach with an
optimized sharing strategy. As for accessing the data, we explored branched
temporal indexing techniques and presented efficient algorithms for evaluating
two important queries made possible by our novel branching environment: the
vertical historical query and the horizontal historical query. Moreover, we
showed that our methods can support branched schema evolution that allows
version merging. Experimental evaluations showed the efficiency of our storing,
indexing, and query processing methodologies.
(iii) Other Research. We also worked on identifying
spatiotemporal bursts over document streams. Currently, thousands of documents
are made available to users via the web or microblogs on a daily basis.
Given a term t, a burst is generally exhibited when an unusually high frequency
is observed for term t. The problem of burst identification has been
studied either in the temporal domain or in the spatial domain. In [13] we
presented the first work to simultaneously track and measure spatiotemporal
term burstiness. We proposed two alternative
approaches for mining spatiotemporal burstiness
patterns, STComb and STLocal.
The two approaches are complementary, providing valuable insight on
spatiotemporal burstiness from different
perspectives. In addition, we used the mined burstiness
information toward an efficient document-search engine: given a user’s query of
terms, our engine returns a ranked list of documents discussing influential
events with a strong spatiotemporal impact. We demonstrated the efficiency of
our methods with an extensive experimental evaluation on real and synthetic
datasets.
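To illustrate the underlying notion of burstiness, the toy sketch below flags (region, time-window) cells where a term's observed frequency far exceeds its expected rate; the STComb and STLocal algorithms of [13] are considerably more sophisticated, and the names and threshold here are illustrative.

from collections import defaultdict

def bursty_cells(events, expected_rate, threshold=3.0):
    """events: iterable of (region, window) occurrences of one term;
    expected_rate: expected count per cell. Returns cells whose
    observed/expected ratio meets the threshold."""
    counts = defaultdict(int)
    for region, window in events:
        counts[(region, window)] += 1
    return [(cell, n / expected_rate)
            for cell, n in counts.items() if n / expected_rate >= threshold]

events = [('NYC', 1)] * 9 + [('NYC', 2)] * 2 + [('LA', 1)] * 3
print(bursty_cells(events, expected_rate=2.0))  # [(('NYC', 1), 4.5)]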
Research Activities – Year 4
During the fourth (no-cost extension) year of the
ASTERIX effort, the work at UCR focused on the following major activities:
(i) completing the temporal support for AsterixDB, (ii) providing index
support and parallel execution for VXQuery on top of Hyracks, and
(iii) implementing group-by aggregation at both the local and the global level.
(i) Implementing Temporal Support for ASTERIX. This year we further extended
the feature set on top of the existing temporal implementations. Since all
temporal types are natively supported in AsterixDB, we extended our indexing
framework to enable index structures over comparable temporal types (date,
time, datetime, yearMonthDuration and dayTimeDuration). We also added
time-window support, so that users can create tumbling temporal windows and
apply group-by operations over the items within each window, as well as a
flexible temporal data parser, so that non-standard temporal data can be
parsed and imported into AsterixDB using the native support.
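The sketch below illustrates the semantics of tumbling-window group-by; the AsterixDB surface syntax differs, and this only shows the windowing logic.

from collections import defaultdict
from datetime import datetime, timedelta

def tumbling_groupby(records, key, start, width):
    """Assign each record to the tumbling window
    [start + i*width, start + (i+1)*width) containing its timestamp."""
    groups = defaultdict(list)
    for rec in records:
        i = int((key(rec) - start) / width)
        groups[start + i * width].append(rec)
    return dict(groups)

recs = [{'t': datetime(2013, 7, 1, 10, 5), 'v': 1},
        {'t': datetime(2013, 7, 1, 10, 20), 'v': 2},
        {'t': datetime(2013, 7, 1, 10, 50), 'v': 3}]
wins = tumbling_groupby(recs, key=lambda r: r['t'],
                        start=datetime(2013, 7, 1, 10, 0),
                        width=timedelta(minutes=30))
print({w.time(): [r['v'] for r in rs] for w, rs in wins.items()})
# {datetime.time(10, 0): [1, 2], datetime.time(10, 30): [3]}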
(ii) VXQuery implementation on top of Asterix's Hyracks engine. We developed (through the Apache
Foundation's VXQuery project as part of Google's
Summer of Code) an XQuery engine that runs in a distributed environment by
building it on top of the ASTERIX stack. First, the XQuery engine was developed
so as to process queries in the new stack. Many XQuery functions were needed to
link the distributed layer with the XQuery interface. The VXQuery
software now supports basic queries in a single process. Further, we
incorporated Apache Lucene indexing on top of VXQuery. Steven Jacobs (PhD student at UCR) created an
alternate version of the Asterix collection function
that uses the Lucene store. Experimental evaluations showed execution times as
low as 10% of those of a straightforward approach that simply adds metadata to
each XML file with path information in a Dewey-decimal manner. We also
provided an efficient parallelization
for VXQuery. Instead of starting from scratch, Eldon
Carman (PhD student at UCR) worked on establishing various rewrite rules that
enable Hyracks (through Algebricks)
to optimize a VXQuery query and thus take advantage of parallelization. In
particular, we worked on rewrite rules for
various basic Algebricks operators (like AGGREGATE,
ASSIGN, DATASCAN, NEST, UNNEST and SUBPLAN).
(iii) Local and Global
Group-By Aggregation. Aggregation has been an important operation since the early days of
relational databases. Today's Big Data applications bring further challenges
when processing aggregation queries, demanding adaptive aggregation algorithms
that can process large volumes of data relative to a potentially limited memory
budget (especially in multiuser settings). Despite its importance, the design
and evaluation of aggregation algorithms has not received the same attention
that other basic operators, such as joins, have received in the literature. As
a result, when considering which aggregation algorithm(s) to implement in AsterixDB, we faced a lack of "off-the-shelf" answers
that we could simply read about and then implement based on prior performance
studies. We thus first revisited the engineering of efficient local aggregation
algorithms for use in Big Data platforms. Efficient implementation of the
aggregation operator for a Big Data platform is non-trivial; many
factors, including memory usage, spilling strategy, and I/O and CPU costs,
should be considered. Further, we introduced precise cost models that can help
in choosing an appropriate algorithm based on input parameters including memory
budget, grouping key cardinality, and data skew. We proposed [14] two new
algorithm variants, the Hash-Sort algorithm and the Pre-Partitioning
algorithm, which are now included in the latest release of AsterixDB,
together with the classic sort-merge algorithm (due to its good performance
when aggregating sorted data). Further, we extended [15] the local aggregation
work to a clustered environment (global aggregation), where additional
factors, such as per-machine workload balancing and network costs, are
considered.
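The core memory-bounded idea behind these algorithms can be sketched as follows: aggregate into an in-memory hash table and spill partial results whenever the memory budget is exceeded. This is only the shared skeleton; the actual Hash-Sort and Pre-Partitioning algorithms in [14] differ in how spilled runs are organized and merged.

def hash_aggregate(records, max_groups, spill):
    """Aggregate (key, value) pairs in an in-memory hash table; when the
    table exceeds the group budget, flush it via spill() and start fresh.
    Spilled partial aggregates for the same key must be merged later."""
    table = {}
    for key, value in records:
        if key not in table and len(table) >= max_groups:
            spill(table)          # memory full: flush partial aggregates
            table = {}
        table[key] = table.get(key, 0) + value
    spill(table)                  # flush the final partial aggregates

# toy run: a 2-group memory budget forces one spill
runs = []
hash_aggregate([('a', 1), ('b', 2), ('c', 3), ('a', 4)],
               max_groups=2, spill=runs.append)
print(runs)  # [{'a': 1, 'b': 2}, {'c': 3, 'a': 4}]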
(iv) Other Research. We created a demo of STEM (Spatio-Temporal Miner), a system for finding spatiotemporal
burstiness patterns in a collection of spatially
distributed frequency streams [16]. STEM implements the full functionality
required to mine spatiotemporal burstiness patterns
from virtually any collection of geostamped streams.
Examples of such collections include document streams (e.g., online
newspapers) and geo-aware microblogging platforms (e.g., Twitter). We
continued our work on
filtering using specialized hardware. We first considered complex pattern
trajectory queries (described as regular expressions over a spatial alphabet
that can be implicitly or explicitly anchored to the time domain). In addition,
our pattern queries may contain variables (which substantially enhance their
flexibility and expressive power). In [17] (best paper award) we showed how to perform
such filtering using an FPGA architecture. Our
experimental results showed that the FPGA approach outperformed the
state-of-the-art CPU-based approach by over three orders of magnitude. Next, in
[18] we showed how to perform holistic (no post-processing) evaluation of
thousands of complex twig-style XPath queries in a
streaming (single-pass) fashion, using GPUs. Our XML filtering results using
specialized hardware are summarized in [19].
Publications:
[1] Roger Moussalli, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Accelerating XML Query Matching through Custom Stack Generation on FPGAs. HiPEAC 2010: 141-155.
[2] Roger Moussalli, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Massively Parallel XML Twig Filtering Using Dynamic Programming on FPGAs. ICDE 2011: 948-959.
[3] Mariam Salloum, Vassilis J. Tsotras: A Unified Approach for Structural XML Processing. Under submission.
[4] Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On Query Result Diversification. ICDE 2011: 1163-1174.
[5] Mahbub Hasan, Abdullah Mueen, Vassilis J. Tsotras, Eamonn J. Keogh: Diversifying Query Results on Semi-Structured Data. CIKM 2012: 2099-2103.
[6] Roger Moussalli, Robert Halstead, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras: Efficient XML Path Filtering Using GPUs. ADMS@VLDB 2011: 9-18.
[7] Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras: Online Ordering of Overlapping Data Sources. PVLDB 7(3): 133-144 (2013).
[8] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras: ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models. Distributed and Parallel Databases 29(3): 185-216 (2011).
[9] M. Hasan, A. Kashyap, V. Hristidis, V. J. Tsotras: Adaptive Diversification of Query Results. Submitted for publication (2013).
[10] Mahbub Hasan, Abdullah Mueen, Vassilis J. Tsotras: Distributed Diversification of Large Datasets. IEEE International Conference on Cloud Engineering (IC2E 2014), Boston, MA, March 2014, to appear.
[11] Wenyu Huo, Vassilis J. Tsotras: A Comparison of Top-k Temporal Keyword Querying over Versioned Text Collections. DEXA (2) 2012: 360-374.
[12] Wenyu Huo, Vassilis J. Tsotras: Querying Transaction-Time Databases under Branched Schema Evolution. DEXA (1) 2012: 265-280.
[13] Theodoros Lappas, Marcos R. Vieira, Dimitrios Gunopulos, Vassilis J. Tsotras: On the Spatiotemporal Burstiness of Terms. PVLDB 5(9): 836-847 (2012).
[14] Jian Wen, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras: Revisiting Aggregation for Data Intensive Applications: A Performance Study. Submitted for publication (2013).
[15] Jian Wen: Revisiting Aggregation Techniques for Data Intensive Applications. PhD Thesis, UC Riverside, 2013.
[16] Theodoros Lappas, Marcos R. Vieira, Dimitrios Gunopulos, Vassilis J. Tsotras: STEM: A Spatio-Temporal Miner for Bursty Activity. SIGMOD Conference 2013: 1021-1024.
[17] Roger Moussalli, Marcos R. Vieira, Walid A. Najjar, Vassilis J. Tsotras: Stream-Mode FPGA Acceleration of Complex Pattern Trajectory Querying. SSTD 2013: 201-222.
[18] Ildar Absalyamov, Roger Moussalli, Vassilis J. Tsotras, Walid A. Najjar: High-Performance XML Twig Filtering Using GPUs. ADMS@VLDB 2013: 13-24.
[19] Roger Moussalli, Mariam Salloum, Robert Halstead, Walid A. Najjar, Vassilis J. Tsotras: A Study on Parallelizing XML Path Filtering Using Accelerators. ACM Transactions on Embedded Computing Systems (TECS), accepted for publication (2013).