Journal Contents

2018:
Volume 5 Issue 1, 2018 [Full Issue PDF will be made online]

2017:
Volume 4 Issue 1, 2017 [Full Issue PDF will be made online] [Paper 1] [Paper 2]
Volume 4 Issue 2, 2017 [Full Issue PDF will be made online]
Volume 4 Issue 3, 2017 [Full Issue PDF will be made online]
Volume 4 Issue 4, 2017 [Full Issue PDF will be made online]

2016:
Volume 3 Issue 1, 2016 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]
Volume 3 Issue 2, 2016 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]
Volume 3 Issue 3, 2016 [Full Issue PDF will be made online] [Paper 1] [Paper 2]
Volume 3 Issue 4, 2016 [Full Issue PDF will be made online]

2015:
Volume 2 Issue 4, 2015 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]
Volume 2 Issue 3, 2015 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]
Volume 2 Issue 2, 2015 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]
Volume 2 Issue 1, 2015 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]

2014:
Volume 1 Issue 1, 2014 [Full Issue PDF] [Paper 1] [Paper 2] [Paper 3] [Paper 4]

==========================================================================================

Volume 4 Issue 1, 2017 [Full Issue PDF will be made online]

  • Big Data Workflows: A Reference Architecture and the DATAVIEW System

  • Should be cited as: [ paper pdf download]

    A. Kashlev, S. Lu, A. Mohan, "Big Data Workflows: A Reference Architecture and the DATAVIEW System", Services Transactions on Big Data (STBD), 4(1), 2017, pp. 1-19, doi: 10.29268/stbd.2016.4.1.1.

    Abstract:

    The big data era is here, a natural result of the digital revolution of the last few decades. The emergence of big data in virtually all areas of life raises a fundamental question - how can we turn large volumes of bits and bytes into insights and, possibly, value? The answer to this question is often hindered by three big data challenges: volume, velocity, and variety. While scientific workflows have been used extensively in structuring complex scientific data analysis processes, they fall short in meeting the three big data challenges on the one hand, and in leveraging the dynamic resource provisioning capability of cloud computing on the other hand. To address such limitations, we propose and develop the concept of big data workflow as the next generation of data-centric workflow technologies. In this paper we: 1) identify the key challenges for running big data workflows in the cloud; 2) propose a reference architecture for big data workflow management systems (BDWFMSs) that addresses these challenges; 3) develop DATAVIEW, a big data workflow management system, to validate our proposed reference architecture; and 4) design and run two big data workflows in the automotive and astronomy domains to showcase applications of our DATAVIEW system.

  • Features that Distinguish Drivers: Big Data Analytics of Naturalistic Driving Data

  • Should be cited as: [ paper pdf download]

    B. Wallace, F. Knoefel, R. Goubran, M. M. Porter, A. Smith, S. Marshall, "Features that Distinguish Drivers: Big Data Analytics of Naturalistic Driving Data", Services Transactions on Big Data (STBD), 4(1), 2017, pp. 20-32, doi: 10.29268/stbd.2016.4.1.2.

    Abstract:

    The unique behaviours of drivers have many emerging applications. These include the personalization of automated/self-driving vehicles so that their owners are more comfortable with them, and the identification of changing driving behaviours that may be associated with aging or disease. This paper explores measures of driving behaviour that might allow drivers to be differentiated based on their individual driving characteristics. An emerging challenge within longitudinal studies of drivers is to distinguish between different drivers of a shared vehicle. The problem also has applications in the insurance industry, where insurance risk and the associated owner premium depend on the diversity, or lack thereof, of drivers for a vehicle, for example a vehicle driven (or never driven) by secondary drivers with higher-risk driving behaviours. In this paper, a big data set of driving data for 14 older drivers is analyzed - a single year of data includes over 250,000 km and almost 5000 hours of driving for the 14 drivers. A set of 162 trip-level calculated features is analyzed to determine their ability to distinguish between two drivers of a vehicle. The results show that features based on road choice and driver-chosen velocity provide the best performance, individually and in feature pairs, with two features providing error rates below 5% for some driver pairs. The set of features that provided the best performance differed for each driver pair and was found to include features from measures of a driver’s road choice, velocity and velocity ratio, in addition to features measuring trip similarity to two-phase acceleration and deceleration relationships for the driver. The best error rate obtained was 1.5% for a driver pair. On the other hand, the results suggest that a number of features and feature groups do not allow for older-driver differentiation. For instance, overnight driving and high rates of acceleration are not exhibited by these drivers often enough to be useful.

=======================

Volume 3 Issue 3, 2016 [Full Issue PDF will be made online]

  • Tree Matching on Parallel Machines using Data Shaping

  • Should be cited as: [ paper pdf download]

    P. Shukla, A. K. Somani, "Tree Matching on Parallel Machines using Data Shaping ", Services Transactions on Big Data (STBD), 3(3), 2016, pp. 1-16, doi: 10.29268/stbd.2016.3.3.1.

    Abstract:

    Real-time big data analytics has become important to meet business as well as other decision-making needs in many complex applications. A significant portion of such data is available and stored in semi-structured form, where a tree-based organization is commonly used. Tree matching is a core component of many applications such as fraud detection, spam filtering, information visualization and extraction, user authentication, natural language processing, XML databases, bioinformatics, etc. Comparing ordered (or unordered) trees is compute-intensive, in particular for Big Data. To facilitate the comparison of ordered trees, in this paper we address the problem of shaping the semi-structured data to enable time-efficient processing on contemporary hardware such as a GPGPU (General Purpose Graphics Processing Unit) and the Intel MIC (a many-core processor). Specifically, our data shaping approach enables pre-computation of partial edit distance values in parallel. We also develop processor-specific techniques keeping in mind the compute requirements of the various constituent stages of parallel tree edit distance (PTED) computation. We evaluate our work using real-world data sets. Our experimental results show that our SIMT-based PTED-GPU (Parallel Tree Edit Distance using GPU) implementation achieves speedups of up to 12X when compared to the state of the art in tree edit distance (TED) computation. In addition, our techniques, when ported to CPUs, are scalable. Finally, we discuss the appropriateness of compute platforms with respect to the various constituent stages of the proposed PTED.

  • QDrill: Distributing the Undistributable for Big Data Analytics

  • Should be cited as: [ paper pdf download]

    S. Khalifa, P. Martin, R. Young, "QDrill: Distributing the Undistributable for Big Data Analytics", Services Transactions on Big Data (STBD), 3(3), 2016, pp. 17-27, doi: 10.29268/stbd.2016.3.3.2.

    Abstract:

    Adapting classification algorithms to handle the huge volumes of Big Data is a problem that is usually addressed by rewriting the classification algorithms to run in a distributed fashion using parallel frameworks like Hadoop or Spark. While this approach can result in fast algorithms, it is time consuming and can be very challenging to implement for all algorithms. To overcome this challenge, we previously introduced QDrill, an open-source framework for distributed analytics that avoids this process of rewriting the algorithms. QDrill adds analytics capabilities to Apache Drill, a schema-free SQL query engine for non-relational storage, to distribute the execution of existing single-node classification algorithms without rewriting them. However, QDrill did not support all classification algorithms, since not all algorithms are the same. Some algorithms are “UnDistributable” because they require loading the entire dataset into the memory of a single node to train a classifier. The UnDistributable algorithms represent more than 50% of the classification algorithms; they are not supported by the previous QDrill version and they are usually not implemented in the distributed data mining libraries (e.g., Mahout). In this work, we extend QDrill to address the challenges of distributing the UnDistributables. As a proof of concept, QDrill distributes WEKA’s UnDistributable algorithms, thus supporting 100% of WEKA’s classification algorithms without any algorithm rewrites. Our empirical studies show that the models produced by our proposed solution have similar, and sometimes even better, accuracy (fewer misclassifications) than training a single classifier on the entire training dataset. The proposed solution also has significantly faster training and scoring times for large datasets compared to the single-node versions.
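
    QDrill's actual distribution mechanism is described in the paper itself. Purely as a generic illustration of one common way to use an unmodified single-node learner on partitioned data (train one model per partition, then combine predictions by majority vote), and not necessarily what QDrill does, a minimal Python sketch with invented class names might look like this:

    from collections import Counter

    class MajorityClassLearner:
        """Stand-in single-node learner: always predicts the majority class it saw."""
        def fit(self, X, y):
            self.label = Counter(y).most_common(1)[0][0]
            return self
        def predict(self, X):
            return [self.label for _ in X]

    class MajorityVoteEnsemble:
        """Train one unmodified single-node learner per data partition and
        combine their predictions by majority vote at scoring time."""
        def __init__(self, learner_factory):
            self.learner_factory = learner_factory   # builds a fresh single-node learner
            self.models = []
        def fit(self, partitions):
            for X, y in partitions:                  # each partition fits in one node's memory
                self.models.append(self.learner_factory().fit(X, y))
            return self
        def predict(self, X):
            votes = [m.predict(X) for m in self.models]
            # Majority vote per instance across the per-partition models.
            return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]

    partitions = [([[0], [1]], ["spam", "spam"]),
                  ([[2], [3]], ["ham", "ham"]),
                  ([[4], [5]], ["spam", "spam"])]
    ensemble = MajorityVoteEnsemble(MajorityClassLearner).fit(partitions)
    print(ensemble.predict([[6], [7]]))              # ['spam', 'spam']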

=======================

Volume 3 Issue 2, 2016 [Full Issue PDF]

  • Empirical Evaluation of Big Data Analytics using Design of Experiment: Case Studies on Telecommunication Data

  • Should be cited as: [ paper pdf download]

    S. Singh, Y. Liu, W. Ding, Z. Li, "Empirical Evaluation of Big Data Analytics using Design of Experiment: Case Studies on Telecommunication Data ", Services Transactions on Big Data (STBD), 3(2), 2016, pp. 1-20, doi: 10.29268/stbd.2016.3.2.1.

    Abstract:

    Data analytics involves the process of data collection, data analysis, and report generation, and data mining workflow tools usually orchestrate this process. The data analysis step in this process further consists of a series of machine learning algorithms. There exists a variety of data mining tools and machine learning algorithms. Each tool or algorithm has its own set of features that become factors affecting both the functional and non-functional attributes of a data analytics system. Given the domain-specific requirements of data analytics, understanding the effects of these factors and their combinations provides a guideline for selecting workflow tools and machine learning algorithms. In this paper, we develop an empirical evaluation method based on the principle of Design of Experiment. We apply this method to evaluate data mining tools and machine learning algorithms towards building big data analytics for telecommunication monitoring data. Two case studies are conducted to provide insights into the relations between the requirements of data analytics and the choice of a tool or algorithm in the context of data analysis workflows. The demonstration also shows that our evaluation method can facilitate the replication of this evaluation study, and can conveniently be expanded to evaluate other tools and algorithms.

  • Parallel Processing of Top-K Trajectory Similarity Queries with GPGPUs

  • Should be cited as: [ paper pdf download]

    E. Leal, L. Gruenwald, J. Zhang, "Parallel Processing of Top-K Trajectory Similarity Queries with GPGPUs ", Services Transactions on Big Data (STBD), 3(2), 2016, pp. 21-35, doi: 10.29268/stbd.2016.3.2.2.

    Abstract:

    Through the use of location-sensing devices, it has been possible to collect very large datasets of trajectories. These datasets make it possible to issue spatio-temporal queries with which users can gather information about the characteristics of the movements of objects, derive patterns from that information, and understand the objects themselves. Among such spatio-temporal queries that can be issued is the top-K trajectory similarity query. This query finds many applications, such as bird migration analysis in ecology and trajectory sharing in social networks. However, the large size of the trajectory query sets and databases poses significant computational challenges. In this work, we propose a parallel GPGPU algorithm Top-KaBT that is specifically designed to reduce the size of the candidate set generated while processing these queries, and in doing so strives to address these computational challenges. The experiments show that the state of the art top-K trajectory similarity query processing algorithm on GPGPUs, TKSimGPU, achieves a 6.44X speedup in query processing time when combined with our algorithm and a 13X speedup over a GPGPU algorithm that uses exhaustive search. The experiments also show that the time overhead incurred by Top-KaBT is very small because it only represents 2.5% of the average query execution time spent by TKSimGPU when combined with Top-KaBT.

  • Deviated Expectation based Classification Method for Stock Price Prediction

  • Should be cited as: [ paper pdf download]

    S. Ruan, J. Y. M. Lai, X. Chen, X. Zhang, "Deviated Expectation based Classification Method for Stock Price Prediction", Services Transactions on Big Data (STBD), 3(2), 2016, pp. 36-46, doi: 10.29268/stbd.2016.3.2.3.

    Abstract:

    Nowadays, with the fast development of social media, instant financial news can spread quickly over the Internet and consequently cause strong vibrations in the stock market within a short period of time. To capture the effect of financial news, various classification algorithms have been proposed in the financial data mining domain. Unfortunately, it has been shown that positive news does not necessarily drive up the stock price. It is believed that there must exist some underlying events, called anchor events in this paper, which drive the stock market up or down. Therefore, this paper proposes that classification should be performed not only on the textual features of financial news, but also on the semantic distance between the news being tested and the news of anchor events. Accordingly, this semantic distance is measured as a deviated expectation calculated from the distance between two sets of extracted topics. The proposed method involves two steps, i.e., determining whether the tested news is consistent with the anchor events, and calculating how far the tested news is from the anchor events. We then evaluate our method as well as state-of-the-art classification algorithms on real data sets. The promising experimental results demonstrate that the proposed method is superior to the state-of-the-art classification algorithms in terms of classification accuracy.

  • Mutual Community Detection across Multiple Partially Aligned Social Networks

  • Should be cited as: [ paper pdf download]

    J. Zhang, S. Jin, P. S. Yu, "Mutual Community Detection across Multiple Partially Aligned Social Networks", Services Transactions on Big Data (STBD), 3(2), 2016, pp. 47-69, doi: 10.29268/stbd.2016.3.2.4.

    Abstract:

    To enjoy more social network services, users nowadays are usually involved in multiple online social networks simultaneously. Networks that involve some common users are referred to as multiple “partially aligned networks”. In this paper, we want to detect communities of multiple partially aligned networks simultaneously, which is formally defined as the “Mutual Community Detection” problem. To solve the mutual community detection problem, a novel community detection method, MCD (Mutual Community Detector), is proposed in this paper. MCD can detect the social community structures of users in multiple partially aligned networks at the same time, with full consideration of (1) the characteristics of each network, and (2) the information of the shared users across aligned networks. In addition, to handle large-scale aligned networks, we extend MCD and propose MCD-SCALE. MCD-SCALE applies a distributed multilevel k-way partitioning method to divide the networks into k partitions sequentially. Extensive experiments conducted on two real-world partially aligned heterogeneous social networks demonstrate that MCD and MCD-SCALE can solve the “Mutual Community Detection” problem very well.

=======================

Volume 3 Issue 1, 2016 [Full Issue PDF]

  • An Experimental Investigation of Mobile Network Traffic Prediction Accuracy

  • Should be cited as: [ paper pdf download]

    A. Y. Nikravesh, S. A. Ajila, C.-H. Lung, "An Experimental Investigation of Mobile Network Traffic Prediction Accuracy ", Services Transactions on Big Data (STBD), 3(1), 2016, pp. 1-16, doi: 10.29268/stbd.2016.3.1.1.

    Abstract:

    The growth in the number of mobile subscriptions has led to a substantial increase in the mobile network bandwidth demand. The mobile network operators need to provide enough resources to meet the huge network demand and provide a satisfactory level of Quality-of-Service (QoS) to their users. However, in order to reduce the cost, the network operators need an efficient network plan that helps them provide cost effective services with a high degree of QoS. To devise such a network plan, the network operators should have an in-depth insight into the characteristics of the network traffic. This paper applies the time-series analysis technique to decomposing the traffic of a commercial trial mobile network into components and identifying the significant factors that drive the traffic of the network. The analysis results are further used to enhance the accuracy of predicting the mobile traffic. In addition, this paper investigates the accuracy of machine learning techniques – Multi-Layer Perceptron (MLP), Multi-Layer Perceptron with Weight Decay (MLPWD), and Support Vector Machines (SVM) – to predict the components of the commercial trial mobile network traffic. The experimental results show that using different prediction models for different network traffic components increases the overall prediction accuracy up to 17%. The experimental results can help the network operators predict the future resource demands more accurately and facilitate provisioning and placement of the mobile network resources for effective resource management.

  • An Investigation of Mobile Network Traffic Data and Apache Hadoop Performance

  • Should be cited as: [ paper pdf download]

    M. Si, C.-H. Lung, S. Ajila, W. Ding, "An Investigation of Mobile Network Traffic Data and Apache Hadoop Performance ", Services Transactions on Big Data (STBD), 3(1), 2016, pp. 17-31, doi: 10.29268/stbd.2016.3.1.2.

    Abstract:

    Since the emergence of mobile networks, the number of mobile subscriptions has continued to increase year after year. To efficiently assign mobile network resources such as spectrum (which is expensive), the network operator needs to critically process and analyze information and develop statistics about each base station and the traffic that passes through it. This paper presents an application of data analytics that focuses on processing and analyzing two datasets from a commercial trial mobile network. A detailed description of using Apache Hadoop and the Mahout machine learning library to process and analyze the datasets is presented. The analysis provides insights into the resource usage of network devices. This information is of great importance to network operators for efficient and effective management of resources and for supporting a high quality of user experience. Furthermore, an investigation has been conducted that evaluates the impact of executing the Mahout clustering algorithms with various system and workload parameters on a Hadoop cluster. The results demonstrate the value of performance data analysis. Specifically, the execution time can be significantly reduced using data pre-processing, some machine learning techniques, and Hadoop. The investigation provides useful information to network operators for future real-time data analytics.

  • On Developing the RaaS

  • Should be cited as: [ paper pdf download]

    C.-C. Y, H. Chen, L.-J. Zhang, X.-N. Li, H. Liang, "On Developing the RaaS", Services Transactions on Big Data (STBD), 3(1), 2016, pp. 32-43, doi: 10.29268/stbd.2016.3.1.3.

    Abstract:

    Choice is a pervasive feature of social life that profoundly affects us. Ranking results can be used as a reference to help people make a correct choice, but there are two problems. The first is that, most of the time, service providers give people fixed ranking results rather than the ranking methods themselves as a reference when making a choice. For example, the TIMES World University Rankings can be used as a reference when choosing a college. However, among the numerous factors that affect an object’s ranking, people have their own understanding of the effect of each factor. Using mobile phone selection as a practical case, some people think the performance of a mobile phone is more important, while others hold the view that the appearance of a mobile phone is more attractive. What’s more, many ranking methods have been proposed, such as the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and expert marking. Using only one kind of ranking method for object ranking may lead to overly objective or overly subjective ranking results. Although various ranking algorithms have been studied, very little is known about the detailed development and deployment of ranking services. This paper proposes a comprehensive solution of Ranking as a Service (RaaS), with manifold contributions: Firstly, we use a combination weighting method in RaaS, which can overcome the defects of purely subjective and purely objective weighting methods. Secondly, we develop ranking service APIs that bring convenience to people when making choices. Thirdly, the ranking service provides ranking results for people according to their own understanding of the effect of each factor on object ranking. Fourthly, this paper is arguably the first to propose using Ranking as a Service. Finally, we evaluate the proposed RaaS solution.

  • Evaluations of Big Data Processing

  • Should be cited as: [ paper pdf download]

    D. S. Terzi, U. Demirezen, S. Sagiroglu, "Evaluations of Big Data Processing", Services Transactions on Big Data (STBD), 3(1), 2016, pp. 44-53, doi: 10.29268/stbd.2016.3.1.4.

    Abstract:

    The big data phenomenon refers to large, heterogeneous and complex data sets that pose many challenges in storing, preparing, analyzing and visualizing data, as well as to the techniques and technologies for making better decisions and services. Uncovering hidden patterns, unknown or unpredicted relations and secret correlations is achieved via big data analytics. This can help companies and organizations form new ideas, gain richer and deeper insights, broaden their horizons, gain advantages over their competitors, etc. To make big data analytics easy and efficient, a lot of big data techniques and technologies have been developed. In this article, the chronological development of batch, real-time and hybrid technologies, together with their advantages and disadvantages, is reviewed, and a number of criticisms of the available processing techniques and technologies are offered. This paper can serve as a roadmap for researchers who work on big data analytics.

=======================

Volume 2 Issue 4, 2015 [Full Issue PDF]

  • An Asynchronous Method for Write Optimization of Column-Store Databases in Map-Reduce

  • Should be cited as: [ paper pdf download]

    F. Yu, E. S. Jones, W. Xiong, M. Hamdi, W.-C. Hou, "An Asynchronous Method for Write Optimization of Column-Store Databases in Map-Reduce ", International Journal of Big Data (IJBD), 2(4), 2015, pp. 1-9, doi: 10.29268/stbd.2015.2.4.1.

    Abstract:

    Column-store databases feature a faster data reading speed compared with traditional row-based databases. However, optimizing write operations in a column-store database is a well-known challenge. Most existing work on write performance optimization focuses on main-memory column-store databases. In this work, we extend the research on column-store databases to the Map-Reduce environment. We propose a data storage format called the Timestamped Binary Association Table (TBAT), which does not require global indexing. Based on TBAT, a new update method, called Asynchronous Map-Only Update (AMO Update), is designed to replace the traditional update. A significant improvement in speed is shown in experiments comparing the AMO update with the traditional update.
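
    The TBAT format and the AMO update are defined precisely in the paper. As a rough, single-process illustration of the general idea only (append-only writes carrying timestamps, with the current value resolved at read time), and with all names below invented rather than taken from the paper, a Python sketch could look like this:

    class TimestampedColumn:
        """Toy append-only store: writes are cheap appends; the latest
        timestamp per key wins when the column is read back."""

        def __init__(self):
            self._rows = []          # list of (timestamp, key, value) records
            self._clock = 0          # monotonically increasing logical clock

        def append(self, key, value):
            """Write path: append only, no in-place update and no global index."""
            self._clock += 1
            self._rows.append((self._clock, key, value))

        def snapshot(self):
            """Read path: reduce the append log to the latest value per key."""
            latest = {}
            for ts, key, value in self._rows:
                if key not in latest or ts > latest[key][0]:
                    latest[key] = (ts, value)
            return {key: value for key, (ts, value) in latest.items()}

    col = TimestampedColumn()
    col.append("user42", "alice")
    col.append("user42", "alice_smith")   # later write supersedes the earlier one
    print(col.snapshot())                 # {'user42': 'alice_smith'}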

  • 5C, A New Model of Defining Big Data

  • Should be cited as: [ paper pdf download]

    L.-J. Zhang, J. Zeng, "5C, A New Model of Defining Big Data ", International Journal of Big Data (IJBD), 2(4), 2015, pp. 10-23, doi: 10.29268/stbd.2015.2.4.2.

    Abstract:

    Big data as an emerging paradigm has revolutionized IT and is typically characterized by the 4V features (volume, velocity, variety, veracity). However, with the rise of the digital economy 2.0, the 4V features merely give non-functional criteria; they cannot precisely depict the essence of big data and fail to address how to apply big data in actual scenarios. Therefore, to advance the down-to-earth application of big data in the digital economy 2.0, we propose a brand new model, 5C (creator, channel, center, context, consumer), which fundamentally redefines the requirements of big data and gives a novel methodology for big data. In the proposed 5C model, creator elucidates who creates big data. Channel refers to the transmission of big data. Center denotes how data are capitalized and further formed into valuable assets. Context is the application scenario of big data on different platforms. Consumer involves the users of big data under specific contexts. To demonstrate the proposed 5C model, we use the big data architecture of Amazon and the business opportunity map of Kingdee as cases to illustrate what we propose. We also summarize successful rules for people to embrace big data under the digital economy 2.0. Finally, we conclude that the presented 5C model can be used as a set of basic principles and a guideline for leading the construction of enterprise big data.

  • BDOA: Big Data Open Architecture

  • Should be cited as: [ paper pdf download]

    L.-J. Zhang, H. Chen, "BDOA: Big Data Open Architecture", International Journal of Big Data (IJBD), 2(4), 2015, pp. 24-48, doi: 10.29268/stbd.2015.2.4.3.

    Abstract:

    The initiative of drafting the Big Data Open Architecture (BDOA) started in late 2014, and the Services Society (S2) formally announced the Body of Knowledge of Big Data (BoK) project at the IEEE SERVICES 2015 conference. Many volunteers from all over the world joined the BoK project and contributed comments and suggestions. By the end of 2015, Big Data technologies were no longer brand new to academic researchers or to the industrial world, but had gradually matured. It is interesting that, over the last five years, almost every technology vendor and startup has talked about Big Data. Consequently, both the scientific world and the industrial world expect a Big Data open architecture. This paper proposes the Big Data Open Architecture (BDOA). We hope that the detailed definition and explanation of the high-level architecture building blocks will help researchers, engineers, teachers, and students not only to gain an overview of BDOA but also to gain deeper insight into Big Data. In the paper, we first give an overview of the open architecture. Secondly, we illustrate the high-level Architecture Building Blocks (ABBs) of each level of the open architecture, followed by BDOA-based engineering practices. Finally, we summarize the paper and shed some light on future research directions.

  • Data Value Chain and Service Ecosystem - A Way to Achieve Service Computing Supporting "Internet+"

  • Should be cited as: [ paper pdf download]

    L.-J. Zhang, " Data Value Chain and Service Ecosystem - A Way to Achieve Service Computing Supporting "Internet+" ", International Journal of Big Data (IJBD), 2(4), 2015, pp. 49-56, doi: 10.29268/stbd.2015.2.4.4.

    Abstract:

    "Internet +", the application of the internet and other information technology in conventional industries, is the basic strategic direction of enterprise development. The paper proposes that service ecosystem based on data value chain, which is supported by service computing technology, is one of the most effective solutions to realize this strategy. The paper analyzes the development of service computing, explores the formation of data value chain and presents the actual "Internet +" scenario application. Armed with the origin, structure and application of the proposal, we hope that the future business and research can be inspired greatly.

=======================

Volume 2 Issue 3, 2015 [Full Issue PDF]

  • An approach for leveraging personal cloud storage services for team collaboration

  • Should be cited as: [ paper pdf download]

    Z. Cheng, Z. Zhou, K. Ning, L.J. Zhang, T. Rahman, J. Min, "An approach for leveraging personal cloud storage services for team collaboration", International Journal of Big Data (IJBD), 2(3), 2015, pp. 1-14, doi: 10.29268/stbd.2015.2.3.1.

    Abstract:

    With the rapid development of cloud computing technology, cloud-based team collaboration applications are becoming popular on the Web as a way to enlarge storage space and facilitate team collaboration. Among all the features required of a typical team collaboration application, shared storage for documents referred to or artifacts produced by the team is a must-have. However, existing shared storage solutions for team collaboration applications are far from satisfactory. Some of them rely on self-built storage infrastructure; consequently, as the application becomes more powerful and more storage space is required, this can become a big burden, especially for small or medium vendors. With the prevalence of personal cloud storage services, such as Dropbox and Google Drive, more team collaboration applications allow users to share files from their personal cloud-storage spaces through external shared links, which can partly solve the problem. However, this method is neither convenient for team collaboration nor safe enough. This paper presents an approach to leveraging third-party personal cloud-storage services to provide shared storage for team collaboration applications. Compared to existing approaches, our approach provides sophisticated mechanisms to make sure it is more convenient and safer for team work. It brings benefits in three respects: for users, it improves the utilization of personal cloud storage space; for vendors of personal cloud storage services, it helps attract users to their services; and for vendors of team collaboration applications, it reduces the burden of developing self-built storage infrastructure. The approach has been tested in kAct, a task-based team collaboration application provided by Kingdee, and the results are promising.

  • A comprehensive overview of open source big data platforms and frameworks

  • Should be cited as: [ paper pdf download]

    P. Almeida, J. Bernardino, "A comprehensive overview of open source big data platforms and frameworks", International Journal of Big Data (IJBD), 2(3), 2015, pp. 15-33, doi: 10.29268/stbd.2015.2.3.2.

    Abstract:

    Big Data is the paradigm that represents the ability to analyze and cross-reference large amounts of data generated by computational systems and turn them into useful knowledge. This potential is one solution organizations can use to answer the challenge of getting closer to their users. Organization managers face the challenge of understanding the Big Data concept and the business strategies inherent to its use. The high number of challenges that need to be addressed has created a high number of proposed technical solutions that most of the time only overlap existing ones. Frequently, managers face these issues as their organizations race against competitors for market share, without the resources to embrace Big Data or other options that could give a competitive advantage. Therefore, organization owners and managers must be educated about deployed platforms so that they understand the benefits that can be achieved in the short term. In this paper we aim to provide an overview of using Big Data with Open Source tools. We explain the Big Data concept, the potential value and the organizational strategies that must be studied in order to determine which benefits organizations can gain from it. We analyze the strengths and drawbacks of five open source frameworks for distributed data programming – Hadoop, Spark, Storm, Flink and H2O – and seven open source platforms for Big Data Analytics – Mahout, MOA, R Project, Vowpal Wabbit, Pegasus, GraphLab Create and MLLib. There is no single platform that truly embodies a one-size-fits-all solution, so this paper aims to help decision makers by providing as much information as possible and quantifying some tradeoffs.

  • Distributed SPARQL Querying Over Big RDF Data Using Presto-RDF

  • Should be cited as: [ paper pdf download]

    M. Mammo, M. Hassan, S. K. Bansal, "Distributed SPARQL Querying Over Big RDF Data Using Presto-RDF", International Journal of Big Data (IJBD), 2(3), 2015, pp. 34-49, doi: 10.29268/stbd.2015.2.3.3.

    Abstract:

    The processing of large volumes of RDF data requires an efficient storage and query processing engine that can scale well with the volume of data. The initial attempts to address this issue focused on optimizing native RDF stores as well as conventional relational database management systems. But as the volume of RDF data grew to exponential proportions, the limitations of these systems became apparent and researchers began to focus on using big data analysis tools, most notably Hadoop, to process RDF data. Various studies and benchmarks that evaluate these tools for RDF data processing have been published. In the past two and a half years, however, heavy users of big data systems, like Facebook, noted limitations in the query performance of these big data systems and began to develop new distributed query engines for big data that do not rely on map-reduce. Facebook’s Presto is one such example. This paper proposes an architecture based on Presto, Presto-RDF, that can be used to process big RDF data. We evaluate the performance of Presto in processing big RDF data against Apache Hive, and a comparative analysis was also conducted against 4store, a native RDF store. To evaluate the performance of Presto for big RDF data processing, a map-reduce program and a compiler, based on Flex and Bison, were implemented. The map-reduce program loads RDF data into HDFS while the compiler translates SPARQL queries into a subset of SQL that Presto (and Hive) can understand. The evaluation was done with RDF datasets of 10, 20, and 30 million triples. The results of the experiments show that Presto-RDF has much higher performance than Hive and can be used to process big RDF data.
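
    The paper's compiler is built with Flex and Bison and targets the SQL dialect that Presto and Hive accept. Purely as a toy illustration of the underlying idea (each SPARQL triple pattern becomes a self-join over a three-column triples table, and shared variables become join conditions), the following Python sketch uses an invented schema triples(s, p, o) and is not the paper's translator:

    def triple_patterns_to_sql(patterns):
        """Translate basic graph pattern triples into a SQL self-join.

        patterns: list of (subject, predicate, object) strings; tokens starting
        with '?' are variables, everything else is a constant.
        Assumes an (invented) table triples(s, p, o); one alias per pattern.
        """
        aliases = [f"t{i}" for i in range(len(patterns))]
        select, where, seen_vars = [], [], {}
        for alias, pattern in zip(aliases, patterns):
            for col, term in zip(("s", "p", "o"), pattern):
                ref = f"{alias}.{col}"
                if term.startswith("?"):                  # variable
                    if term in seen_vars:                 # shared variable -> join condition
                        where.append(f"{ref} = {seen_vars[term]}")
                    else:
                        seen_vars[term] = ref
                        select.append(f"{ref} AS {term[1:]}")
                else:                                     # constant -> filter
                    where.append(f"{ref} = '{term}'")
        sql = (f"SELECT {', '.join(select) or '*'} FROM "
               + ", ".join(f"triples {a}" for a in aliases))
        if where:
            sql += " WHERE " + " AND ".join(where)
        return sql

    # ?person works at an organization located in 'Boston'
    print(triple_patterns_to_sql([("?person", "worksAt", "?org"),
                                  ("?org", "locatedIn", "Boston")]))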

  • The Development and Deployment of Large-File Upload Services

  • Should be cited as: [ paper pdf download]

    H. Chen, L.J. Zhang, B. Hu, S. Long, L. Luo, C. Xing, "The Development and Deployment of Large-File Upload Services", International Journal of Big Data (IJBD), 2(3), 2015, pp. 50-64, doi: 10.29268/stbd.2015.2.3.4.

    Abstract:

    The popularity of enterprise cloud storage is rapidly growing. A number of Internet service vendors and providers, such as Google, Baidu and Microsoft, have entered this emerging market and released a variety of cloud storage services. These services allow people to access work documents and files anywhere in the world at any time. Interestingly, with the prevalence of the mobile Internet, rich media has become regular and popular, and more and more people use cloud storage to keep their personal photos, music and movies. Nevertheless, the size of media files is often beyond the upper limit that a normal form-based file upload service allows, hence dedicated large-file upload services need to be developed and deployed. Although many cloud vendors offer versatile cloud storage services, very little is known about the detailed development and deployment of large-file upload services. This paper proposes a complete large-file upload service solution, with manifold contributions: Firstly, we do not limit the maximum size of a file that can be uploaded, which is extremely practical for storing huge database resource files generated by ERP tools. Secondly, we developed large-file upload service APIs with very strict verification of correctness, reducing the risk of data inconsistency and improving safety. Thirdly, we extend a service developed recently for team collaboration with the capability of handling large files. Fourthly, this paper is arguably the first one that formalizes the testing and deployment procedures of large-file upload services with the help of Docker. In general, most large-file upload services are exposed to the public and face security and performance issues, which brings much concern. With the proposed Docker-based deployment strategy, we can replicate the large-file upload service agilely and locally, to satisfy massive private or local deployments of KDrive. Finally, we evaluate and analyze the proposed strategies and technologies in light of the experimental results.

=======================

Volume 2 Issue 2, 2015 [Full Issue PDF]

  • Architecture for Intelligent Big Data Analysis based on Automatic Service Composition

  • Should be cited as: [ paper pdf download]

    T. H. Akila S. Siriweera, Incheon Paik, Banage T. G. S. Kumara, C. K. Koswatta, "Architecture for Intelligent Big Data Analysis based on Automatic Service Composition", International Journal of Big Data (IJBD), 2(2), 2015, pp. 1-14, doi: 10.29268/stbd.2015.2.2.1.

    Abstract:

    Big Data contains massive amounts of information generated from heterogeneous, autonomous sources on distributed and anonymous platforms. This raises extreme challenges for organizations that need to store and process these data. The conventional pathway of storing and processing is carried out as a collection of manual steps and consumes various resources; an automated, real-time and online analytical process is the most cognitive solution. Therefore, a state-of-the-art approach is needed to overcome the barriers and concerns currently faced by the Big Data industry. In this paper we propose a novel architecture to automate the data analytics process using Nested Automatic Service Composition (NASC) and the CRoss Industry Standard Process for Data Mining (CRISP-DM) as the main underlying technologies of the solution. NASC is a well-defined, scalable technology for automating multi-disciplinary problem domains, while CRISP-DM is a well-known data science process that can be used as an innovative accumulator of multi-dimensional data sets. CRISP-DM is mapped onto the Big Data analytical process, and NASC automates the CRISP-DM process in an intelligent and innovative way.

  • Real-Time Optimization for Disaster Response: A Mathematical Programming Approach

  • Should be cited as: [ paper pdf download]

    Helsa Heishun Chan, Kwai L. Wong, Yong-Hong Kuo, Janny M. Y. Leung, Kelvin K. F. Tsoi, Helen M. Meng, "Real-Time Optimization for Disaster Response: A Mathematical Programming Approach", International Journal of Big Data (IJBD), 2(2), 2015, pp. 15-27, doi: 10.29268/stbd.2015.2.2.2.

    Abstract:

    Disasters are sudden and calamitous events that can cause severe and pervasive negative impacts on society and huge human losses. Governments and humanitarian organizations have been putting tremendous effort into avoiding and reducing the negative consequences of disasters. In recent years, information technology and big data have played an important role in disaster management. While there has been much work on disaster information extraction and dissemination, real-time optimization for decision support in disaster response is rarely addressed in big data research. With big data as an enabler, optimization of disaster response decisions from a systems perspective would facilitate coordination among governments and humanitarian organizations to transport emergency supplies to affected communities more effectively and efficiently when a disaster strikes. In this paper, we propose a mathematical programming approach, with real-time disaster-related information, to optimize post-disaster decisions for emergency supplies delivery. Since timeliness is key in a disaster relief setting, we propose a rounding-down heuristic to obtain near-optimal solutions for rapid and effective response. We also conduct two computational studies. The first is a case study of Iran that examines the characteristics of the solutions provided by our solution methodology. The second evaluates the computational performance, in terms of effectiveness and efficiency, of the proposed rounding-down heuristic. Computational results show that our proposed approach can obtain near-optimal solutions in a short period of time for large and practical problem sizes. This is an extended work of Kuo et al., 2015, which was published in the Proceedings of the IEEE International Congress on Big Data (Big Data Congress) 2015.

  • Task And Data Allocation Strategies for Big Data Workflows

  • Should be cited as: [ paper pdf download]

    Mahdi Ebrahimi, Aravind Mohan, Andrey Kashlev, Shiyong Lu, Robert G. Reynolds, "Task And Data Allocation Strategies for Big Data Workflows", International Journal of Big Data (IJBD), 2(2), 2015, pp. 28-42, doi: 10.29268/stbd.2015.2.2.3.

    Abstract:

    The makespan of a big data workflow is the time elapsed between the start of the first task and the completion of the last task in the workflow. This time includes the delivery of the final data product to the desired location within the network. Due to the large number of inputs and intermediate outputs of a big data workflow activity, the makespan of the workflow is significantly influenced by how its tasks and datasets are allocated in a distributed computing environment. Therefore, reducing makespans of big data workflows can be achieved by incorporating a data and task allocation strategy into an execution planning phase performed by a workflow management system. This creates a pressing need for an investigation of such strategies. To address this need, this paper provides a formal definition of the makespan minimization problem for big data workflows and proposes efficient workflow execution planning strategies. In particular, two algorithms, WEP-A and WEP-B, following different strategies are proposed. WEP-A follows a phased approach to the generation of an execution plan whereas WEP-B uses an evolutionary algorithm-based optimization strategy to find a valid plan with the shortest makespan. Both of these strategies are evaluated through extensive simulation experiments by varying workflow graphs and resources in the workflow environment. The results of the experiments demonstrate that WEP-B performs better than WEP-A on a set of benchmark examples. For more complex and large workflows, the improvements due to evolutionary optimization in WEP-B are likely to be even more pronounced.

  • Automated Predictive Big Data Analytics Using Ontology Based Semantics

  • Should be cited as: [ paper pdf download]

    Mustafa V. Nural, Michael E. Cotterell, Hao Peng, Rui Xie, Ping Ma, John A. Miller, "Automated Predictive Big Data Analytics Using Ontology Based Semantics", International Journal of Big Data (IJBD), 2(2), 2015, pp. 43-56, doi: 10.29268/stbd.2015.2.2.4.

    Abstract:

    Predictive analytics in the big data era is taking on an ever increasingly important role. Issues related to the choice of modeling technique, estimation procedure (or algorithm) and efficient execution can present significant challenges. For example, selection of appropriate and optimal models for big data analytics often requires careful investigation and considerable expertise which might not always be readily available. In this paper, we propose to use semantic technology to assist data analysts and data scientists in selecting appropriate modeling techniques and building specific models, as well as in providing the rationale for the techniques and models selected. To formally describe the modeling techniques, models and results, we developed the Analytics Ontology, which supports inferencing for semi-automated model selection. The SCALATION framework, which currently supports over thirty modeling techniques for predictive big data analytics, is used as a testbed for evaluating the use of semantic technology.

=======================

Volume 2 Issue 1, 2015 [Full Issue PDF]

  • Comparing NoSQL Databases with a Relational Database: Performance and Space

  • Should be cited as: [ paper pdf download]

    João Ricardo Lourenço, Bruno Cabral, Jorge Bernardino, Marco Vieira, "Comparing NoSQL Databases with a Relational Database: Performance and Space", International Journal of Big Data (IJBD), 2(1), 2015, pp. 1-14, doi: 10.29268/stbd.2015.2.1.1.

    Abstract:

    The continuous information growth in current organizations has created a need for adaptation and innovation in the field of data storage. Alternative technologies such as NoSQL have been heralded as the solution to the ever-growing data requirements of the corporate world, but these claims have not been backed by many real-world studies. Current benchmarks evaluate database performance by executing specific queries over mostly synthetic data. These artificial scenarios prevent us from easily drawing conclusions for the real world and from appropriately characterizing the performance of databases in a real system. To counter this, we used a real-world enterprise system with real corporate data to evaluate the performance and space characteristics of popular NoSQL databases and compare them to SQL counterparts. We present one of the first write-heavy evaluations using enterprise software and big data. We tested Cassandra, MongoDB, Couchbase Server and MS SQL Server, comparing their performance and total used space while handling demanding and large write requests from a real company with an electrical measurement enterprise system.

  • Directions for Big Data Graph Analytics Research

  • Should be cited as: [ paper pdf download]

    John A. Miller, Lakshmish Ramaswamy, Krys J. Kochut, and Arash Fard, "Directions for Big Data Graph Analytics Research", International Journal of Big Data (IJBD), 2(1), 2015, pp. 15-27, doi: 10.29268/stbd.2015.2.1.2.

    Abstract:

    In the era of big data, interest in analysis and extraction of information from massive data graphs is increasing rapidly. This paper examines the field of graph analytics from a query processing point of view. Whether it be determination of shortest paths or finding patterns in a data graph matching a query graph, the issue is to find interesting characteristics or information content from graphs. Many of the associated problems can be abstracted to problems on paths or problems on patterns. Unfortunately, seemingly simple problems, such as finding patterns in a data graph matching a query graph are surprisingly difficult (e.g., dual simulation has cubic complexity and subgraph isomorphism is NP-hard). In addition, the iterative nature of algorithms in this field makes the simple MapReduce style of parallel and distributed processing less effective. Still, the need to provide answers even for very large graphs is driving the research. Progress, trends and directions for future research are presented.
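
    For readers unfamiliar with the graph-simulation family of pattern matching mentioned above, the sketch below shows textbook basic graph simulation as a naive, single-machine fixed-point computation (candidate sets are repeatedly pruned until stable). It is only a point of reference: it is not dual or tight simulation, and not the distributed, vertex-centric algorithms the surveyed work targets.

    def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
        """Naive fixed-point computation of basic graph simulation.

        q_nodes / g_nodes: dict mapping node -> label
        q_edges / g_edges: dict mapping node -> set of successor nodes
        Returns a dict mapping each query node to the set of data nodes that
        can simulate it (an empty set anywhere means there is no match).
        """
        # Initial candidates: data nodes carrying the same label as the query node.
        sim = {u: {v for v, lbl in g_nodes.items() if lbl == q_nodes[u]}
               for u in q_nodes}
        changed = True
        while changed:
            changed = False
            for u in q_nodes:
                for u_child in q_edges.get(u, set()):
                    # v survives only if some successor of v simulates u_child.
                    keep = {v for v in sim[u]
                            if g_edges.get(v, set()) & sim[u_child]}
                    if keep != sim[u]:
                        sim[u], changed = keep, True
        return sim

    # Toy example: query edge A -> B matched against a three-node data graph.
    q_nodes = {"u1": "A", "u2": "B"}
    q_edges = {"u1": {"u2"}}
    g_nodes = {"v1": "A", "v2": "B", "v3": "A"}
    g_edges = {"v1": {"v2"}}
    print(graph_simulation(q_nodes, q_edges, g_nodes, g_edges))
    # {'u1': {'v1'}, 'u2': {'v2'}}  (v3 is pruned: it has no B-successor)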

  • Towards Real-time Streaming Analytics based on Cloud Computing

  • Should be cited as: [ paper pdf download]

    Sangwhan Cha and Monica Wachowicz, "Towards Real-time Streaming Analytics based on Cloud Computing", International Journal of Big Data (IJBD), 2(1), 2015, pp. 28-40, doi: 10.29268/stbd.2015.2.1.3.

    Abstract:

    Nowadays, streaming data overflows from a diversity of sources and technologies, making traditional data analytics technologies unsuitable for handling the latency of data processing relative to the growing demand for high processing speed and algorithmic scalability. Real-time streaming data analytics is needed to allow applications to analyze streaming data effectively and efficiently. The open source software Apache Storm, a distributed computation system for processing streaming data in real time, has been widely used for building applications that analyze streaming data in real time because it is fast, scalable, fault-tolerant and reliable. This paper proposes a cloud-based architecture based on Apache Storm for supporting an entire streaming data analytics workflow, which consists of data ingestion, data processing, data visualization and data storage.

  • An Approach For Time-Aware Domain-Based Analysis Of Users’ Trustworthiness In Big Social Data

  • Should be cited as: [ paper pdf download]

    Bilal Abu-Salih, Pornpit Wongthongtham, Dengya Zhu, Shihadeh Alqrainy, "An Approach For Time-Aware Domain-Based Analysis Of Users’ Trustworthiness In Big Social Data", International Journal of Big Data (IJBD), 2(1), 2015, pp. 41-56, doi: 10.29268/stbd.2015.2.1.4.

    Abstract:

    In Online Social Networks (OSNs) there is a need for a better understanding of social trust in order to improve the analysis process and the credibility of mined social media data. Given the open environment and the few restrictions associated with OSNs, the medium allows legitimate users as well as spammers to publish their content. Hence, it is essential to measure users’ credibility in various domains and accordingly define influential users in particular domains. Most existing approaches to the trustworthiness evaluation of users in OSNs are generic; there is a lack of domain-based trustworthiness evaluation mechanisms. In OSNs, discovering users’ influence in a specific domain has been motivated by its significance in a broad range of applications such as personalized recommendation systems and expertise retrieval. The aim of this paper is to present an approach to analysing domain-based users’ trustworthiness in OSNs. We provide a novel distinguishing measurement for users in a set of knowledge domains. Domains are extracted from the users’ content using semantic analysis. In order to obtain the level of trustworthiness, a metric incorporating a number of attributes extracted from content analysis and user analysis is consolidated and formulated by considering the temporal factor. We show the accuracy of the proposed algorithm by providing a fine-grained trustworthiness analysis of users and their domains of interest in OSNs using big data infrastructure.

=======================

Volume 1 Issue 1, 2014 [Full Issue PDF]


  • Distributed and Scalable Graph Pattern Matching: Models and Algorithms

  • Should be cited as: [ paper pdf download]

    Arash Fard, M. Usman Nisar, John A. Miller, Lakshmish Ramaswamy, "Distributed and Scalable Graph Pattern Matching: Models and Algorithms", International Journal of Big Data (IJBD), 1(1), 2014, pp. 1-14, doi: 10.29268/stbd.2014.1.1.1.

    Abstract:

    Graph pattern matching is a fundamental operation for many applications, and it has been studied exhaustively in its classical forms. Nevertheless, there are newly emerging applications, like analyzing hyperlinks of the web graph and analyzing associations in a social network, that need to process massive graphs in a timely manner. Given the extremely large size of these graphs and the knowledge they represent, not only are new computing platforms needed, but old models and algorithms should also be revised. In recent years, a few pattern matching models have been introduced that promise a new avenue for pattern matching research on extremely massive graphs. Moreover, several graph processing frameworks like Pregel have recently sought to harness shared-nothing clusters for processing massive graphs through a vertex-centric, Bulk Synchronous Parallel (BSP) programming model. However, developing scalable and efficient BSP-based algorithms for pattern matching is very challenging on these platforms because this problem does not naturally align with a vertex-centric programming paradigm. This paper introduces a new pattern matching model, called tight simulation, which outperforms the previous models in its family in terms of scalability while preserving their important properties. It also presents a novel distributed algorithm based on the vertex-centric programming paradigm for this pattern matching model and several others in the family of graph simulation. Our algorithms are fine-tuned to consider the challenges of pattern matching on massive data graphs. Furthermore, we present an extensive set of experiments involving massive graphs (millions of vertices and billions of edges) to study the effects of various parameters on the scalability and performance of the proposed algorithms.

  • A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments

  • Should be cited as: [ paper pdf download]

    Xite Wang, Derong Shen, Ge Yu, Tiezhang Nie, Yue Kou, "A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments", International Journal of Big Data (IJBD), 1(1), 2014, pp. 15-24, doi: 10.29268/stbd.2014.1.1.2.

    Abstract:

    MapReduce is one of the most popular parallel data processing systems, and it has been widely used in many fields. As one of the most important techniques in MapReduce, the task scheduling strategy is directly related to system performance. However, in multi-user shared MapReduce environments, the existing task scheduling algorithms cannot provide high system throughput when processing batch jobs. Therefore, in this paper, a novel scheduling technique, the Throughput-Driven task scheduling algorithm (TD scheduler), is proposed. Firstly, based on the characteristics of shared MapReduce environments, we propose the framework of the TD scheduler. Secondly, we classify jobs into six states; jobs in different states have different scheduling priorities, and we give the rules of state conversion, which ensure fairness of resource allocation and avoid wasting system resources. Thirdly, we design detailed strategies for job selection and task assignment, which effectively improve the ratio of local task assignment and avoid hotspots. Fourthly, we show that our TD scheduler can be applied to heterogeneous MapReduce clusters with small modifications. Finally, the performance of the TD scheduler is verified through extensive simulation experiments. The experimental results show that our proposed TD scheduler can effectively improve system throughput for batch jobs in shared MapReduce environments.
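
    The TD scheduler's job states and assignment rules are specified in the paper. Only to make the notion of "local task assignment" concrete for readers, the hypothetical snippet below shows the generic greedy idea of serving a node's request with a data-local task before falling back to remote work; it is not the TD scheduler itself.

    def assign_task(requesting_host, pending_tasks):
        """Greedy, locality-aware assignment for a single scheduling request.

        pending_tasks: list of dicts like {"id": ..., "preferred_hosts": set(...)},
        where preferred_hosts are the nodes holding the task's input block.
        Returns the chosen task (removed from the queue) or None if the queue is empty.
        """
        # First pass: prefer a task whose input data already lives on the requester.
        for i, task in enumerate(pending_tasks):
            if requesting_host in task["preferred_hosts"]:
                return pending_tasks.pop(i)
        # Second pass: no local work available, so assign remote work rather than idle.
        return pending_tasks.pop(0) if pending_tasks else None

    queue = [
        {"id": "m1", "preferred_hosts": {"node2", "node3"}},
        {"id": "m2", "preferred_hosts": {"node1"}},
    ]
    print(assign_task("node1", queue)["id"])   # m2: data locality beats queue order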

  • Personalization Big Data vs. Privacy Big Data

  • Should be cited as: [ paper pdf download]

    Benjamin Habegger, Omar Hasan, Lionel Brunie, Nadia Bennani, Harald Kosch, Ernesto Damiani, "Personalization Big Data vs. Privacy Big Data", International Journal of Big Data (IJBD), 1(1), 2014, pp. 25-35, doi: 10.29268/stbd.2014.1.1.3.

    Abstract:

    Personalization is the process of adapting the output of a system to a user’s context and profile. User information such as geographical location, academic and professional background, membership in groups, interests, preferences, opinions, etc. may be used in the process. Big data analysis techniques enable collecting accurate and rich information for user profiles in particular due to their ability to process unstructured as well as structured information in high volumes from multiple sources. Accurate and rich user profiles are important for personalization. For example, such data are required for recommender systems, which try to predict elements that a user has not yet considered. However, the information used for personalization can often be considered private, which raises privacy issues. In this paper, we discuss personalization with big data analysis techniques and the associated privacy challenges. We illustrate these aspects through the ongoing EEXCESS project. We provide a concrete example of a personalization service, proposed as part of the project, that relies on big data analysis techniques.

  • A Forecasting Approach for Data Allocation in Scalable Database Systems

  • Should be cited as: [ paper pdf download]

    Shun-Pun Li, Man-Hon Wong, "A Forecasting Approach For Data Allocation In Scalable Database Systems", International Journal of Big Data (IJBD), 1(1), 2014, pp. 36-48, doi: 10.29268/stbd.2014.1.1.4.

    Abstract:

    In a cloud computing environment, database systems have to serve a large number of tenants instantaneously and handle applications with different load characteristics. To provide a high quality of service, scalable distributed database systems with self-provisioning are required: the number of working nodes is adjusted dynamically based on user demand, and data fragments are reallocated frequently for node-number adjustment and load balancing. The problem of data allocation is different from that in traditional distributed database systems, and therefore existing algorithms may not be applicable. In this paper, we first formally define the problem of data allocation in scalable distributed database systems. Then, we propose a data allocation algorithm which makes use of time series models to perform short-term load forecasting. Online applications often have observable access patterns and peak load hours. With accurate load forecasting, node-number adjustment and fragment reallocation can be performed in advance to avoid node overloading and the performance degradation caused by fragment migrations. In addition, the number of excess working nodes can be minimized to save resources. To verify the feasibility of our forecasting approach, time series analysis is conducted on real access logs. Simulations are performed to evaluate and compare the proposed algorithm with the traditional threshold-based algorithm.
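
    The paper fits time series models to real access logs; the exact models and thresholds are described there. As a deliberately simplified sketch of the provisioning idea only (a one-step exponential-smoothing forecast driving the node count, with all numbers and parameters invented for illustration), consider:

    import math

    def exponential_smoothing_forecast(history, alpha=0.5):
        """One-step-ahead forecast via simple exponential smoothing."""
        level = history[0]
        for load in history[1:]:
            level = alpha * load + (1 - alpha) * level
        return level

    def nodes_needed(forecast_load, capacity_per_node, headroom=1.2):
        """Provision ahead of the forecast load, with a safety margin."""
        return max(1, math.ceil(forecast_load * headroom / capacity_per_node))

    # Hourly request rates observed so far (synthetic numbers).
    history = [800, 950, 1200, 1500, 1700]
    forecast = exponential_smoothing_forecast(history, alpha=0.6)
    print(f"forecast load: {forecast:.0f} req/h -> "
          f"provision {nodes_needed(forecast, capacity_per_node=400)} nodes")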

    Contact Information

    If you have any questions or queries on the Services Transactions on Big Data, please send email to ijbd AT servicessociety.org.