Abstract
Storing, loading, and transforming big data is a challenging task. As data modeling becomes more and more complex, data scientists face the issue of lacking the technical know-how to set up big data processing engines. The process is inefficient and ineffective, and often leads to falling back on old methods,
such as local R scripts and single-server approaches, leaving the big out of big data.
While methods for knowledge discovery and data mining, such as CRISP-DM and SEMMA, are plentiful, methods for the creation of big data processing workflows are not. This research proposes such a method, in which data scientists use pre-defined steps and reference manuals to create a workflow that meets their modeling and computing needs.
A literature and market review shows that tools and techniques are widely available, as are providers of computing power and distributed storage. Microsoft, Amazon, and other cloud computing providers all offer platforms where clusters can be set up to run distributed computing engines such as Hadoop and Spark. These engines run on top of distributed file systems, also provided by these vendors. The benefits of distributed computing and of the soft- and hardware available in the field are evident: efficient, reliable, and scalable computing power without having to invest heavily in on-premise servers. A significant downside, however, is that instructions, methods, and research on how to leverage the power of these tools and techniques are nearly non-existent.
The developed CRISP-DCW method uses a development cycle similar to that of CRISP-DM. The cycle consists of three phases: Problem Context, Design, and Implementation. Each phase has activities with specified deliverables. The Problem Context phase describes and documents the Context & Goals of the project. In the Design phase that follows, the Input, Output, and Processing are described and designed according to the goals and constraints specified in the previous phase. In the Implementation phase, the developed workflow is evaluated and its deployment is specified. The method provides detailed actions to be taken, their dependencies, and the associated deliverables for each activity, creating a well-documented, step-by-step guide for creating big data processing workflows. The method is evaluated in daily practice at a Dutch organization, which provided a big data modeling problem in the context of online ad serving. The method is used to create and document the process of developing a big data processing workflow. A combination of Hadoop shell actions and Spark actions is chained in a Directed Acyclic Graph using the Oozie workflow engine and deployed in a cloud environment. The effectiveness of the workflow is confirmed, and its performance is evaluated for different cluster set-ups.
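As an illustration, a minimal PySpark sketch of what a single Spark action chained in such a workflow might look like is given below; the application name, input and output paths, and column names (campaign_id, clicks) are hypothetical placeholders and do not come from the case study.

# Minimal sketch of one Spark action in a chained workflow.
# Paths and column names are hypothetical, not taken from the case study.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main():
    spark = SparkSession.builder.appName("ad-serving-aggregation").getOrCreate()

    # Read raw impression logs from the distributed file system (e.g. HDFS).
    impressions = spark.read.parquet("hdfs:///data/raw/impressions")

    # Example transformation step: aggregate clicks per campaign.
    clicks_per_campaign = (
        impressions
        .groupBy("campaign_id")
        .agg(F.sum("clicks").alias("total_clicks"))
    )

    # Write the result back so the next action in the DAG can pick it up.
    clicks_per_campaign.write.mode("overwrite").parquet(
        "hdfs:///data/processed/clicks_per_campaign"
    )

    spark.stop()

if __name__ == "__main__":
    main()

In the evaluated workflow, actions of this kind are declared as nodes of the Oozie Directed Acyclic Graph, with the output of one action serving as the input of the next.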
Future research and development opportunities for this method include more extensive validation in real-world environments and the inclusion of more big data processing techniques. The first implementation of the method has been successfully validated, but the effectiveness and efficiency of the method can be further validated by implementing it at different companies and in different contexts, and by performing a statistical difference-making experiment.