PORTFOLIO
THESIS
Thesis Topic: Stream Processing of Financial Tick Data with In-Order Guarantees [Full Text]
Thesis Topic: Migrating State Between Jobs in Apache Spark [Full Text]
Abstract: Data related to all types of societal activity are nowadays produced and made available at high volume and velocity, in the form of data streams. This thesis focuses on financial tick data generated by stock exchanges and the need for stream-processing analytics to assist traders in identifying trading opportunities. We design and implement the Tick Analysis Platform (TAP), a streaming analytics application that performs event aggregation and complex event processing to compute trend indicators and detect patterns, enabling the identification of buy/sell opportunities for traders. Driven by the need to process streaming data as rapidly as possible, we investigate techniques for scalable stream processing with an additional guarantee, namely the in-order processing of such data based on sequencing information available in batches of incoming data. The solutions designed and implemented in this thesis, S-TAP (Single-source TAP) and P-TAP (Parallel-source TAP), progressively enhance the scalability of TAP to achieve high performance on a cluster of multi-core servers while ensuring the accuracy of results via the in-order guarantees. An additional challenge investigated in this thesis is efficient fault-tolerance mechanisms that achieve low downtime during the recovery of data-analysis jobs. This is achieved by aligning the deployment of recovery tasks with the location of externally stored checkpoint replicas, taking advantage of data locality where possible. The solutions implemented and demonstrated in this thesis advance the state of the art in scalable streaming analytics of financial tick data that are also rapidly recoverable in the face of failures.
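The in-order guarantee described in the abstract can be illustrated with a generic reorder buffer that releases ticks strictly by sequence number. This is an illustrative sketch only, not code from S-TAP/P-TAP; the class and method names are assumptions.

```python
import heapq


class ReorderBuffer:
    """Release items to downstream processing strictly in sequence order.

    Incoming batches may interleave out of order; items are held in a
    min-heap and emitted only once every earlier sequence number has
    arrived (a generic sketch of an in-order guarantee).
    """

    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.heap = []

    def push(self, seq, item):
        """Buffer (seq, item); return the items that are now releasable, in order."""
        heapq.heappush(self.heap, (seq, item))
        released = []
        while self.heap and self.heap[0][0] == self.next_seq:
            released.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return released
```

Pushing sequence 1 before sequence 0 releases nothing; once 0 arrives, both are emitted together and in order.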
Abstract: Nowadays, data is being generated at an unprecedented rate and impacts every aspect of our everyday lives. As this amount increases, more and more organizations adopt techniques to handle that data in real time and evolve their business strategy. One critical challenge is ensuring the fault tolerance and high availability of data. At times, the heterogeneous systems responsible for data processing must interrupt their operation to update their infrastructure; in other cases, system failures occur. Migration techniques that prevent data loss are therefore becoming increasingly important. In this thesis, we propose a state migration algorithm implemented on Apache Spark's Structured Streaming API. This powerful API offers a fast, scalable solution for processing complex workloads and ensures fault tolerance through its checkpointing mechanism. The algorithm transfers state between different jobs and covers various scenarios in which users might wish to split, merge, or remotely deploy the workflows of each job with no data loss. In that way, users have complete control over workflow operators and can influence their execution at will. Additionally, to validate our implementation, we used the RapidMiner Studio workflow designer to present complete and detailed test cases for the scenarios mentioned above.
SURVEY
Working with Big Data on Cloud: A Survey [Full Text]
Abstract: Nowadays, data is being generated at an unprecedented rate and greatly impacts our lives by providing useful insights to the organizations that shape our everyday experiences. As this amount increases, more and more organizations incorporate tools to efficiently handle the analysis of these data. Tools for processing large volumes of data are known as Big Data Frameworks; many such frameworks exist, offering different features and solutions for each task at hand. Cloud computing has proven to combine well with Big Data Frameworks, as it delivers a reliable, fault-tolerant, available, and scalable environment in which organizations can perform their analyses. Although this pairing presents many advantages, there are still open challenges that raise concerns. This paper provides an overview of big data and cloud computing, describes the relationship between them, and discusses the challenges of integrating big data with the cloud. It also presents the currently most popular Big Data Frameworks and their key differences with respect to some fundamental features.
PROJECTS
Million-Song-Dataset-Analysis-using-ML-models-on-Big-Data [Link] [Report]
Setup Cluster Scripts For Big Data Tools [Link]
Basic Query Optimizer [Link]
Portreto [Link]
Description: Music is often considered a reflection of society and is a particularly interesting topic for researchers examining the culture and values of each generation. For a human being it is relatively easy to determine whether a song belongs to a certain era, but for machines such problems are not trivial. Using the Million Song Dataset, a collection of audio features and metadata, I evaluated different classification algorithms on their ability to predict whether a song dates from before or after the year 2000, achieving a best score of 0.775 under the ROC-AUC metric.
While the key challenge in this project is to accurately determine whether songs date from before or after 2000, the format of the data also poses a significant challenge, since the raw dataset consists of files in the binary .h5 format. Processing those files, especially in a distributed environment, is non-trivial, so the implementation focuses on parsing the binary files in a distributed manner with Apache Spark, while the machine-learning part serves more as a proof of concept.
Description: Implemented an algorithm that dynamically computes a random sample for single-aggregation, multiple-group-by queries using Apache Flink and Apache Kafka. The implementation is based on the paper Random Sampling for Group-By Queries.
The suggested approach parses a stream twice and performs sampling for group-by queries. Since the stream must be parsed twice, the implementation works only on bounded streams. In the first pass, the computed metrics are the average, the standard deviation, the γi statistic for each stratum (group), and the total γ (the sum of all γi values). In the second pass, sampling is performed using the statistics precomputed in the first pass. Apache Kafka is utilized to produce and consume data between stages. The initial stream of data can be parsed from .csv files.
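As a rough illustration of the two-pass scheme, the sketch below computes per-stratum statistics in one pass and reservoir-samples each stratum in a second. The γi weight is approximated here with a Neyman-style n·σ allocation; this stand-in, along with the function names, is an assumption for illustration and not the paper's or the project's exact formula.

```python
import math
import random
from collections import defaultdict


def first_pass(rows):
    """Pass 1: per-stratum count, mean, and standard deviation."""
    sums, sq_sums, counts = defaultdict(float), defaultdict(float), defaultdict(int)
    for key, value in rows:
        sums[key] += value
        sq_sums[key] += value * value
        counts[key] += 1
    stats = {}
    for key, n in counts.items():
        mean = sums[key] / n
        var = max(sq_sums[key] / n - mean * mean, 0.0)
        stats[key] = (n, mean, math.sqrt(var))
    return stats


def second_pass(rows, stats, budget):
    """Pass 2: reservoir-sample each stratum, sized by gamma_i / gamma."""
    # Neyman-style weight n_i * sigma_i stands in for the paper's gamma_i.
    gammas = {k: n * sd for k, (n, _, sd) in stats.items()}
    total = sum(gammas.values()) or 1.0
    quotas = {k: max(1, round(budget * g / total)) for k, g in gammas.items()}
    reservoirs, seen = defaultdict(list), defaultdict(int)
    for key, value in rows:  # one more linear scan suffices
        seen[key] += 1
        k = quotas[key]
        if len(reservoirs[key]) < k:
            reservoirs[key].append(value)
        else:
            j = random.randrange(seen[key])
            if j < k:
                reservoirs[key][j] = value
    return dict(reservoirs)
```

In the actual project the two passes are Flink jobs and the intermediate statistics travel through Kafka topics between stages.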
Description: A collection of Bash scripts to automatically set up both standalone and cluster versions of Apache Flink and HDFS. The scripts are highly parameterizable via CLI parameters and also support building from source.
Description: Implemented a basic cost-based query optimizer in Scala using the Apache Spark framework.
The process begins with an SQL query as the starting point. Initially, the implemented mechanism estimates the cost of each table by counting its number of entries, with more entries indicating a higher table cost. It then identifies all potential combinations of joined tables. To find the sequence of joins with minimal cost, the problem is modeled as a graph problem: tables are treated as vertices and their joins as edges, and the objective is to trace the Hamiltonian path that represents the optimal sequence of joins.
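The graph formulation above can be sketched by brute-forcing Hamiltonian paths over the join graph. The project itself is in Scala on Spark; this is a minimal Python sketch with an assumed, simplified cost model (sum of intermediate result sizes, with result size = left size × table size × edge selectivity).

```python
from itertools import permutations


def best_join_order(table_costs, join_selectivity):
    """Brute-force the cheapest join order over the join graph.

    Tables are vertices and joinable pairs are edges, so a complete join
    order is a Hamiltonian path. `table_costs` maps table -> row count;
    `join_selectivity` maps frozenset({t1, t2}) -> selectivity.
    """
    best_order, best_cost = None, float("inf")
    for order in permutations(table_costs):
        size = table_costs[order[0]]  # rows flowing out of the first table
        cost = 0.0
        for prev, nxt in zip(order, order[1:]):
            edge = frozenset((prev, nxt))
            if edge not in join_selectivity:  # not a path in the join graph
                break
            size *= table_costs[nxt] * join_selectivity[edge]
            cost += size  # accumulate intermediate result sizes
        else:  # only reached when every consecutive pair was joinable
            if cost < best_cost:
                best_order, best_cost = order, cost
    return best_order, best_cost
```

Exhaustive enumeration is exponential in the number of tables, which is acceptable for a basic optimizer over a handful of relations; production optimizers prune with dynamic programming instead.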
Description: Implemented the Smith–Waterman (SW) algorithm, a dynamic programming algorithm designed to determine the optimal alignment between two sequences by specifying penalties for mismatches and gaps. Initially, I implemented the SW algorithm in C, then enhanced its performance through parallel execution models, leveraging POSIX threads (pthreads) and the OpenMP Application Programming Interface (API). I explored varying levels of granularity, employing both fine-grained and coarse-grained parallelization techniques to accelerate the algorithm.
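The Smith–Waterman recurrence described above can be written compactly; the following is an illustrative Python sketch of the standard scoring recurrence (the project's actual implementation is in C), with assumed default scores of +2/-1/-1 for match/mismatch/gap.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the Smith-Waterman local alignment score of strings a and b.

    DP recurrence, with H zero-initialized on the borders:
        H[i][j] = max(0,
                      H[i-1][j-1] + (match or mismatch),
                      H[i-1][j] + gap,
                      H[i][j-1] + gap)
    """
    cols = len(b) + 1
    prev = [0] * cols  # only the previous DP row is kept
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on the same anti-diagonal are independent; this wavefront structure is what parallel implementations such as the pthreads/OpenMP versions typically exploit.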
Description: Portreto is a distributed, image-oriented web application implemented in a multitier architecture.
The primary goal of this project is to build a distributed application with no single point of failure. To this end, the design involves stateless microservices running in Docker containers. Microservices are written in Python 3.7 using Django, and Docker Swarm is utilized for replication and load-balancing purposes. The microservices used are: a Web service (presentation layer), an Authentication service (provides valid authentication tokens so the application can remain stateless), an Application service (logic tier), a Storage service (holds users' photos), a Database (stores user information, e.g. usernames, emails, etc.), and ZooKeeper.
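The stateless-authentication idea can be sketched with self-contained signed tokens: any replica can verify a token without consulting shared session storage. This is a generic illustration using Python's standard library, not Portreto's actual Django code; the secret, function names, and claim layout are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared key; in practice each Authentication-service replica
# would load it from configuration rather than hard-code it.
SECRET = b"demo-secret"


def issue_token(username, ttl=3600, now=None):
    """Return a signed token any service replica can verify statelessly."""
    now = int(time.time()) if now is None else now
    payload = base64.urlsafe_b64encode(
        json.dumps({"sub": username, "exp": now + ttl}).encode()
    ).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_token(token, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = int(time.time()) if now is None else now
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):  # reject tampered tokens
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims if claims["exp"] > now else None
```

Because verification needs only the shared key, tokens work across Docker Swarm replicas with no session affinity, which is what keeps the services stateless and free of a single point of failure.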