In the new era where IoT, big data, and analytics meet on the cloud, enormous amounts of IoT-generated data are streamed to and stored on the cloud for continuous analysis by analytic workloads. The throughput and volume of this data processing, combined with the scale and dynamic nature of cloud resources, create new challenges as well as new opportunities. This is where our work is focused.

We are building the 'pipes' through which data is streamed from source to destination (e.g. from IoT frameworks to object storage, or from storage to analytics applications), making them faster and smarter by optimizing both streaming velocity and the data transformations that make the data analytics-ready.

We work to improve the performance of analytic workloads on the cloud (such as those using Apache Spark) so they better exploit the scale and dynamic nature of the cloud and the data. Cloud operations at scale generate new types of problems that impact continuous availability and are difficult to debug; to help with this, we are developing a diagnostic framework that integrates root cause analysis with first-aid recovery operations, at scale. Another important aspect is adapting to the new economic models associated with cloud resources - we are working on cost optimization and a smart cloud brokering service.

Our major activities include:

Manager

Ofer Biran, Manager, Cloud and Data Technologies, IBM Research - Haifa


Past Activities

Spark File Filters

We developed File Filters that extend the existing Spark SQL partitioning mechanism to dynamically filter out irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.). Unlike pushdown with compatible formats such as Parquet, which requires touching each object to determine its relevance, we don't access irrelevant objects at all. This is essential for object storage, where every REST request incurs significant overhead and financial cost. Our pluggable interface for developing and deploying File Filters is easy to use. One example filter that we implemented selects objects according to their metadata, as illustrated below.
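
The sketch below is illustrative only - the class and method names are hypothetical and do not reflect the actual File Filter interface - but it shows the core idea: relevance is decided from object metadata alone, so irrelevant objects never incur a REST request.

    # Illustrative sketch: the class and method names below are hypothetical,
    # not the actual File Filter interface.

    class MetadataFileFilter:
        """Decide object relevance from object-store metadata, without reading data."""

        def __init__(self, required_metadata):
            # e.g. {"sensor-type": "temperature"}
            self.required_metadata = required_metadata

        def accept(self, object_metadata):
            # Keep the object only if every required key/value pair matches.
            return all(object_metadata.get(k) == v
                       for k, v in self.required_metadata.items())

    # Hypothetical usage: prune the candidate object list before Spark plans the scan.
    candidates = [
        {"name": "obs/part-0001.parquet", "sensor-type": "temperature"},
        {"name": "obs/part-0002.parquet", "sensor-type": "humidity"},
    ]
    flt = MetadataFileFilter({"sensor-type": "temperature"})
    relevant = [o["name"] for o in candidates if flt.accept(o)]
    print(relevant)  # part-0002 is never selected for reading, so it never incurs a GET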


Motivation

The Internet of Things is swamping our planet with data, while IoT use cases drive demand for analytics on extremely scalable, low-cost storage. Enter Spark SQL over Object Storage (e.g. Amazon S3, IBM Cloud Object Storage) - highly scalable, low-cost storage that provides RESTful APIs to store and retrieve objects and their metadata.

In our world of separately managed micro-services, limiting the flow of data from object storage to Spark for any given query is critical to avoid shipping huge datasets. Existing techniques involve partitioning the data and using specialized formats, such as Parquet, that support pushdown. Partitioning can be inflexible because of its static nature - only one partitioning hierarchy is possible, and it can't be changed without rewriting the entire dataset. File Filters for Spark is changing all that.
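
For contrast, here is a minimal PySpark sketch of the existing static-partitioning approach (the bucket URI, layout, and column names are placeholders): only filters on the partitioning column avoid touching irrelevant objects.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning-example").getOrCreate()

    # Assumes a Hive-style layout such as .../dt=2017-06-01/part-*.parquet;
    # the bucket and column names are placeholders.
    df = spark.read.parquet("s3a://iot-archive/observations")

    pruned = df.filter(df.dt == "2017-06-01")        # partition column: irrelevant objects are skipped
    scanned = df.filter(df.deviceId == "device-42")  # not a partition column: every object is touched

    print(pruned.count(), scanned.count())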

Contact

Paula Ta-Shma

A Bridge from Message Hub to Object Storage

Message Hub is IBM Bluemix’s hosted Kafka service, enabling application developers to leave the work of running and maintaining a Kafka cluster to messaging experts and focus on their application logic. We extended Message Hub with a bridge from Message Hub to the Bluemix Object Storage service, which persists a topic’s messages as analytics-ready objects.

With the Object Storage bridge, data pipelines from Message Hub to Object Storage can be easily set up and managed to generate Apache Spark-conformant data, which can be analyzed directly in the IBM Data Science Experience using Spark as a Service.
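
As a sketch, the landed objects can be queried directly with Spark SQL from a notebook; the URI scheme, container, and object prefix below are placeholders that depend on how the bridge and the notebook's Object Storage credentials are configured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bridge-output-analytics").getOrCreate()

    # Placeholder URI: the actual scheme, container and object prefix are
    # determined by the bridge configuration and the Object Storage credentials.
    events = spark.read.json("swift://iot-landing.keystone/device-events/*")

    events.printSchema()
    events.groupBy("deviceType").count().show()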

A typical IoT data pipeline is shown below where IoT device data is collected by the Watson IoT Platform, and then streamed to Object Storage via Message Hub and the Object Storage bridge. This data is then analyzed using Spark. A use case scenario demonstrating this kind of architecture and the use of the Object Storage bridge is described in Paula Ta-Shma’s blog.

Contact

Paula Ta-Shma

Stocator

Stocator (Apache License 2.0) is a unique object store connector for Apache Spark. Stocator is written in Java and implements an HDFS interface to simplify its integration with Spark. There is no need to modify Spark's code to use the new connector: it can be compiled with Spark via a Maven build, or provided as an input package without re-compiling Spark.

Stocator works smoothly with the Hadoop MapReduce Client Core module, but takes an object store approach. Apache Spark continues working with Hadoop MapReduce Client Core, and Spark can use the various existing Hadoop output committers. Stocator intercepts the internal requests and adapts them to object store semantics before they reach the object store.

For example, when Stocator receives a request to create a temporary object or directory, it creates the final object instead of a temporary one. Stocator also uses streaming to upload objects. Uploading an object via streaming, without knowing its content length in advance, eliminates the need to create the object locally, calculate its length, and only then upload it to the object store. Swift is unique in its ability to support streaming, as compared to other object store solutions available on the market.
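
A minimal PySpark sketch of pointing a job at Stocator follows; the container and service names are placeholders, and the credential properties (documented in the project README) are omitted.

    from pyspark.sql import SparkSession

    # Register Stocator as the handler for the swift2d:// scheme. Container and
    # service names below are placeholders; authentication properties are
    # documented in the Stocator README and omitted here.
    spark = (SparkSession.builder
             .appName("stocator-example")
             .config("spark.hadoop.fs.swift2d.impl",
                     "com.ibm.stocator.fs.ObjectStoreFileSystem")
             .getOrCreate())

    df = spark.read.json("swift2d://my-container.myservice/events/*")

    # Writes go straight to the final object names: no temporary objects, no rename.
    df.write.parquet("swift2d://my-container.myservice/events-parquet")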

For more info on Stocator visit https://github.com/SparkTC/stocator

Contact

Gil Vernik

Seamless Storage Driver

Abstracting the storage layers from data scientists and developers speeds up the development process, shortens the learning curve and generates more focused code. We developed the seamless storage driver, an Apache Spark driver that abstracts away the gap between real-time and historical data in the Watson IoT Platform. It enables transparent analytics over all data using native Spark SQL. The driver optimizes data access and supports querying data as it migrates between storage systems, without data loss or duplication. An application developer can then access the entire dataset without being aware of where its various parts are stored. For example, recent data might be stored in various Cloudant databases and older data on IBM Cloud Object Storage.
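
The driver's actual data source name and options are not reproduced here; the following hypothetical PySpark sketch only illustrates the kind of transparent access it enables.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("seamless-driver-sketch").getOrCreate()

    # Hypothetical data source identifier and option key, used for illustration only.
    readings = (spark.read
                .format("org.example.seamless")    # placeholder
                .option("deviceType", "elevator")  # placeholder
                .load())

    # The same query runs regardless of whether rows currently live in Cloudant
    # (recent data) or in IBM Cloud Object Storage (historical data).
    readings.createOrReplaceTempView("readings")
    spark.sql("SELECT deviceId, avg(temperature) FROM readings GROUP BY deviceId").show()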

More information is available in the following tutorial.

Serverless Data Pipelines for IoT

We are applying the capabilities of a 'serverless' platform, OpenWhisk, to the domain of IoT. We demonstrate interesting use cases of in-flight transformation of data ingested from IoT devices, as part of a larger pipeline involving additional data services, as depicted in the following picture:


Example of IoT pipeline in IBM Bluemix including OpenWhisk transformation

You can read more about this in the following blog.
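
As a minimal sketch, such an in-flight transformation can be written as an OpenWhisk Python action; the payload fields below (a Fahrenheit temperature and a device id) are assumptions for illustration.

    # OpenWhisk Python action: minimal in-flight transformation sketch.
    # The incoming payload fields are assumptions for illustration.

    def main(params):
        reading = params.get("reading", {})
        fahrenheit = reading.get("temperatureF")
        celsius = round((fahrenheit - 32) * 5.0 / 9.0, 2) if fahrenheit is not None else None
        # The returned dictionary is handed to the next stage of the pipeline,
        # e.g. an action that appends it to Object Storage.
        return {"deviceId": reading.get("deviceId"), "temperatureC": celsius}

Such an action is deployed with the standard OpenWhisk CLI (wsk action create) and wired into the pipeline via triggers and rules.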

We are also exploring approaches to push 'serverless' technologies to the cloud edge, closer to the IoT devices, by extending OpenWhisk to work with open technologies such as Node-RED, Docker and resin.io. This enables 'Edge Analytics' scenarios, addressing the need to process data close to its source.


Example architecture of IoT gateway running Node-RED and OpenWhisk actions

You can read more about this in the following blog.

Contact

Alex Glikson

Spark Performance / Test Harness

Apache Spark is a fascinating big data platform that combines ease of use for developers and admins with computational performance. We focus on improving the throughput of Spark-as-a-service in scenarios involving cluster sharing and multi-tenancy. We explore adaptations of cluster scheduling techniques to Spark, such as malleable/moldable workload scheduling, back-filling, and others. There are many interesting challenges and trade-offs involved, such as balancing per-workload latency/turnaround guarantees against total throughput.


For the purposes of our research, we created Test Harness (TH), a unique tool for experimenting with multi-tenant Spark service performance. It allows designing experiments that encompass all the factors of a multi-tenancy setting: tenants, SLAs, workloads, datasets, and most importantly - the arrival process (trace) by which workloads and datasets are submitted to the service by tenants. TH then allows executing the experiments, collecting the Spark and infrastructure metrics in an orderly fashion, and reasoning about them - first through high-level goal metrics (scores) and then by drilling down with the bundled tools.
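
TH's actual experiment format is not reproduced here; the hypothetical Python description below only illustrates how a single experiment bundles all of these factors, including the arrival trace.

    # Hypothetical experiment description, for illustration only: TH's real input
    # format is not shown here. One experiment bundles tenants, SLAs, workloads,
    # datasets, and the arrival trace.

    experiment = {
        "tenants": [
            {"name": "tenant-a", "sla": {"max_turnaround_s": 600}},
            {"name": "tenant-b", "sla": {"max_turnaround_s": 1800}},
        ],
        "datasets": ["tpcds-100gb", "iot-logs-30d"],
        "workloads": [
            {"name": "q42", "dataset": "tpcds-100gb", "kind": "spark-sql"},
            {"name": "sessionize", "dataset": "iot-logs-30d", "kind": "spark-batch"},
        ],
        # Arrival trace: (time offset in seconds, tenant, workload) tuples.
        "trace": [
            (0, "tenant-a", "q42"),
            (30, "tenant-b", "sessionize"),
            (45, "tenant-a", "q42"),
        ],
    }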

TH serves several important use cases, including analyzing service performance under particular scenarios and managing performance regression tests for a Spark service under development. Finally, being independent of any particular service architecture, TH can be used for competitive analysis between different Spark services, using the same set of experiments.


Contact

Erez Hadad

ACCO - Advanced Cloud Cost Optimizer

ACCO is a multi-cloud decision support tool that taps into customer cloud usage data and continuously produces recommendations on how to decrease the operational costs of elastic cloud workloads.

One particularly important issue in optimizing the bottom line for cloud customers is smart use of the sustained usage discounts that virtually all cloud providers offer, in one form or another, as an incentive for customers to indicate their anticipated future use. Usually, sustained usage discounts are offered in exchange for a longer-term commitment: the customer pays for full utilization of the committed services and resources, irrespective of how they are actually used later. Obviously, taking advantage of sustained usage discounts requires some means of forecasting future usage. Moreover, the chargeback schemes employed by different providers vary significantly, so optimizing costs in a multi-cloud setting is far from trivial.

ACCO uses customers' historic billing information to forecast future use with high fidelity. It applies optimization techniques to generate recommendations for cost savings with respect to the providers' sustained usage discount schemes.
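
As a toy illustration of the kind of decision ACCO automates (the prices, discount rate, and usage forecast below are invented), consider choosing a commitment level that minimizes cost under a simple sustained-usage scheme.

    # Toy sketch only: prices, discount and forecast numbers are invented.

    HOURS_PER_MONTH = 730
    ON_DEMAND_RATE = 0.10       # $/instance-hour, illustrative
    COMMITTED_DISCOUNT = 0.30   # 30% off in exchange for paying for full utilization

    # Forecast of average concurrently running instances over the next 12 months,
    # e.g. derived from historic billing data.
    forecast = [40, 42, 45, 50, 48, 46, 44, 47, 52, 55, 53, 50]

    def on_demand_cost(usage):
        return sum(u * HOURS_PER_MONTH * ON_DEMAND_RATE for u in usage)

    def committed_cost(usage, committed):
        committed_rate = ON_DEMAND_RATE * (1 - COMMITTED_DISCOUNT)
        cost = 0.0
        for u in usage:
            # Pay for the full commitment whether used or not, plus on-demand overflow.
            cost += committed * HOURS_PER_MONTH * committed_rate
            cost += max(u - committed, 0) * HOURS_PER_MONTH * ON_DEMAND_RATE
        return cost

    best = min(range(0, 61), key=lambda c: committed_cost(forecast, c))
    print("best commitment:", best, "instances")
    print("savings vs. on-demand: $%.0f"
          % (on_demand_cost(forecast) - committed_cost(forecast, best)))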

A prototype implementation of ACCO exists, and we welcome partners to perform PoC studies.


Contact

David Breitgand

Industry 4.0 Manufacture Cloud for South Africa

Manufacturing is becoming programmable end-to-end across the entire value chain: from sourcing suppliers of designs, parts, and raw materials; to electronically submitting orders that can be automatically processed on the manufacturer side; to assembly; and to reconfiguring supply chain relationships on demand. Future manufacturing will be software defined, implying a rich ecosystem of services, systems, and manufacturing machines interconnected into a global manufacturing fabric, analogous to how information services are interconnected today on the Internet. Collectively, these services will facilitate Software Defined Manufacturing (SDM), opening a plethora of new business opportunities. We focus on developing a robust, cost-efficient, secure, and extensible cloud-based Manufacturing as a Service (MaaS) ICT platform for B2B collaborative manufacturing. We build upon recent advances in Factories of the Future and take them to the next level of performance, security, dependability, and cost-efficiency.

It is of paramount importance to extend the benefits of the "fourth industrial revolution" to emerging markets. On the one hand, these markets have the potential to leapfrog; on the other hand, if they do not act fast, systematically, and decisively, they risk losing manufacturing jobs that constitute an important pillar of their economies. Guided by our SDM vision, we are helping foster collaboration between the South African government's Department of Trade and Industry (DTI) and IBM South Africa, with the goal of a public-private partnership to develop a Software Defined Manufacturing platform that will bring Black Industrialist SMEs into the era of Industry 4.0.


Contact

David Breitgand

Design of HPC Hybrid Cloud for Helix Nebula

IBM has passed the first phase of Helix Nebula - the Science Cloud (HNSciCloud) Pre-Commercial Procurement (PCP) tender, issued by CERN (the European Organization for Nuclear Research) on behalf of ten key European scientific research centers, including CNRS (Centre National de la Recherche Scientifique) and DESY (Deutsches Elektronen-Synchrotron).

The objective of HNSciCloud (http://www.helix-nebula.eu/) is the establishment of a European hybrid cloud platform to support the deployment of high-performance computing and big-data capabilities for scientific research. We were part of the technical team that prepared the bid proposal, contributing our expertise on all aspects of cloud infrastructure, and subsequently we led a group of technical experts during the design phase itself.

Read more here about Building an open seamless science cloud.


Ezra Silvera at the bid winner's ceremony in Lyon


Contact

Ezra Silvera