Data analytics and machine learning are spreading into just about every industry, and the tasks are becoming ever more complex. Larger datasets and more systems designed for AI research are fantastic, but as these workflows become more involved, researchers are spending more time configuring their setups than getting data science done.
A new project from our team, presented at the 2021 Ray Summit, aims to help.
Today, we’re announcing CodeFlare, an open-source framework for simplifying the integration and efficient scaling of big data and AI workflows onto the hybrid cloud. CodeFlare is built on top of Ray, an emerging open-source distributed computing framework for machine learning applications. CodeFlare extends the capabilities of Ray by adding specific elements to make scaling workflows easier.
To create a machine learning model today, researchers and developers first have to train and optimize it, which might involve data cleaning, feature extraction, and model optimization. CodeFlare simplifies this process with a Python-based interface for what’s called a pipeline, making it simpler to integrate, parallelize, and share data. The goal of our new framework is to unify pipeline workflows across multiple platforms without requiring data scientists to learn a new workflow language.
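To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It does not use CodeFlare or Ray, and every name in it (`clean`, `extract_features`, `score`, `run_pipeline`, `best_scale`) is a hypothetical stand-in invented for illustration. It only shows the general shape: a few steps chained together, then run in parallel across several candidate settings.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-ins for pipeline steps; none of these names come from CodeFlare.

def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if r is not None]

def extract_features(rows, scale):
    """Feature extraction: rescale each value by a tunable factor."""
    return [r * scale for r in rows]

def score(features):
    """Toy substitute for training and evaluating a model (lower is better)."""
    return abs(mean(features) - 10.0)

def run_pipeline(rows, scale):
    """Chain the three steps into a single pipeline run."""
    return score(extract_features(clean(rows), scale))

def best_scale(rows, scales):
    """Run one pipeline per candidate setting in parallel and keep the best."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: run_pipeline(rows, s), scales))
    # Pair each score with its setting and pick the lowest score.
    return min(zip(results, scales))[1]

raw = [1.0, None, 2.0, 3.0, None, 4.0]
print(best_scale(raw, [1.0, 2.0, 4.0, 8.0]))  # prints 4.0
```

A local thread pool stands in here for a distributed backend. In a real deployment the same chain of steps would be handed to a framework that spreads the runs across a cluster, which is the part a pipeline interface is meant to take off the data scientist’s plate.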
CodeFlare pipelines run with ease on IBM’s new serverless platform, IBM Cloud Code Engine, and on Red Hat OpenShift. Users can deploy them just about anywhere, extending the benefits of serverless to data scientists and AI researchers. CodeFlare also makes it easier to integrate and bridge with other cloud-native ecosystems by providing adapters for event triggers (such as the arrival of a new file) and by loading and partitioning data from a wide range of sources, such as cloud object storage, data lakes, and distributed filesystems.
CodeFlare should also mean developers don’t have to duplicate their efforts or struggle to figure out what colleagues did in the past to get a certain pipeline to run. With CodeFlare, we aim to give data scientists richer tools and APIs that they can use with more consistency, allowing them to focus more on their actual research than on configuration and deployment complexity.
We expect our framework to save developers significant time and effort in creating pipelines deployed to hybrid cloud.
And we are already seeing it. For example, when one user applied the framework to analyze and optimize approximately 100,000 pipelines for training machine learning models, CodeFlare cut the time it took to execute each pipeline from 4 hours to 15 minutes. With other users, we’ve seen CodeFlare shave off months of developer time, and allow them to tackle larger data problems than before.
We’re open-sourcing CodeFlare and publishing a series of technical blog posts on how it works and what you need to get started using it. And this is just the start of where we plan to go with CodeFlare. We’ve started applying this tech to things we’re building at IBM, in our own AI research. We’ll continue to evolve CodeFlare to support increasingly complex pipelines. We’re planning to provide enhanced fault tolerance and consistency, improve integration and data management for external sources, and add support for pipeline visualization.
We can’t wait to see how the community uses CodeFlare for their projects, and how it gives them more time to do what they actually love doing: the data science.