Deep Dive Sessions Data Operations & Engineering Archive

This deep dive examines how the somewhat elitist "playing around" of data scientists with machine learning models becomes a productive and stable application for everyday business. Using three real-life case studies from the speakers' consulting experience, it shows various approaches and technical components that enable the deployment of ML models. First, you will learn how a sales forecasting model for a delivery service, created in KNIME, was implemented with KNIME Server. Second, a credit scoring model created in R is put into production in a Databricks / Azure cloud environment. How would that have looked with the Cloudera Data Science Workbench in an on-premises Hadoop environment instead? Finally, it shows how a fraud detection model written in Python was deployed as a web service using open-source components (Flask, Kubernetes, Docker). The pros, cons, and hidden pitfalls are outlined, going beyond the colorful presentations of software vendors.

Everyone is hearing about machine learning (ML) and artificial intelligence (AI) while you are the one building the models. You spend weeks or months prototyping something, and when things "are done", the result needs to be deployed in production ASAP. And that is just the tip of the iceberg. We use ML models when we need to find patterns without explicitly programming machines to do so. Data scientists usually do not have a software engineering background, testing ML is tricky, and there are many other problems related to ML in production. ML components can draw from the same well as the DevOps movement. Do we need to talk about CI/CD for ML? Yes, please, but we also need to talk about Continuous Evaluation! How can we test and debug ML? Creating a safe environment for data scientists is important, but why exactly? How can we package, deploy, and serve ML models? By the end of this talk, you will understand more about ML lifecycles and the AI hype, feel more comfortable answering these questions, and be able to help your organization move faster. Thiago also promises an ML testing and building demo… may the demo gods help us!
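One way to picture the "Continuous Evaluation" idea raised in this abstract is a gate in the delivery pipeline that scores a candidate model on held-out data and promotes it only if it does not fall behind the model currently in production. The following is a minimal sketch under stated assumptions, not material from the talk: the function names, the choice of accuracy as the metric, and the `min_gain` tolerance are all hypothetical.

```python
from typing import Callable, Sequence

# Assumption: a "model" is any callable mapping a feature vector to a label.
Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, features: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    if not labels:
        raise ValueError("evaluation set must not be empty")
    hits = sum(1 for x, y in zip(features, labels) if model(x) == y)
    return hits / len(labels)


def should_promote(candidate: Model, production: Model,
                   features: Sequence[Sequence[float]], labels: Sequence[int],
                   min_gain: float = 0.0) -> bool:
    """Hypothetical gate: promote only if the candidate beats production by at least min_gain."""
    gain = accuracy(candidate, features, labels) - accuracy(production, features, labels)
    return gain >= min_gain


# Toy held-out set: the label is 1 when the single feature exceeds 0.5.
X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 0, 1, 1]


def production_model(x: Sequence[float]) -> int:
    return 1  # naive baseline: always predicts 1


def candidate_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


print(should_promote(candidate_model, production_model, X, y))
```

In a real pipeline the same check would run on every retraining, with the metric and threshold chosen to match the business KPI rather than plain accuracy.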

Data science is rapidly becoming the primary catalyst for product innovation. However, most projects remain stuck in the proof-of-concept (POC) phase. Christian and René had the chance to be part of GfK's journey from a traditional market research company to a prescriptive data analytics provider. To build end-to-end data-driven products successfully, it is necessary to blend what existing frameworks like Scrum and CRISP provide with best practices from software engineering. You will learn how they gradually established a data science development lifecycle that overcomes the POC trap by considering production realities from day one. Leveraging core concepts like KPI-driven development and microservices, they are able to successfully develop, deploy, scale, and maintain data science models in production.

In the exploration phase of agile projects, different solution paths are spiked to get an idea of which one would be the most suitable. However, by timeboxing the effort put into each path, you can easily fall into the "greedy" trap of choosing the easiest approach. More suitable technologies or models that would be too time-consuming to fit into the timebox are never considered.

This talk is about reusing the results and models of different approaches across projects and teams to avoid this trap and speed up the exploration phase. This can be achieved by offering former spike implementations as a service behind an API. If those services are dockerized, tagged, documented, and thus preserved, future exploration phases can be shortened by building on the results of existing models.
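As a rough illustration of putting a former spike implementation behind an API, the sketch below wraps a stand-in model in a small JSON endpoint using only the Python standard library. It is not the speakers' implementation: `spike_model`, `MODEL_TAG`, the request shape (`{"features": [...]}`), and the port are all hypothetical choices made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_TAG = "spike-classifier:0.1"  # hypothetical tag, for illustration only


def spike_model(features):
    """Stand-in for a preserved spike implementation (assumption): thresholds the mean."""
    return "positive" if sum(features) / len(features) > 0.5 else "negative"


def handle_predict(raw_body: bytes):
    """Turn a JSON request body into an (HTTP status, response dict) pair."""
    try:
        features = json.loads(raw_body)["features"]
        if not features or not all(isinstance(v, (int, float)) for v in features):
            raise ValueError("features must be a non-empty list of numbers")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key, or a wrong type all map to a client error.
        return 400, {"error": str(exc)}
    return 200, {"model": MODEL_TAG, "label": spike_model(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080):
    """Blocks and serves predictions; in a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Returning the model tag with every prediction is one simple way to keep responses traceable to a specific preserved spike.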

As an example, the talk showcases a machine learning classification problem that uses the images on a website rather than its text in order to circumvent multi-language problems. It covers the bootstrap phase of gathering and pre-processing image data, applying transfer learning to the collected data, and offering the resulting neural network via an endpoint.