Ayon Roy

Oct 12, 2020 • WeAreDevelopers LIVE

PySpark - Combining Machine Learning & Big Data

How do you apply machine learning when your dataset is too big for a single machine? Discover PySpark's powerful, distributed ML pipelines.

#1 about 3 minutes

Combining big data and machine learning for business insights

The exponential growth of data necessitates combining big data processing with machine learning to personalize user experiences and drive revenue.

#2 about 3 minutes

An introduction to the Apache Spark analytics engine

Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs and specialized libraries like Spark SQL and MLlib.

#3 about 4 minutes

Understanding Spark's core data APIs and abstractions

Spark's data abstractions evolved from the low-level Resilient Distributed Dataset (RDD) to the more optimized and user-friendly DataFrame and Dataset APIs.

#4 about 11 minutes

How the Spark cluster architecture enables parallel processing

In Spark's architecture, a driver program obtains resources from a cluster manager and schedules tasks on multiple worker nodes, whose executors process the data in parallel.
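
The division of labour between driver and executors can be sketched in plain Python. This is only an analogy under stated assumptions, not Spark code: a thread pool stands in for the executors, and the names `split_into_partitions`, `run_task`, and `driver` are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, similar to how a dataset is partitioned."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """The work one 'executor' does on one partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def driver(data, num_executors=4):
    """Play the driver's role: create one task per partition, hand the tasks
    to a pool of workers, then combine the partial results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


# Sum of squares of 1..100, computed as four partial sums.
total = driver(list(range(1, 101)))
print(total)
```

In a real cluster the executors are separate processes on the worker nodes rather than threads in one process, and the cluster manager decides where they run.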

#5 about 5 minutes

Using Python with Spark through the PySpark library

PySpark provides a Python API for Spark, using the Py4J library to communicate between the Python process and Spark's core JVM environment.

#6 about 5 minutes

Exploring the key features of the Spark MLlib library

Spark's MLlib offers a comprehensive toolkit for machine learning, including pre-built algorithms, featurization tools, pipelines for workflow management, and model persistence.

#7 about 4 minutes

The standard workflow for machine learning in PySpark

A typical machine learning workflow in Spark involves using DataFrames, applying Transformers for feature engineering, training a model with an Estimator, and orchestrating these steps with a Pipeline.
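
The Transformer, Estimator, and Pipeline roles can be illustrated with a small plain-Python sketch. It is not the PySpark API: lists of dicts stand in for DataFrames, and every class here (`Tokenizer`, `KeywordCounter`, `ThresholdEstimator`, `ThresholdModel`, `Pipeline`, `PipelineModel`) is a simplified stand-in written for this illustration, with a deliberately trivial "model".

```python
class Tokenizer:
    """Transformer: adds a 'words' column with lowercase tokens."""
    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class KeywordCounter:
    """Transformer: adds a numeric 'feature' column counting one keyword."""
    def __init__(self, keyword):
        self.keyword = keyword

    def transform(self, rows):
        return [{**row, "feature": row["words"].count(self.keyword)} for row in rows]


class ThresholdModel:
    """Fitted model: itself a Transformer that adds a 'prediction' column."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**row, "prediction": int(row["feature"] >= self.threshold)} for row in rows]


class ThresholdEstimator:
    """Estimator: fit() learns a threshold halfway between the class means."""
    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1]
        neg = [r["feature"] for r in rows if r["label"] == 0]
        return ThresholdModel((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)


class PipelineModel:
    """Result of fitting a Pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fit() fits each Estimator on the data transformed so far."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: replace it with its model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "Spark spark spark is fast", "label": 1},
    {"text": "I love Spark and more spark", "label": 1},
    {"text": "Cats sleep all day", "label": 0},
    {"text": "One spark of hope", "label": 0},
]
model = Pipeline([Tokenizer(), KeywordCounter("spark"), ThresholdEstimator()]).fit(train)
predictions = model.transform([{"text": "spark spark everywhere"}, {"text": "nothing here"}])
print([row["prediction"] for row in predictions])
```

In PySpark itself the corresponding building blocks live in the `pyspark.ml` package, for example `pyspark.ml.Pipeline`, `pyspark.ml.feature.Tokenizer`, and `pyspark.ml.classification.LogisticRegression`, and they operate on Spark DataFrames instead of Python lists.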

#8 about 3 minutes

Pre-built algorithms and utilities available in Spark MLlib

MLlib includes a variety of common, pre-built algorithms for classification, regression, and clustering, such as logistic regression, SVM, and K-means clustering.
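
To give a feel for one of the algorithms named here, the following is a minimal plain-Python sketch of the basic K-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation, which distributes this work across a cluster; the function name `kmeans`, the sample points, and the starting centers are assumptions made up for the example.

```python
import math


def kmeans(points, centers, iterations=10):
    """Basic K-means: alternate between assigning points and updating centers."""
    for _ in range(iterations):
        # Assignment step: group every point under its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

With these two well-separated groups, the centers settle on the mean of each group after the first iteration.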

Overview of the data and machine learning tech stack

01:29 MIN

Empowering Retail Through Applied Machine Learning

Q&A: Raw data formats and comparing dbt to Spark

01:29 MIN

Enjoying SQL data pipelines with dbt

Q&A on parallel computing, data versioning, and security

05:53 MIN

DevOps for Machine Learning

The production architecture and technology stack for AML AI

03:27 MIN

Detecting Money Laundering with AI

Presenting live web scraping demos at a developer conference

01:57 MIN

Tech with Tim at WeAreDevelopers World Congress 2024

Key takeaways for modern data processing

01:31 MIN

Convert batch code into streaming with Python

Going beyond standard aggregations in Spark

01:46 MIN

Let's Get Aggregated: Custom UDAFs in Spark

Comparing methods for machine learning with databases

02:59 MIN

Using WebAssembly for in-database Machine Learning

Overview of Machine Learning in Python

Adrian Schmitt

about 2 years ago • WeAreDevelopers LIVE

Alibaba Big Data and Machine Learning Technology

Dr. Qiyang Duan

about 5 years ago • WeAreDevelopers LIVE

Data Science in Retail

Julian Joseph

about 3 years ago • WeAreDevelopers LIVE

Fully Orchestrating Databricks from Airflow

Alan Mazankiewicz

about 4 years ago • WeAreDevelopers LIVE

Convert batch code into streaming with Python

Bobur Umurzokov

about 2 years ago • WeAreDevelopers LIVE

Python-Based Data Streaming Pipelines Within Minutes

Bobur Umurzokov

about a year ago • WeAreDevelopers LIVE

Introduction to Azure Machine Learning

Jose Luis Latorre Millas

about 4 years ago • WeAreDevelopers LIVE

Detecting Money Laundering with AI

Stefan Donsa & Lukas Alber

about 5 years ago • WeAreDevelopers LIVE

Related Articles

Chris Heilmann

Coffee with Developers - Maria Apazoglou - Making AI understandable for all in production

Hello and welcome to another edition of Coffee with Developers. Today, we're excited to share an intriguing conversation with Maria Apazoglou, a leading figure in the AI space at Thomson Reuters. Maria's career journey, insights on AI, and the exciti...

Dilek Demir

Data Science & more: The Lopez dilemma

Catwalk, Data Science, Hollywood, Google Images, Haute Couture, StackOverflow, Comfort Zone, Dota 2 and Versace – all these topics are connected and influenced by each other. Read here how and why! In 2000 Jennifer Lopez's green Versace dress went vi...

Chris Heilmann

Exploring AI: Opportunities and Risks for Developers

In today's rapidly evolving tech landscape, the integration of Artificial Intelligence (AI) in development presents both exciting opportunities and notable risks. This dynamic was the focus of a recent panel discussion featuring industry experts Kent...

Chris Heilmann

Dev Digest 134 - Where pixels sing?

News and Articles: WeAreDevelopers LIVE Data and Security Day is on Wednesday, 25/09/2024. Learn about OPC UA Updates, Best Practices for Using GitHub Secrets, Passwordless Web 1.5, Emerging AI Security Risks, Data Privacy in LLMs and get a chance to t...

From learning to earning

Jobs that call for the skills explored in this talk.

Machine Learning & Data Engineer

vengine GmbH
Hamburg, Germany

Junior

Intermediate

Python

Data Engineer (m/w/d) mit Fokus auf Databricks

Steadforce GmbH
Munich, Germany

Intermediate

Python

Senior Python Engineer

CONTIAMO GMBH
Berlin, Germany

Senior

Python

Docker

TypeScript

PostgreSQL

Data Engineer (f/m/d) - AI

smartclip Europe GmbH
Hamburg, Germany

Intermediate

Senior

ETL

Java

Scala

PySpark Software Engineer - Databricks, Azure, Data Engineering

RM IT Professional Resources AG
Zürich, Switzerland

€187-208K

Senior

PySpark

ML Data Engineer - Object Detection & Active Learning

autonomous-teaming

Remote

NoSQL

NumPy

Pandas

Docker

Python Developer in Data Mining

QYOBO GmbH

Python Developer | AI tool

Kwery

Remote

€54K

Svelte

TypeScript