Ayon Roy

PySpark - Combining Machine Learning & Big Data

How do you apply machine learning when your dataset is too big for a single machine? Discover PySpark's powerful, distributed ML pipelines.

PySpark - Combining Machine Learning & Big Data
#1about 3 minutes

Combining big data and machine learning for business insights

The exponential growth of data necessitates combining big data processing with machine learning to personalize user experiences and drive revenue.

#2about 3 minutes

An introduction to the Apache Spark analytics engine

Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs and specialized libraries like Spark SQL and MLlib.

#3about 4 minutes

Understanding Spark's core data APIs and abstractions

Spark's data abstractions evolved from the low-level Resilient Distributed Dataset (RDD) to the more optimized and user-friendly DataFrame and Dataset APIs.

#4about 11 minutes

How the Spark cluster architecture enables parallel processing

Spark's architecture uses a driver program to coordinate tasks across a cluster manager and multiple worker nodes, which run executors to process data in parallel.

#5about 5 minutes

Using Python with Spark through the PySpark library

PySpark provides a Python API for Spark, using the Py4J library to communicate between the Python process and Spark's core JVM environment.

#6about 5 minutes

Exploring the key features of the Spark MLlib library

Spark's MLlib offers a comprehensive toolkit for machine learning, including pre-built algorithms, featurization tools, pipelines for workflow management, and model persistence.

#7about 4 minutes

The standard workflow for machine learning in PySpark

A typical machine learning workflow in Spark involves using DataFrames, applying Transformers for feature engineering, training a model with an Estimator, and orchestrating these steps with a Pipeline.

#8about 3 minutes

Pre-built algorithms and utilities available in Spark MLlib

MLlib includes a variety of common, pre-built algorithms for classification, regression, and clustering, such as logistic regression, SVM, and K-means clustering.

Related jobs
Jobs that call for the skills explored in this talk.

d

Saby Company
Delebio, Italy

Junior

job ad

Saby Company
Delebio, Italy

Intermediate

Featured Partners

Related Articles

View all articles
DD
Dilek Demir
Data Science & more: The Lopez dilemma
Catwalk, Data Science, Hollywood, Google Images, Haute Couture, StackOverflow, Comfort Zone, Dota 2 and Versace – all these topics are connected and influenced by each other. Read here how and why!In 2000 Jennifer Lopez's green Versace dress went vi...
Data Science & more: The Lopez dilemma
CH
Chris Heilmann
Exploring AI: Opportunities and Risks for Developers
In today's rapidly evolving tech landscape, the integration of Artificial Intelligence (AI) in development presents both exciting opportunities and notable risks. This dynamic was the focus of a recent panel discussion featuring industry experts Kent...
Exploring AI: Opportunities and Risks for Developers
CH
Chris Heilmann
Dev Digest 134 - Where pixels sing?
News and ArticlesWeAreDevelopers LIVE Data and Security Day is on Wednesday, 25/09/2024. Learn about OPC UA Updates, Best Practices for Using GitHub Secrets, Passwordless Web 1.5, Emerging AI Security Risks, Data Privacy in LLMs and get a chance to t...
Dev Digest 134 - Where pixels sing?

From learning to earning

Jobs that call for the skills explored in this talk.