
Cloudera Data Scientist Training

Course Details

Tuition (CAD): N/A
Tuition (USD): 3195.00

This four-day workshop covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines. They dive into the foundations of the Spark architecture and execution model necessary to effectively configure, monitor, and tune their Spark applications. Participants also learn how Spark integrates with key components of the Cloudera platform such as HDFS, YARN, Hive, Impala, and Hue as well as their favorite Python or R packages.

Who Can Benefit

  • The workshop is designed for data scientists who use Python or R to work with small datasets on a single machine and who need to scale up their data science and machine learning workflows to large datasets on distributed clusters. Data engineers, data analysts, developers, and solution architects who collaborate with data scientists will also find this workshop valuable.
  • Workshop participants walk through an end-to-end data science and machine learning workflow based on realistic scenarios and datasets from a fictitious technology company. The material is presented through a sequence of brief lectures, interactive demonstrations, extensive hands-on exercises, and lively discussions. The demonstrations and exercises are conducted in Python (with PySpark) using Cloudera Data Science Workbench (CDSW); supplemental examples using R (with sparklyr) are provided.

Skills Gained

  • How to use Apache Spark to run data science and machine learning workflows at scale
  • How to use Spark SQL and DataFrames to work with structured data
  • How to use MLlib, Spark’s machine learning library
  • How to use PySpark, Spark’s Python API
  • How to use sparklyr, a dplyr-compatible R interface to Spark
  • How to use Cloudera Data Science Workbench (CDSW)
  • How to use other Cloudera platform components, including HDFS, Hive, Impala, and Hue

Prerequisites

  • Workshop participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.

Course Content

Course Outline

  • Data Science Overview
  • Cloudera Data Science Workbench (CDSW)
  • Case Study
  • Apache Spark
  • Summarizing and Grouping
  • DataFrames
  • Window Functions
  • Exploring DataFrames
  • Apache Spark Job Execution
  • Processing Text and Training and Evaluating Topic Models
  • Training and Evaluating Recommender Models
  • Running a Spark Application from CDSW
  • Columns of a DataFrame
  • Inspecting a Spark SQL DataFrame
  • Transforming DataFrames
  • Monitoring, Tuning, and Configuring Spark Applications
  • Machine Learning Overview
  • Training and Evaluating Regression Models
  • Working with Machine Learning Pipelines
  • Deploying Machine Learning Pipelines
  • Transforming DataFrame Columns
  • Complex Types
  • User-Defined Functions
  • Reading and Writing Data
  • Combining and Splitting DataFrames
  • Training and Evaluating Classification Models
  • Tuning Algorithm Hyperparameters Using Grid Search
  • Training and Evaluating Clustering Models
  • Overview of sparklyr
  • Introduction to Additional CDSW Features
