MLlib (DataFrame-based)
Pipeline APIs
Abstract class for transformers that transform one dataset into another. |
|
Abstract class for transformers that take one input column, apply transformation, and output the result as a new column. |
|
Abstract class for estimators that fit models to data. |
|
|
Abstract class for models that are fitted by estimators. |
Estimator for prediction tasks (regression and classification). |
|
Model for prediction tasks (regression and classification). |
|
|
A simple pipeline, which acts as an estimator. |
|
Represents a compiled pipeline with transformers and fitted models. |
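The relationship between transformers, estimators, models, and pipelines described above can be sketched in plain Python. This is a toy illustration that does not use pyspark.ml: the classes `AddOne`, `MeanCenter`, and `MeanCenterModel` and the list-of-dicts dataset are hypothetical stand-ins for real stages and DataFrames.

```python
from typing import Dict, List

Rows = List[Dict[str, float]]  # hypothetical stand-in for a DataFrame


class Transformer:
    """Transforms one dataset into another."""
    def transform(self, rows: Rows) -> Rows:
        raise NotImplementedError


class Estimator:
    """Fits a model (itself a Transformer) to data."""
    def fit(self, rows: Rows) -> Transformer:
        raise NotImplementedError


class AddOne(Transformer):  # hypothetical stateless transformer
    def transform(self, rows: Rows) -> Rows:
        return [{**r, "x": r["x"] + 1} for r in rows]


class MeanCenterModel(Transformer):  # hypothetical fitted model
    def __init__(self, mean: float) -> None:
        self.mean = mean

    def transform(self, rows: Rows) -> Rows:
        return [{**r, "x": r["x"] - self.mean} for r in rows]


class MeanCenter(Estimator):  # hypothetical estimator
    def fit(self, rows: Rows) -> MeanCenterModel:
        return MeanCenterModel(sum(r["x"] for r in rows) / len(rows))


class PipelineModel(Transformer):
    """Holds transformers and fitted models, applied in order."""
    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, rows: Rows) -> Rows:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """A pipeline is itself an estimator: fitting it yields a PipelineModel."""
    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: Rows) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)    # estimator -> fitted model
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the next stage
        return PipelineModel(fitted)


data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([AddOne(), MeanCenter()]).fit(data)
result = model.transform(data)
```

The point of the design is that fitting replaces every estimator stage with its fitted model, so the resulting pipeline model contains only transformers and can be applied to new data in a single call.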
Parameters
|
A param with self-contained documentation. |
|
Components that take parameters. |
Factory methods for common type conversion functions for Param.typeConverter. |
Feature
|
Binarize a column of continuous features given a threshold. |
|
LSH class for Euclidean distance metrics. |
|
Model fitted by |
|
Maps a column of continuous features to a column of feature buckets. |
|
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. |
|
Model fitted by |
|
Extracts a vocabulary from document collections and generates a |
|
Model fitted by |
|
A feature transformer that takes the 1D discrete cosine transform of a real vector. |
|
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. |
|
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). |
|
Maps a sequence of terms to their term frequencies using the hashing trick. |
|
Compute the Inverse Document Frequency (IDF) given a collection of documents. |
|
Model fitted by |
|
Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located. |
|
Model fitted by |
|
A |
|
Implements the feature interaction transform. |
|
Rescale each feature individually to range [-1, 1] by dividing by the largest absolute value in each feature. |
|
Model fitted by |
|
LSH class for Jaccard distance. |
|
Model produced by |
|
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. |
|
Model fitted by |
|
A feature transformer that converts the input array of strings into an array of n-grams. |
|
Normalize a vector to have unit norm using the given p-norm. |
|
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. |
|
Model fitted by |
|
PCA trains a model to project vectors to a lower dimensional space of the top |
|
Model fitted by |
|
Perform feature expansion in a polynomial space. |
|
|
|
RobustScaler removes the median and scales the data according to the quantile range. |
|
Model fitted by |
|
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). |
|
Implements the transforms required for fitting a dataset against an R model formula. |
|
Model fitted by |
|
Implements the transforms which are defined by SQL statement. |
|
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. |
|
Model fitted by |
|
A feature transformer that filters out stop words from input. |
|
A label indexer that maps a string column of labels to an ML column of label indices. |
|
Model fitted by |
|
A tokenizer that converts the input string to lowercase and then splits it by white spaces. |
|
Feature selector based on univariate statistical tests against labels. |
|
Model fitted by |
|
Feature selector that removes all low-variance features. |
|
Model fitted by |
|
A feature transformer that merges multiple columns into a vector column. |
|
Class for indexing categorical feature columns in a dataset of Vector. |
|
Model fitted by |
|
A feature transformer that adds size information to the metadata of a vector column. |
|
This class takes a feature vector and outputs a new feature vector with a subarray of the original features. |
|
Word2Vec trains a model of Map(String, Vector), i.e. |
|
Model fitted by |
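Several of the scalers and transformers listed in this section reduce to simple per-column arithmetic. The following plain-Python sketch illustrates three of them (thresholding, max-abs scaling, and min-max scaling) on a single column of floats. It is not the PySpark implementation; the function names are hypothetical, and the handling of an all-zero or constant column is an assumption made for the sketch.

```python
from typing import List


def binarize(values: List[float], threshold: float) -> List[float]:
    # Values strictly greater than the threshold become 1.0, all others 0.0.
    return [1.0 if v > threshold else 0.0 for v in values]


def max_abs_scale(values: List[float]) -> List[float]:
    # Divide by the largest absolute value so results fall in [-1, 1].
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        # Assumption for this sketch: leave an all-zero column unchanged.
        return [0.0 for _ in values]
    return [v / largest for v in values]


def min_max_scale(values: List[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    # Linearly map [min(values), max(values)] onto [lo, hi].
    e_min, e_max = min(values), max(values)
    if e_max == e_min:
        # Assumption for this sketch: a constant column maps to the midpoint.
        return [0.5 * (lo + hi) for _ in values]
    return [(v - e_min) / (e_max - e_min) * (hi - lo) + lo for v in values]


column = [-4.0, 0.0, 2.0]
binarized = binarize(column, threshold=0.0)
max_abs_scaled = max_abs_scale(column)
min_max_scaled = min_max_scale(column)
```

Note that max-abs scaling does not shift the data, so zeros stay zero, whereas min-max scaling shifts as well as scales.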
Classification
|
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. |
|
Model fitted by LinearSVC. |
|
Abstraction for LinearSVC Results for a given model. |
|
Abstraction for LinearSVC Training results. |
|
Logistic regression. |
|
Model fitted by LogisticRegression. |
|
Abstraction for Logistic Regression Results for a given model. |
|
Abstraction for multinomial Logistic Regression Training results. |
|
Binary Logistic regression results for a given model. |
Binary Logistic regression training results for a given model. |
|
|
Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features. |
|
Model fitted by DecisionTreeClassifier. |
|
Gradient-Boosted Trees (GBTs) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features. |
|
Model fitted by GBTClassifier. |
|
Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features. |
|
Model fitted by RandomForestClassifier. |
|
Abstraction for RandomForestClassification Results for a given model. |
Abstraction for RandomForestClassification training results. |
|
BinaryRandomForestClassification results for a given model. |
|
BinaryRandomForestClassification training results for a given model. |
|
|
Naive Bayes Classifiers. |
|
Model fitted by NaiveBayes. |
|
Classifier trainer based on the Multilayer Perceptron. |
Model fitted by MultilayerPerceptronClassifier. |
|
Abstraction for MultilayerPerceptronClassifier Results for a given model. |
|
Abstraction for MultilayerPerceptronClassifier Training results. |
|
|
Reduction of Multiclass Classification to Binary Classification. |
|
Model fitted by OneVsRest. |
|
Factorization Machines learning algorithm for classification. |
|
Model fitted by |
|
Abstraction for FMClassifier Results for a given model. |
|
Abstraction for FMClassifier Training results. |
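The hinge loss mentioned for the linear SVC entry above can be illustrated with a few lines of plain Python. This is a minimal sketch of the loss only, assuming labels encoded as -1 and +1 and a linear score `w.x + b`. It says nothing about how the optimizer works, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def hinge_loss(label: int, score: float) -> float:
    # label is assumed to be -1 or +1; score is the raw linear output w.x + b.
    # The loss is zero once the example is on the correct side with margin >= 1.
    return max(0.0, 1.0 - label * score)


def mean_hinge_loss(
    weights: Sequence[float],
    bias: float,
    rows: List[Tuple[Sequence[float], int]],
) -> float:
    total = 0.0
    for features, label in rows:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        total += hinge_loss(label, score)
    return total / len(rows)


rows = [([2.0], 1), ([-0.5], -1)]
loss = mean_hinge_loss(weights=[1.0], bias=0.0, rows=rows)
```

The first row is classified with a margin of 2 and contributes nothing. The second is on the correct side but inside the margin, so it contributes 0.5.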
Clustering
|
A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark. |
|
Model fitted by BisectingKMeans. |
|
Bisecting KMeans clustering results for a given model. |
|
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.). |
|
Model fitted by KMeans. |
|
Summary of KMeans. |
|
GaussianMixture clustering. |
|
Model fitted by GaussianMixture. |
|
Gaussian mixture clustering results for a given model. |
|
Latent Dirichlet Allocation (LDA), a topic model designed for text documents. |
|
Latent Dirichlet Allocation (LDA) model. |
|
Local (non-distributed) model fitted by |
|
Distributed model fitted by |
|
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data. |
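The phrase "truncated power iteration on a normalized pair-wise similarity matrix" can be made concrete with a small plain-Python sketch. It row-normalizes a dense similarity matrix, starts from a degree-based vector, and runs a fixed number of iterations with L1 normalization. The dense list-of-lists input, the degree-based start, and the function name are assumptions for illustration; this is not the distributed implementation, and the final clustering step on the embedding is omitted.

```python
from typing import List


def truncated_power_iteration(affinity: List[List[float]], iterations: int) -> List[float]:
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalise the similarity matrix: W = D^-1 A.
    w = [[affinity[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    # Degree-based starting vector (a uniform start would be a fixed point of W).
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(iterations):
        nxt = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in nxt)  # L1 normalisation
        v = [x / norm for x in nxt]
    return v


# Two groups: nodes 0-2 are strongly similar, nodes 3-4 are strongly similar,
# and every cross-group pair is only weakly similar (0.1).
affinity = [
    [0.0, 1.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 0.1, 1.0, 0.0],
]
embedding = truncated_power_iteration(affinity, iterations=3)
```

Stopping early matters: run to convergence, the vector becomes uniform and carries no cluster information, while after a few iterations nodes in the same group already share nearly the same value and the groups remain separated.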
Functions
|
Converts a column of array of numeric type into a column of pyspark.ml.linalg.DenseVector instances. |
|
Converts a column of MLlib sparse/dense vectors into a column of dense arrays. |
|
Given a function which loads a model and returns a predict function for inference over a batch of numpy inputs, returns a Pandas UDF wrapper for inference over a Spark DataFrame. |
Vector and Matrix
|
A dense vector represented by a value array. |
|
A simple sparse vector class for passing data to MLlib. |
Factory methods for working with vectors. |
|
|
|
|
Column-major dense matrix. |
|
Sparse Matrix stored in CSC format. |
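The two matrix layouts named above, column-major dense storage and compressed sparse column (CSC) storage, can be related with a short plain-Python sketch. The function names are hypothetical and no pyspark.ml.linalg classes are used. The argument order (row count, column count, column pointers, row indices, values) is simply the conventional CSC description.

```python
from typing import List


def csc_to_column_major(
    num_rows: int,
    num_cols: int,
    col_ptrs: List[int],
    row_indices: List[int],
    values: List[float],
) -> List[float]:
    # Column-major: entry (i, j) lives at position j * num_rows + i.
    dense = [0.0] * (num_rows * num_cols)
    for col in range(num_cols):
        # col_ptrs[col] .. col_ptrs[col + 1] delimits the stored entries of this column.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[col * num_rows + row_indices[k]] = values[k]
    return dense


def entry(dense: List[float], num_rows: int, i: int, j: int) -> float:
    return dense[j * num_rows + i]


# The 3x3 matrix
#   1.0 0.0 4.0
#   0.0 3.0 5.0
#   2.0 0.0 6.0
# stored column by column, keeping only the non-zero entries.
col_ptrs = [0, 2, 3, 6]
row_indices = [0, 2, 1, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

dense = csc_to_column_major(3, 3, col_ptrs, row_indices, values)
```

CSC stores one pointer per column plus one row index per non-zero value, which is why it pays off only when most entries are zero.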
Recommendation
|
Alternating Least Squares (ALS) matrix factorization. |
|
Model fitted by ALS. |
Regression
|
Accelerated Failure Time (AFT) Model Survival Regression. |
|
Model fitted by |
|
Decision tree learning algorithm for regression. It supports both continuous and categorical features. |
|
Model fitted by |
|
Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features. |
|
Model fitted by |
|
Generalized Linear Regression. |
|
Model fitted by |
|
Generalized linear regression results evaluated on a dataset. |
Generalized linear regression training results. |
|
|
Currently implemented using the parallelized pool adjacent violators algorithm. |
|
Model fitted by |
|
Linear regression. |
|
Model fitted by |
|
Linear regression results evaluated on a dataset. |
|
Linear regression training results. |
|
Random Forest learning algorithm for regression. It supports both continuous and categorical features. |
|
Model fitted by |
|
Factorization Machines learning algorithm for regression. |
|
Model fitted by |
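The pool adjacent violators algorithm mentioned in this section can be sketched sequentially in plain Python. This is a single-machine illustration of the idea only (merge neighbouring blocks whenever they violate the non-decreasing order, replacing them with their weighted mean), not the parallelized implementation. The function name and the optional `weights` argument are hypothetical.

```python
from typing import List, Optional, Sequence


def pool_adjacent_violators(
    ys: Sequence[float], weights: Optional[Sequence[float]] = None
) -> List[float]:
    if weights is None:
        weights = [1.0] * len(ys)
    # Each block is [weighted mean, total weight, number of points pooled].
    blocks: List[list] = []
    for y, w in zip(ys, weights):
        blocks.append([float(y), float(w), 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the blocks back to one fitted value per input point.
    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fitted = pool_adjacent_violators([1, 3, 2, 4])
```

Here the pair (3, 2) violates monotonicity and is pooled into two values of 2.5, giving a non-decreasing fit.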
Statistics
Conduct Pearson’s independence test for every feature against the label. |
|
Compute the correlation matrix for the input dataset of Vectors using the specified method. |
|
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution. |
|
|
Represents a (mean, cov) tuple. |
Tools for vectorized statistics on MLlib Vectors. |
|
|
A builder object that provides summary statistics about a given column. |
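As an illustration of what a correlation matrix over a dataset of vectors contains, the following plain-Python sketch computes pairwise Pearson correlations between the columns of a small row-oriented dataset. It is a sketch of the statistic only: the function names are hypothetical, it covers only the Pearson method, and it assumes no column is constant.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Assumes neither column is constant (otherwise this divides by zero).
    return cov / (dev_x * dev_y)


def correlation_matrix(rows: List[Sequence[float]]) -> List[List[float]]:
    # Each row is one feature vector; correlate every pair of columns.
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


rows = [
    [1.0, 2.0, 6.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 2.0],
]
corr = correlation_matrix(rows)
```

In this dataset the second column is twice the first and the third decreases linearly as the first increases, so the off-diagonal entries are close to +1 and -1 respectively.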
Tuning
Builder for a param grid used in grid search-based model selection. |
|
|
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. |
|
CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data. |
|
Validation for hyper-parameter tuning. |
|
Model from train validation split. |
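The two ingredients of grid search with k-fold cross validation, a parameter grid and a set of (training, test) pairs, can be sketched in plain Python. This is an illustration under stated assumptions and not the PySpark implementation: folds are built here by shuffling indices and dealing them round-robin, the function names are hypothetical, and the parameter names in the grid are used purely as example dictionary keys.

```python
import random
from itertools import product
from typing import Dict, Iterator, List, Sequence, Tuple


def k_fold_splits(rows: Sequence, k: int, seed: int = 0) -> Iterator[Tuple[list, list]]:
    # Shuffle the row indices, then deal them into k non-overlapping folds.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = [rows[i] for i in folds[held_out]]
        train = [rows[i] for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def param_grid(grid: Dict[str, list]) -> List[Dict[str, object]]:
    # Cartesian product of all candidate values, one dict per combination.
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


splits = list(k_fold_splits(list(range(6)), k=3))
candidates = param_grid({"regParam": [0.1, 0.01], "maxIter": [10, 100]})
```

With 6 rows and k=3, each pair trains on 4 rows and tests on 2. Model selection would then average a metric over the 3 pairs for each of the 4 candidate parameter settings and keep the best one.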
Evaluation
Base class for evaluators that compute metrics from predictions. |
|
|
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. |
|
Evaluator for Regression, which expects input columns prediction, label and an optional weight column. |
Evaluator for Multiclass Classification, which expects input columns: prediction, label, weight (optional) and probabilityCol (only for logLoss). |
|
Evaluator for Multilabel Classification, which expects two input columns: prediction and label. |
|
|
Evaluator for Clustering results, which expects two input columns: prediction and features. |
|
Evaluator for Ranking, which expects two input columns: prediction and label. |
Frequency Pattern Mining
|
A parallel FP-growth algorithm to mine frequent itemsets. |
|
Model fitted by FPGrowth. |
|
A parallel PrefixSpan algorithm to mine frequent sequential patterns. |
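To make the term "frequent itemset" concrete, the sketch below counts the support of every itemset by brute-force enumeration and keeps those that meet a minimum support fraction. This is deliberately not FP-growth, which avoids enumerating candidates. The function name and the `min_support` threshold semantics (fraction of transactions, inclusive) are assumptions for illustration, and the approach is only practical for tiny inputs.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def frequent_itemsets(
    transactions: List[Sequence[str]], min_support: float
) -> Dict[Tuple[str, ...], int]:
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate every non-empty subset of the transaction (exponential cost).
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    # Keep itemsets that appear in at least min_support of all transactions.
    return {itemset: c for itemset, c in counts.items() if c / n >= min_support}


transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
frequent = frequent_itemsets(transactions, min_support=0.6)
```

In this example every single item and the pairs (a, b) and (a, c) appear in at least two of the three transactions, while (b, c) and (a, b, c) appear only once and are dropped.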
Image
Internal class for pyspark.ml.image.ImageSchema attribute. |
|
Internal class for pyspark.ml.image.ImageSchema attribute. |
Distributor
|
A class to support distributed training on PyTorch and PyTorch Lightning using PySpark. |
Utilities
Base class for MLWriter and MLReader. |
|
Helper trait for making simple |
|
|
Specialization of |
Helper trait for making simple |
|
|
Specialization of |
Utility class that can save ML instances in different formats. |
|
Base class for models that provides Training summary. |
|
Object with a unique ID. |
|
Mixin for instances that provide |
|
|
Utility class that can load ML instances. |
Mixin for ML instances that provide |
|
|
Utility class that can save ML instances. |