Python Packages for Data Science

This post records Python packages for data science that I have come across in daily practice or reading. It covers the whole machine learning workflow, from visualization and pre-processing to model training and deployment.

This post is updated continuously.

Visualization

Scikit-plot

  • The quickest and easiest way to plot machine learning results, built upon scikit-learn and matplotlib
  • Metrics Module – evaluation metrics, e.g. confusion matrix, ROC, etc.
  • Estimators Module – learning curve and feature importance
  • Clusterer Module – elbow plot
  • Decomposition Module – PCA 2D projection and PCA component explained variances
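
To make the confusion matrix from the Metrics Module concrete, here is a minimal sketch in plain Python of the counts that such a plot displays. It does not use Scikit-plot; the helper name `confusion_matrix` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a nested list: rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Toy data, made up for this example.
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
matrix = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
print(matrix)
```

The diagonal holds the correctly classified samples; everything off the diagonal is a misclassification.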

Altair

  • Declarative statistical visualization, just like JMP but in Python
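  • Example:
    import altair as alt
    from vega_datasets import data
    # assumption: `cars` is the sample dataset shipped with the vega_datasets package
    cars = data.cars()
    # only need to define x, y and legend (color)
    alt.Chart(cars).mark_circle().encode(x='Horsepower',
                                         y='Miles_per_Gallon',
                                         color='Origin')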
    


Visdom

  • Live data visualization dashboard

VisualDL

  • Deep learning visualization tool supporting PaddlePaddle, PyTorch and MXNet (TensorFlow has its own tool, TensorBoard)
  • Graph / scalar / image / histogram

Data Cleansing

datacleaner

  • Works with pandas DataFrames
  • Automatically performs the basic cleansing steps below
    • Optionally drops any row with a missing value
    • Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis
    • Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents
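
The imputation and encoding steps listed above can be sketched for a single column as follows. This is a plain-Python illustration so that it runs without dependencies; datacleaner itself operates on DataFrames, and the helper `clean_column` and the sorted integer encoding are assumptions, not datacleaner's actual implementation.

```python
from statistics import median, mode

def clean_column(values):
    """Impute missing entries (None) and encode string categories as integers."""
    present = [v for v in values if v is not None]
    is_numeric = all(isinstance(v, (int, float)) for v in present)
    # Median for continuous columns, mode for categorical ones.
    fill = median(present) if is_numeric else mode(present)
    filled = [fill if v is None else v for v in values]
    if is_numeric:
        return filled
    # Map each distinct category to an integer code (sorted for determinism).
    codes = {category: i for i, category in enumerate(sorted(set(filled)))}
    return [codes[v] for v in filled]

ages = clean_column([20, None, 40, 30])
colors = clean_column(["red", None, "blue", "red"])
print(ages, colors)
```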

Feature Selection

Facets

  • A visualization tool for descriptive statistics of features and for the relationship between any pair of features
  • Helps you understand the features quickly and make decisions about feature selection and engineering

sklearn-genetic

Feature Engineering

Featuretools

  • Perform automated feature engineering with Deep Feature Synthesis (DFS)
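
As a rough idea of what automated feature engineering produces, the sketch below hand-writes one aggregation step across a parent/child relationship (customers and their transactions). It does not use Featuretools; the table, the column names and the helper `aggregate_features` are hypothetical, and DFS automates and stacks many such steps rather than just this one.

```python
from collections import defaultdict

# Hypothetical child table: one row per transaction.
transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

def aggregate_features(rows, key, value):
    """Group child rows by `key` and derive count / sum / mean of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "sum": sum(v), "mean": sum(v) / len(v)}
        for k, v in groups.items()
    }

features = aggregate_features(transactions, "customer_id", "amount")
print(features)
```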

Auto Machine Learning

Leading research group on AutoML: Machine Learning for Automated Algorithm Design

DEAP

TPOT

  • Searches for the best machine learning pipeline using genetic programming
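
The idea of evolving pipelines can be illustrated with a toy example. The sketch below is a mutation-only evolutionary search over a small, hypothetical pipeline space, which is much simpler than the genetic programming TPOT performs. The search space `SPACE` and the scoring function `toy_score` are made up; a real search would fit and cross-validate models instead.

```python
import random

# Hypothetical search space: each "pipeline" picks one option per step.
SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "knn", "logreg"],
    "depth": [1, 2, 3, 4, 5],
}

def toy_score(pipeline):
    """Stand-in for cross-validated accuracy (illustrative only)."""
    score = 0.5
    score += 0.2 if pipeline["scaler"] == "standard" else 0.0
    score += 0.2 if pipeline["model"] == "tree" else 0.0
    score += 0.02 * pipeline["depth"]
    return score

def random_pipeline(rng):
    return {step: rng.choice(options) for step, options in SPACE.items()}

def mutate(pipeline, rng):
    """Copy the pipeline and re-draw the option of one randomly chosen step."""
    child = dict(pipeline)
    step = rng.choice(list(SPACE))
    child[step] = rng.choice(SPACE[step])
    return child

def evolve(generations=20, population_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half (elitism), refill with mutated copies.
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=toy_score)

best = evolve()
print(best, toy_score(best))
```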


MLBox

auto-sklearn

  • Linux only
  • An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
  • Supports classifiers / regressors / preprocessors
    import autosklearn.classification
    cls = autosklearn.classification.AutoSklearnClassifier()  # searches for the best model among all available classifiers
    cls.fit(X_train, y_train)
    predictions = cls.predict(X_test)
    

Hyperparameter Searching

Hyperopt

hyperopt-sklearn

  • Hyperopt-based model selection among the machine learning algorithms in scikit-learn, without having to pass a search space
  • Supports classifiers and regressors
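
For readers new to hyperparameter searching, the following sketch shows the basic loop: sample a configuration, score it, keep the best. It uses plain random search, not the smarter algorithms Hyperopt offers, and the `objective` function and parameter ranges are invented for illustration.

```python
import math
import random

def objective(params):
    """Toy loss with its minimum at learning_rate=0.1 and depth=4 (illustrative only)."""
    return (math.log10(params["learning_rate"]) + 1) ** 2 + (params["depth"] - 4) ** 2

def sample(rng):
    """Draw one configuration from the (hypothetical) search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform between 1e-4 and 1
        "depth": rng.randint(1, 8),
    }

def random_search(n_trials=200, seed=0):
    rng = random.Random(seed)
    trials = [sample(rng) for _ in range(n_trials)]
    return min(trials, key=objective)

best = random_search()
print(best, objective(best))
```

In practice the objective would train a model and return a validation loss, which is why the number of trials matters far more than in this toy setting.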

Stacking Architecture

mlxtend

  • Ensemble stacking for classifiers or regressors, in both a standard version and a cross-validation (CV) version
  • Also has other useful tools for data science, such as a feature selector
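
The difference the CV version makes is that the second-level model is trained on out-of-fold predictions, so it never sees a prediction made by a base model for a sample that model was trained on. A minimal sketch of that step in plain Python follows; it is not mlxtend's implementation, and the two base models and the fold assignment by index are simplifying assumptions.

```python
def mean_model(train_x, train_y):
    """Base model 1: always predicts the training mean."""
    avg = sum(train_y) / len(train_y)
    return lambda x: avg

def nearest_neighbor_model(train_x, train_y):
    """Base model 2: predicts the target of the closest training point."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

def out_of_fold_predictions(x, y, base_models, n_folds=3):
    """Build meta-features: each row holds the base models' predictions
    for a sample that those models did not see during training."""
    meta = [[None] * len(base_models) for _ in x]
    for fold in range(n_folds):
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for m, fit in enumerate(base_models):
            predict = fit(train_x, train_y)
            for i in test_idx:
                meta[i][m] = predict(x[i])
    return meta

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
meta_features = out_of_fold_predictions(x, y, [mean_model, nearest_neighbor_model])
print(meta_features)
```

A second-level model would then be fitted on `meta_features` against `y`.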

StackNet

  • Stacking in a neural-network-like way: the neurons of a neural network (linear regression units) are replaced with any supervised learning algorithm, and training happens in a forward pass only because there is no gradient to propagate back
  • Restacking

Probabilistic Machine Learning

PyMC3

  • Bayesian statistical modeling and probabilistic machine learning, focusing on advanced Markov chain Monte Carlo and variational fitting algorithms
  • The GitHub README includes tutorials for learning Bayesian statistics using PyMC3
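
To give a feel for what Markov chain Monte Carlo does, here is a minimal sketch of a random-walk Metropolis sampler for the bias of a coin. It is far simpler than the samplers PyMC3 provides and does not use PyMC3 at all; the data (7 heads, 3 tails), the uniform prior and the step size are assumptions chosen for illustration.

```python
import math
import random

def log_posterior(theta, heads, tails):
    """Log of the unnormalized posterior of a coin's bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal, heads, tails) - log_posterior(theta, heads, tails)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis(heads=7, tails=3)
burned = samples[2000:]  # discard the first samples as burn-in
posterior_mean = sum(burned) / len(burned)
print(round(posterior_mean, 2))
```

For this model the exact posterior is Beta(8, 4) with mean 8/12, so the sampled mean can be checked against the analytic value.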

Specific Data Types

Image

Detectron from Facebook

  • Pre-trained object detection models that can output object masks (instance segmentation), not just bounding boxes

FastPhotoStyle

  • Art style transfer algorithm by NVIDIA

Time Series Data

Working with Time Series Data in Python

  • A list of Python packages for time series data

Natural Language Processing

spaCy

DeepSpeech

  • TensorFlow implementation of speech-to-text, based on Baidu's Deep Speech research

TextBlob

Audio Data

Pydub

  • Audio engineering, such as synthesizing an audio training dataset by combining background noise with target sounds
  • Found in Deep Learning Specialization Course 5 Week 3 assignment
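
Combining background noise with a target sound boils down to adding samples at a chosen position. The sketch below shows that idea on plain lists of 16-bit samples; it does not use Pydub, and the helper name `overlay` and the sample values are made up for illustration.

```python
def overlay(background, clip, position):
    """Mix `clip` into `background` starting at sample index `position`.

    The result keeps the background's length; samples are clamped to the
    signed 16-bit range.
    """
    mixed = list(background)
    for offset, sample in enumerate(clip):
        i = position + offset
        if i >= len(mixed):
            break  # the clip runs past the end of the background
        mixed[i] = max(-32768, min(32767, mixed[i] + sample))
    return mixed

background = [100, -100, 100, -100, 100, -100]
target = [1000, 2000, 32767]
mixed = overlay(background, target, position=2)
print(mixed)
```

When building a training set this way, the insertion position is also what determines where the label for the target sound goes.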

JazzML

  • Computational Jazz Improvisation
  • Found in Deep Learning Specialization Course 5 Week 1 assignment

Spatial Data

Rasterio

Deployment

Flask

Binder 2.0

Mobile Machine Learning

TuriCreate

Baidu Mobile ML

Others

Tensorflow Project Template

  • Quickly create models with well-designed, object-oriented code

JupyterLab

  • Jupyter Notebook with a MATLAB-like interface


tqdm

  • Progress bar for loops, e.g. 76%|████████████████████████████ | 7568/10000 [00:33<00:10, 229.00it/s]
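
With tqdm itself, wrapping an iterable as `tqdm(iterable)` is enough to get such a bar. To show what goes into a line like the one above, here is a dependency-free sketch that renders a simplified version (percentage, bar and counter only, no timing). The helper `render_progress` is hypothetical and not part of tqdm.

```python
def render_progress(done, total, width=20):
    """Return a one-line text progress bar: percentage, bar, and done/total."""
    fraction = done / total
    filled = int(width * fraction)
    bar = "█" * filled + " " * (width - filled)
    return f"{fraction:4.0%}|{bar}| {done}/{total}"

line = render_progress(7568, 10000)
print(line)
```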

Docker

Prettier Python Plugin

  • Formats your Python code so that it looks cleaner and more professional

Commercial Tools

KNIME

Orange

  • GUI for data science, like Klarity ACE