Benchmarking Performance

The promise of AutoML is to make machine learning accessible to everyday users, but in practice each platform differs in complexity and cost. This high barrier to entry has resulted in a distinct lack of performance benchmarks.

We tested Google Cloud AI, Microsoft Azure AutoML, Amazon SageMaker Autopilot, and Akkio on a number of open-source, real-world datasets. Models were benchmarked on accuracy and F1 score, as well as training time and cost.

Accuracy is the overall measure of how often a model is correct when it predicts an outcome in the validation set, a random selection of the training data (typically 20%) that is held back and run against the model to measure performance. Accuracy is a good performance metric but can be misleading when classes are imbalanced: if 95% of records share one outcome, a model that always predicts that outcome scores 95% accuracy without learning anything. We found accuracy to be generally similar across the benchmarked platforms.
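To make the definition concrete, here is a minimal sketch in plain Python of a 20% holdout split and an accuracy calculation. The function names (`holdout_split`, `accuracy`) and the 95/5 label mix are illustrative assumptions, not part of any benchmarked platform or of the benchmark data.

```python
import random


def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows and hold back a fraction as the validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    # First slice is held back for validation, the rest is used for training.
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical imbalanced labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
always_negative = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_negative)

# Hold back 20% of 100 hypothetical records for validation.
train_rows, validation_rows = holdout_split(range(100))
```

In this sketch the always-negative model reaches high accuracy while never finding a single positive, which is why accuracy is best read alongside a second metric.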

F1 score combines precision and recall into a single number (their harmonic mean). Precision is the share of predicted outcomes that are actually correct (True Positives / Predicted Positives). Recall is the share of actual outcomes that the model correctly predicts (True Positives / Actual Positives). F1 score ranges from a low of 0 to a high of 1 and is useful for comparing models. Like accuracy, F1 scores were similar across the board.
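The ratios above can be sketched in a few lines of plain Python. The function name `precision_recall_f1`, the sample labels, and the convention of returning 0.0 when a ratio is undefined are assumptions made for illustration. The benchmark does not state how the platforms handle those edge cases or how multi-class F1 was averaged.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical validation labels: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of the three predicted positives are correct (precision 2/3) and two of the four actual positives are found (recall 1/2), giving an F1 of 4/7, or about 0.57.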

Training cost is important because you have to commit to it up front, before you know whether the model you build will even work. It's also very likely that you will want to adjust your data and retrain (for example, if you accidentally included a causal variable). Fast training lets you iterate rapidly and avoid spending a lot of time on unproductive paths. Here Akkio is the clear winner, with training times around 100x faster and no cost to train models.

We selected a wide range of open datasets for the benchmark. The datasets range in size from under 1K records to around 300K records. They cover a variety of data types, including categories, text, dates, and numbers. Several of the datasets have anonymized features (where the true backing information is private and obfuscated with PCA). The model targets include both binary and multi-category classifications. You can download the datasets here:

Dataset name       | Problem type               | Size      | Best performance
-------------------|----------------------------|-----------|-----------------
australian         | Binary Classification      | <1K rows  | Akkio
bank-marketing     | Binary Classification      | 50K rows  | Azure
biodegradation     | Binary Classification      | 1K rows   | Akkio
blood-transfusion  | Binary Classification      | <1K rows  | Akkio
credit-g           | Binary Classification      | 1K rows   | Akkio
creditcard         | Binary Classification      | (missing) | (missing)
(missing)          | Binary Classification      | (missing) | (missing)
(missing)          | Multi-class Classification | 2K rows   | Akkio
hill-valley        | Binary Classification      | 1K rows   | SageMaker
instrument-reviews | Multi-class Classification | 10K rows  | Azure
news-labeling      | Multi-class Classification | 13K rows  | Google Cloud
real-fake-jobs     | Binary Classification      | 17K rows  | Google Cloud
wilt               | Binary Classification      | 5K rows   | Akkio

Overall, model performance in terms of both accuracy and F1 score was reasonably similar across Akkio, Microsoft Azure, Google Cloud, and Amazon SageMaker. Akkio is the only no-code solution in the benchmark set, and it pulls ahead on both training time (1 minute per model) and cost (free), which lets a new class of non-technical business users apply machine learning to their workflows.

