Experiment, analyze, and train models as you like in one click.
- Experiment tracking made easy: see how your model is performing while it is still training.
- Compare runs, learning curves, and all training metrics to see what works best for you.
- Log every training run so that you can trace your best findings back to their source and reproduce your favorite experiments.
The Model module makes predictions with machine learning algorithms on enriched and structured data.
To start machine learning modeling, click the "New Experiment" button. After you enter the required data and parameters in the specified fields, the model is ready for training.
To train a machine learning model, the first step is to prepare the processed dataset for use in the training process. This involves several tasks, which are outlined below:
- Data specification: In the data section of the platform, provide the cleaned and preprocessed dataset that will be used to train the model. This dataset should be free of errors, inconsistencies, and missing values. Ensure that the data is in a suitable format, such as a CSV file, a DataFrame, or another compatible format supported by the platform.
- Label selection: Specify the target variable, also known as the label, which the model will learn to predict or classify. This label is typically a column in the dataset that contains the ground truth or the outcome we want the model to learn. For example, in a binary classification problem, the label could be a column indicating whether a customer made a purchase or not.
- Problem type definition: Choose the appropriate problem type for the task at hand, such as regression, classification, or clustering. This selection informs the Octai platform which algorithms and techniques to use during the training process.
Once these steps are completed, the model is ready for training.
However, if you wish to further customize the model or fine-tune specific parameters, follow the steps below:
- Advanced parameter selection: Click on the "Select More Parameters and Train" option, which allows you to modify additional settings, such as the choice of algorithms, hyperparameter values, feature selection methods, and cross-validation strategies. Adjusting these settings can help improve the performance of the model or better tailor it to the specific problem or dataset.
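As an illustration of the three basic inputs described above (dataset, label, and problem type), the following minimal Python sketch checks a CSV for missing values and guesses a problem type from the label column. It is not Octai's API: the `check_dataset` function, the sample column names, and the "decimal labels mean regression" heuristic are all assumptions made for this example.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_dataset(csv_text, label):
    """Return (problem_type, issues) for CSV text and a label column name.

    Hypothetical helper: the problem type is guessed with a simple heuristic
    (numeric labels with decimal values -> regression, otherwise
    classification). A real platform may decide this differently.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None, ["dataset has no rows"]
    if label not in rows[0]:
        return None, [f"label column {label!r} not found"]

    issues = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus cells under the key None.
                issues.append(f"row {number}: more cells than header columns")
            elif value is None or not value.strip():
                issues.append(f"row {number}: missing value in {column!r}")

    labels = [row[label] for row in rows if row[label] and row[label].strip()]
    numeric = bool(labels) and all(_is_number(v) for v in labels)
    if numeric and any(not float(v).is_integer() for v in labels):
        return "regression", issues
    return "classification", issues


# Made-up sample data: the second row has an empty "age" cell.
sample = "age,plan,purchased\n34,basic,yes\n,premium,no\n51,basic,yes\n"
print(check_dataset(sample, "purchased"))
# -> ('classification', ["row 2: missing value in 'age'"])
```

A check like this only flags problems. Fixing the missing values still has to happen before the dataset is provided for training.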
To assess the performance of a machine learning model, it is crucial to first establish a success metric. This metric serves as a benchmark to evaluate how well the model is performing based on the type of problem being solved. Success metrics differ depending on the problem type, as outlined below:
- Selecting a success metric: Choose the metric that best aligns with the objectives and requirements of the specific problem.
  - For classification problems, common success metrics include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (ROC AUC).
  - For regression problems, metrics such as mean squared error (MSE), mean absolute error (MAE), and R-squared are often used.
  - In clustering tasks, metrics like silhouette score and adjusted Rand index can be employed.
- Data splitting options: To measure the success of the model, the dataset must be partitioned into separate subsets, typically with a train-test split or cross-validation. The model is trained on one subset of the data and evaluated on another, unseen portion to gauge its performance. You can also let the platform split the data automatically.
  - Train-test split (holdout): The dataset is divided into two parts: a training set and a test set. The training set is used to train the model, while the test set serves as an independent evaluation of its performance. The split ratio (e.g., 80/20, 70/30) can be adjusted to balance the need for sufficient training data against the need for a reliable evaluation.
  - Cross-validation (k-fold): The dataset is partitioned into multiple subsets or "folds." The model is trained and evaluated multiple times, with each iteration using a different fold as the test set while the remaining folds form the training set. Because the model is evaluated on every portion of the data, this gives a more robust estimate of its performance than a single split. Commonly used variants include k-fold cross-validation and stratified k-fold cross-validation.
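The splitting strategies and success metrics above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how Octai implements them: the function names, the fixed random seed, and the default 80/20 ratio are choices made for this example.

```python
import random


def holdout_split(n_rows, test_ratio=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_ratio))
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_rows, k=5, seed=0):
    """Yield (train, test) index lists; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return {"mse": ss_res / n, "mae": sum(abs(e) for e in errors) / n, "r2": r2}


train, test = holdout_split(10, test_ratio=0.2)
print(len(train), len(test))  # -> 8 2
for fold_train, fold_test in k_fold_splits(10, k=5):
    print(sorted(fold_test))  # two held-out rows per fold
```

With k folds the model is trained k times, so cross-validation costs roughly k times as much as a single holdout split in exchange for a more stable estimate.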
In the final stages of preparing a machine learning model for training, you can customize the selection of features and, if desired, adjust the hyperparameters of the chosen algorithm. This can be accomplished through the following steps:
- Custom feature selection: Rather than using all available features in the dataset, you can choose a subset of features that are most relevant to the target variable. This can be done by analyzing the statistical relationships between the features and the target variable, such as correlation coefficients or mutual information scores. Feature selection can help improve model performance, reduce overfitting, and decrease training time. Common approaches include:
  - Univariate analysis: Perform univariate statistical tests to determine the relationship between each feature and the target variable, selecting only those features with strong relationships.
  - Feature importance analysis: Use machine learning algorithms, such as decision trees or ensemble methods, to estimate the importance of each feature. Retain only the most important features for model training.
  - Dimensionality reduction techniques: Apply methods like Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) to reduce the number of features while retaining the most valuable information.
- Hyperparameter tuning: Hyperparameters are the configuration settings of a machine learning algorithm that influence its performance. Octai often provides default hyperparameter settings, but you can customize them to potentially improve model performance. Automated hyperparameter tuning is also available in this step.
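As a small illustration of univariate feature selection, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the strongest ones. The `select_features` helper, the `top_k` parameter, and the sample columns are assumptions for this example, not part of Octai. Correlation also only captures linear relationships, so treat the ranking as a first pass.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has no spread; treat its correlation as zero.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def select_features(columns, target, top_k=2):
    """Return (selected_names, scores), ranked by |correlation| with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Made-up data: "price" is exactly twice "sqft", "noise" is unrelated.
columns = {
    "sqft": [50, 80, 120, 200],
    "rooms": [2, 3, 4, 5],
    "noise": [3, 1, 4, 1],
}
price = [100, 160, 240, 400]
selected, scores = select_features(columns, price, top_k=2)
print(selected)  # -> ['sqft', 'rooms']
```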
After completing the feature selection and hyperparameter tuning, the model is ready for training. Octai will train the model using the selected features, target variable, and specified algorithm with the provided hyperparameters. This process can lead to a more accurate and tailored machine learning model that better addresses the specific problem and dataset.
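One simple form of automated hyperparameter tuning is an exhaustive grid search: try every combination of candidate values and keep the best-scoring one. The sketch below shows the idea. The source does not say which search strategy Octai uses, so this is an assumption, and `toy_score` is a stand-in for "train a model with these settings and return its validation score."

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid.

    Returns (best_params, best_score, trials), where trials is a list of
    (params, score) pairs and a higher score is better.
    """
    names = list(param_grid)
    trials = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        trials.append((params, evaluate(params)))
    best_params, best_score = max(trials, key=lambda trial: trial[1])
    return best_params, best_score, trials


def toy_score(params):
    """Hypothetical objective that peaks at learning_rate=0.1, max_depth=4."""
    return -((params["learning_rate"] - 0.1) ** 2) - (params["max_depth"] - 4) ** 2 / 100


# Hypothetical hyperparameter names and candidate values.
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 8]}
best_params, best_score, trials = grid_search(grid, toy_score)
print(best_params)  # -> {'learning_rate': 0.1, 'max_depth': 4}
print(len(trials))  # -> 9
```

The number of trials grows multiplicatively with each added hyperparameter, which is why larger search spaces are usually explored with random or adaptive search instead.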
To gain insights into the performance and progress of a machine learning model during training, you can refer to the Experiments and Trials sections in Octai. These sections provide a comprehensive overview of the model's status and detailed evaluation metrics.
- Experiments section: This section allows you to monitor the general status of the models being run, including information such as the current progress, elapsed time, and the number of completed trials. You may also find an overview of the best-performing models, their hyperparameter configurations, and the success metrics achieved so far.
- Trials section: For a more in-depth analysis of the model's success, navigate to the Trials section. Here, you can review the performance of each individual trial, as evaluated by the chosen success metric. This section may include additional evaluation metrics, such as training and validation scores, learning curves, and confusion matrices. The Trials section can also provide insights into the relative importance of features and the impact of hyperparameter settings on model performance.
By regularly monitoring the Experiments and Trials sections, you can track the progress of your model training, identify potential areas for improvement, and make informed decisions about further tuning or adjustments. This iterative process enables you to optimize the model's performance and ensure that it meets the desired level of accuracy and generalization for the specific problem and dataset.
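The kind of comparison described above can be sketched as follows: rank trials by validation score and flag any trial whose training score is much higher than its validation score, a common sign of overfitting. The record layout, the field names, the `gap_threshold` value, and the scores are all invented for illustration and do not reflect Octai's actual data format.

```python
def summarize_trials(trials, gap_threshold=0.1):
    """Return (best_trial, overfit_names) for a list of trial records.

    Each record is a dict with hypothetical keys "name", "train_score",
    and "val_score", where a higher score is better.
    """
    best = max(trials, key=lambda trial: trial["val_score"])
    overfit = [
        trial["name"]
        for trial in trials
        if trial["train_score"] - trial["val_score"] > gap_threshold
    ]
    return best, overfit


# Made-up trial records for illustration only.
trials = [
    {"name": "trial-1", "train_score": 0.99, "val_score": 0.80},
    {"name": "trial-2", "train_score": 0.90, "val_score": 0.88},
    {"name": "trial-3", "train_score": 0.85, "val_score": 0.84},
]
best, overfit = summarize_trials(trials)
print(best["name"])  # -> trial-2
print(overfit)       # -> ['trial-1']
```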