In this tutorial, we walk through the entire machine learning (ML) lifecycle and show you how to architect and build an ML use case end to end using Amazon SageMaker. Amazon SageMaker provides a rich set of capabilities that enable data scientists, machine learning engineers, and developers to prepare, build, train, and deploy ML models rapidly and with ease. For our use case, we have chosen an automobile claims fraud detection example.
We first provide an architectural walkthrough of the stages of the ML lifecycle and then point to the code that builds each stage on SageMaker.
To get started, data scientists use an experimental process to explore various data preparation tasks, in some cases engineering features, and eventually settle on a standard way of preparing the data. They then begin automating individual stages of this process, making it more repeatable and scalable, and iterate until the model reaches the necessary levels of performance (such as accuracy, F1 score, and precision). Finally, they package the whole process in a repeatable, automated, and scalable ML pipeline.
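As a rough illustration of the performance check mentioned above, the following sketch computes accuracy, precision, recall, and F1 score for a binary fraud classifier and compares them against target values. It uses only plain Python. The function names, the sample labels, and the target thresholds are hypothetical and chosen for illustration; they are not taken from the tutorial's code.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate claim flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate claim passed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def meets_targets(metrics, targets):
    """Return True when every targeted metric reaches its minimum value."""
    return all(metrics[name] >= minimum for name, minimum in targets.items())


# Hypothetical ground-truth labels and model predictions for eight claims.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)

# Hypothetical targets: keep iterating until these minimums are reached.
targets = {"accuracy": 0.7, "f1": 0.6}
ready_for_pipeline = meets_targets(metrics, targets)
```

In an automated pipeline, a check of this kind would typically run after each training job, so that only models that reach the agreed thresholds move on to later stages.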
The following diagram illustrates the manual investigative and the automated operational workflows.
New capabilities required for new tasks in the ML lifecycle
At a high level, the ML lifecycle looks like the following diagram.
The general phases of the ML lifecycle are data preparation, train and tune, and deploy and monitor. Inference is the point at which we serve the trained model with new data.
As ML evolves and matures in the industry, we see an increased need for activities that support scaling ML tasks and artifacts: making the artifacts that each task outputs consistently standardized, more accessible, more transparent, and therefore more governable. In addition, each of these activities needs to grow from an exploratory activity into a consistent, automated, and scalable one via automated pipelines.
In the preceding detailed ML lifecycle diagram, the red boxes represent comparatively newer concepts and tasks that are now deemed important to include and to run in a scalable, operational, production-oriented (rather than research-oriented) environment.
These newer lifecycle tasks and their corresponding Amazon SageMaker capabilities include the following:
- Data wrangling – We use SageMaker Data Wrangler for cleaning, normalizing, transforming and encoding data, as well as joining datasets. The output of SageMaker Data Wrangler is data transformation code that works with SageMaker Processing, SageMaker Pipelines, SageMaker Feature Store, or with Pandas in a plain Python script. Feature engineering can now be done faster and easier, with SageMaker Data Wrangler where we have a GUI-based environme
Source - Continue Reading: https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/