Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler

According to The State of Data Science 2020 survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist’s time (see the following diagram).

The same survey highlights that the three biggest roadblocks to deploying a model in production are managing dependencies and environments, security, and skill gaps (see the following diagram).

The survey posits that these struggles result in fewer than half (48%) of the respondents feeling able to illustrate the impact data science has on business outcomes.

Enter Amazon SageMaker Data Wrangler, the fastest and easiest way to prepare data for machine learning (ML). SageMaker Data Wrangler gives you the ability to use a visual interface to access data, perform EDA and feature engineering, and seamlessly operationalize your data flow by exporting it into an Amazon SageMaker pipeline, Amazon SageMaker Data Wrangler job, Python file, or SageMaker feature group.

SageMaker Data Wrangler also provides over 300 built-in transforms, custom transforms using a Python, PySpark, or SparkSQL runtime, built-in data analysis such as common charts (like scatterplots or histograms), custom charts using the Altair library, and model analysis capabilities such as feature importance, target leakage, and model explainability. Finally, SageMaker Data Wrangler creates a data flow file that can be versioned and shared across your teams for reproducibility.

Solution overview

In this post, we use the retail demo store example and generate a sample dataset. We use three files: users.csv, items.csv, and interactions.csv. We first prepare the data in order to predict the customer segment based on past interactions. Our target is the field called persona, which we later transform and rename to USER_SEGMENT.

The following is a preview of the users dataset:

id,username,email,first_name,last_name,addresses,age,gender,persona
1,user1,[email protected],Nathan,Smith,"[{""first_name"": ""Nathan"", ""last_name"": ""Smith"", ""address1"": ""049 Isaac Stravenue Apt. 770"", ""address2"": """", ""country"": ""US"", ""city"": ""Johnsonmouth"", ""state"": ""NY"", ""zipcode"": ""12758"", ""default"": true}]",28,M,electronics_beauty_outdoors
2,user2,[email protected],Kevin,Martinez,"[{""first_name"": ""Kevin"", ""last_name"": ""Martinez"", ""address1"": ""074 Jennifer Flats Suite 538"", ""address2"": """", ""country"": ""US"", ""city"": ""East Christineview"", ""state"": ""MI"", ""zipcode"": ""49758"", ""default"": true}]",19,M,electronics_beauty_outdoors
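To make the structure of this file concrete, the following is a minimal sketch (not part of the Data Wrangler flow itself) that parses the two preview rows using only the Python standard library. It shows that the addresses column holds a JSON array embedded in a quoted CSV field, and that the persona column can be carried forward under the name USER_SEGMENT. The function name load_users, the choice to keep only the state of the default address, and the selection of output fields are assumptions made for illustration; the post describes a later transform of persona that this sketch does not attempt to reproduce, it only renames the field.

```python
import csv
import io
import json

# Two rows copied verbatim from the users.csv preview above.
# The email values appear redacted in the source and are kept as-is.
USERS_CSV = '''id,username,email,first_name,last_name,addresses,age,gender,persona
1,user1,[email protected],Nathan,Smith,"[{""first_name"": ""Nathan"", ""last_name"": ""Smith"", ""address1"": ""049 Isaac Stravenue Apt. 770"", ""address2"": """", ""country"": ""US"", ""city"": ""Johnsonmouth"", ""state"": ""NY"", ""zipcode"": ""12758"", ""default"": true}]",28,M,electronics_beauty_outdoors
2,user2,[email protected],Kevin,Martinez,"[{""first_name"": ""Kevin"", ""last_name"": ""Martinez"", ""address1"": ""074 Jennifer Flats Suite 538"", ""address2"": """", ""country"": ""US"", ""city"": ""East Christineview"", ""state"": ""MI"", ""zipcode"": ""49758"", ""default"": true}]",19,M,electronics_beauty_outdoors
'''


def load_users(text):
    """Parse users CSV text into a list of flat dictionaries (hypothetical helper)."""
    users = []
    for row in csv.DictReader(io.StringIO(text)):
        # The addresses field is a JSON array stored inside a quoted CSV field.
        addresses = json.loads(row["addresses"])
        # Prefer the address flagged as default; fall back to the first one, if any.
        default = next(
            (a for a in addresses if a.get("default")),
            addresses[0] if addresses else {},
        )
        users.append(
            {
                "id": int(row["id"]),
                "age": int(row["age"]),
                "gender": row["gender"],
                "state": default.get("state"),
                # Only a rename; any further transform of persona is not shown here.
                "USER_SEGMENT": row["persona"],
            }
        )
    return users


if __name__ == "__main__":
    for user in load_users(USERS_CSV):
        print(user)
```

Flattening the nested address into a single column such as state is one possible design choice; it keeps the record tabular so that it can be joined with the items and interactions data later.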

The following is a preview of the items dataset:

ITEM_ID,ITEM_URL,ITEM_SK,ITEM_NAME,ITEM_CATEGORY,ITEM_STYLE,ITEM_DESCRIPTION,ITEM_PRICE,ITEM_IMAGE,ITEM_FE

[...]

Source - Continue Reading: https://aws.amazon.com/blogs/machine-learning/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler/