- Domain 1: Data Preparation for Machine Learning (ML) (28% of scored content)
- Domain 2: ML Model Development (26% of scored content)
- Domain 3: Deployment and Orchestration of ML Workflows (22% of scored content)
- Domain 4: ML Solution Monitoring, Maintenance, and Security (24% of scored content)
- contextuality
- applicability
- experience-based
- dynamism
- define ml problem
- collect data
- process data
- choose algorithm
- train model
- evaluate model
- deploy model
- derive inference
- monitor model
- neuron
- input layer
- hidden layers
- output layer
- Artificial Neural Networks (ANN): capable of learning from data by adjusting the weights of connections between neurons to minimize error
- Deep Neural Networks: able to model more complex patterns in the data
- Convolutional Neural Networks (CNN): designed for processing grid-like data such as images
- Recurrent Neural Networks (RNN): tailored for sequential data
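To make the neuron/layer terminology above concrete, here is a minimal NumPy sketch of one forward pass through a tiny fully connected network. The layer sizes, random weights, and activation choices are illustrative assumptions, not any particular framework's API.

```python
# Minimal sketch: one forward pass through a tiny fully connected network.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))          # input layer: 4 features
W1 = rng.normal(size=(4, 3))       # weights: input -> hidden layer (3 neurons)
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2))       # weights: hidden -> output layer (2 classes)
b2 = np.zeros(2)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden = relu(x @ W1 + b1)         # each hidden neuron: weighted sum + activation
output = softmax(hidden @ W2 + b2) # output layer: class probabilities
print(output)
```

Training (adjusting the weights to minimize error, as noted for ANNs above) would add a loss function and backpropagation on top of this forward pass.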
- DataSync
- DMS
- Clickstream data
- IoT devices
- Live gaming data
- Kinesis Data Streams
- Kinesis Data Firehose
- Amazon Managed Service for Apache Flink
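As a hedged boto3 sketch of the streaming ingestion path, the snippet below pushes one clickstream-style event into a Kinesis data stream. The region, stream name, and payload are assumptions; the stream must already exist and credentials must be configured.

```python
# Sketch: send one event to an existing Kinesis data stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "click", "ts": "2024-01-01T00:00:00Z"}

kinesis.put_record(
    StreamName="clickstream-events",        # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],          # controls shard assignment
)
```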
- Relevant
- Representative
- Rich
- Reliable
- Responsible
AWS Storage Systems
- Storage Abstractions: Data Lake, Data Lakehouse, Data Platform, Cloud Data Warehouse
- Storage Systems: HDFS, EMRFS, Object Storage, Block Storage, Streaming Storage, Cache
- Storage Low-Level Components: HDD, SSD, RAM, CPU, Networking, Compression, Serialization
- Semi Structured
- CSV
- JSON
- JSONL
- Structured
- Column Based
- Apache Parquet
- Apache Optimized Row Columnar (ORC)
- Row Based
- Apache Avro
- RecordIO
Benefits of Columnar Formats
- column-specific compression
- column-specific encoding
- queries only access relevant columns, improving performance
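A small pandas sketch of the column-pruning benefit: write a DataFrame to Parquet, then read back only the columns a query needs. The file name and columns are illustrative, and it assumes pandas with pyarrow installed.

```python
# Sketch: column pruning with Parquet via pandas.
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1000),
    "country": ["US"] * 1000,   # low-cardinality column compresses well
    "amount": [1.0] * 1000,
})
df.to_parquet("events.parquet", index=False)

# Only the requested columns are read, not the whole file.
subset = pd.read_parquet("events.parquet", columns=["user_id", "amount"])
print(subset.head())
```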
- discover data
- migrate data
- archive cold data
- replicate data
- transfer data for timely in-cloud processing
- complex ETL pipeline development
- data discovery
- support for data processing frameworks
- simplified data engineering experience
S3 Storage Classes
- S3 Standard: frequently accessed data, low latency, high throughput
- S3 Intelligent-Tiering: data with unknown or changing access patterns, automatic cost savings
- S3 Standard-IA (Infrequent Access): long-lived but infrequently accessed data
- S3 One Zone-IA: infrequently accessed data that does not require multi-Availability Zone resilience
- S3 Glacier Instant Retrieval: long-term archive data that requires milliseconds retrieval
- S3 Glacier Flexible Retrieval: long-term archive data that requires minutes to hours retrieval
- S3 Glacier Deep Archive: long-term archive data that is accessed once or twice a year
- S3 Outposts: data that needs to remain on-premises for latency or data residency requirements
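A hedged boto3 sketch of applying these classes in practice: upload an object directly to Standard-IA and add a lifecycle rule that transitions older objects under a prefix to Glacier Flexible Retrieval. The bucket name, prefix, and day threshold are assumptions.

```python
# Sketch: choose a storage class at upload time and age objects into Glacier.
import boto3

s3 = boto3.client("s3")

# Upload infrequently accessed training data directly to Standard-IA.
s3.put_object(
    Bucket="my-ml-bucket",                  # hypothetical bucket
    Key="datasets/archive/train.csv",
    Body=b"col1,col2\n1,2\n",
    StorageClass="STANDARD_IA",
)

# Transition objects under the prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-datasets",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/archive/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```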
- serverless interactive query service
- uses standard SQL
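Assuming the two bullets above describe Amazon Athena, here is a minimal boto3 sketch of running a standard SQL query and polling for completion. The database, table, and results location are hypothetical.

```python
# Sketch: run a SQL query in Athena and wait for a terminal state.
import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT country, COUNT(*) AS n FROM events GROUP BY country",
    QueryExecutionContext={"Database": "analytics_db"},        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state)
```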
- Categorical data: finite number of distinct categories
- Numerical data: anything that can be represented as a number
- Discrete: countable values
- Continuous: any value within a range
- Textual data: books, social media posts, articles, etc.
- Image data: pixel values
- Time series data: collection of observations or measurements recorded over regular intervals of time
- Managing Missing Values
- Collect
- Impute
- Drop
- Detecting and Treating Outliers
- Delete
- Logarithmic Transform
- Impute
- Performing Deduplication
- Standardizing and Reformatting
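A small pandas sketch of the cleaning steps above: impute or drop missing values, apply a logarithmic transform to soften an outlier, and deduplicate. The toy DataFrame and column names are illustrative.

```python
# Sketch: common data-cleaning steps with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 40, 120],
    "income": [30_000, 45_000, None, None, 1_000_000],
})

# Impute numeric gaps with the median, or drop rows that cannot be imputed.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["income"])

# Logarithmic transform compresses the extreme income outlier.
df["log_income"] = np.log1p(df["income"])

# Deduplication: keep the first occurrence of identical rows.
df = df.drop_duplicates()
print(df)
```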
- Load Data
- Standardize Data
- Export Data
- Create AWS Glue job
- Load clean data
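A sketch of what the "Create AWS Glue job" step above could look like as a Glue PySpark script. The Data Catalog database/table, the S3 output path, and the column mapping are placeholders; it only runs inside the Glue job environment.

```python
# Sketch: Glue PySpark job that standardizes a table and writes clean Parquet.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load raw data from the Data Catalog (placeholder database/table).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="raw_table")

# Standardize column names and types (illustrative mapping).
dyf = ApplyMapping.apply(frame=dyf, mappings=[
    ("id", "string", "id", "string"),
    ("amount", "string", "amount", "double"),
])

# Export the clean data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/"},  # placeholder path
    format="parquet",
)
job.commit()
```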
- Removing Noise and Errors
- Normalization
- Equal weighting
- Faster convergence
- Enhanced interpretability
- Standardization
- Scaling
- Robust Scaling
- MinMax Scaling
- MaxAbs Scaling
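A brief scikit-learn sketch contrasting the scalers listed above on a toy feature with one outlier; the values are illustrative, and in practice each scaler is fit on training data only.

```python
# Sketch: compare standardization and the scaling variants above.
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # one feature with an outlier

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by max absolute value
print(RobustScaler().fit_transform(X).ravel())    # median/IQR based, outlier-resistant
```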
- Tokenization: breaking text into smaller units (tokens)
- Stop Words Removal: eliminating common words that do not add significant meaning
- Stemming and Lemmatization: reducing words to their root form
- N-grams: contiguous sequences of n items from a given text
- Word Embeddings: representing words in a continuous vector space
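A compact sketch of tokenization, stop-word removal, and n-grams using scikit-learn's CountVectorizer; the sample sentences are made up, and stemming/lemmatization and word embeddings would need additional libraries (e.g. NLTK, spaCy, gensim) so they are omitted here.

```python
# Sketch: tokenize, drop English stop words, and count unigrams + bigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The model was trained on the new dataset",
    "The new dataset improved the model",
]

vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())   # remaining tokens and bigrams
print(X.toarray())                   # term counts per document
```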
- Create a labeling job
- Automated Data Labeling
- Human review
- Data labeling and validation
- Store labeled data
- Model training
- Oversampling: increasing the number of instances in the minority class
- Undersampling: reducing the number of instances in the majority class
- Class weighting: assigning higher weights to the minority class during model training
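A hedged sketch of the three imbalance strategies above on a toy dataset, using only pandas and scikit-learn; dedicated libraries such as imbalanced-learn add techniques like SMOTE.

```python
# Sketch: random oversampling, random undersampling, and class weighting.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})  # 90/10 imbalance
majority, minority = df[df.y == 0], df[df.y == 1]

# Oversample the minority class up to the majority size.
oversampled = pd.concat([majority, resample(
    minority, replace=True, n_samples=len(majority), random_state=0)])

# Undersample the majority class down to the minority size.
undersampled = pd.concat([resample(
    majority, replace=False, n_samples=len(minority), random_state=0), minority])

# Or keep the data as-is and let the model reweight the classes.
clf = LogisticRegression(class_weight="balanced")
clf.fit(df[["x"]], df["y"])
```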
Use Cases
- Linear Learner
- Linear Regression
- Logistic Regression
- Support Vector Machine (SVM)
- Clustering
- K-Means Clustering
- Elbow Method Visualization
- Dimensionality Reduction
- Principal Component Analysis (PCA)
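A short scikit-learn sketch of the elbow method for K-Means and of PCA for dimensionality reduction; the synthetic blob data is an assumption for illustration.

```python
# Sketch: elbow method for K-Means, then PCA down to 2 dimensions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)

# Inertia drops sharply until k reaches the true cluster count (the "elbow").
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))

# Project the 10-dimensional data onto its first 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)   # (300, 2)
```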
- Database Migration Service (DMS): helps migrate databases to AWS easily and securely
- DataSync: data transfer service
- DynamoDB: NoSQL database service
- EBS: block storage service
- EFS: scalable file storage for Linux
- FSx for Lustre: high-performance file system for fast processing of large datasets
- FSx for NetApp ONTAP: fully managed file storage service
- FSx for Windows File Server: fully managed Windows file system, supports SMB
- Glue DataBrew: visual data preparation tool
- Kinesis Data Streams: real-time data streaming service
- Kinesis Data Firehose: fully managed service for real-time data delivery
- Lake Formation: centralize data governance and security
- Managed Streaming for Apache Kafka (MSK): fully managed service for Apache Kafka
- Redshift: petabyte-scale data warehousing
- SageMaker Data Wrangler: simplifies data preparation and feature engineering
- SageMaker JumpStart: pre-trained models and solution templates for common ML use cases
- RDS: relational database service
- S3 Standard: object storage service
- S3 Intelligent-Tiering: automatically moves data between access tiers as access patterns change
- S3 One Zone-IA: lower-cost option for infrequently accessed data that does not require multiple availability zone resilience
Model support by AWS Region in Amazon Bedrock
- Feature: a measurable property or characteristic of the data, can be numerical or categorical
- Overfitting: model learns the training data too well, including noise and outliers, leading to poor generalization on new data
- S3 Partitioning: Apache Hive-style storage layout (key=value prefixes) that can be defined in Athena
- Scaling: adjusting the range or distribution of features
- Underfitting: model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and new data