AWS Certified Data Analytics – Specialty

Whoooh.. knocked out one more certification. A really tough one, with a practical focus on streaming and ETL.

Captured my learning here: https://www.cloudopsguru.com/#/DataAnalytics

Thanks to Jayendra Patil, Siddharth Mehta, Tom Carpenter for their invaluable posts and lessons (Udemy)

www.jayendrapatil.com

https://www.udemy.com/course/aws-serverless-glue-redshift-spectrum-athena-quicksight-training/

https://www.udemy.com/course/total-aws-certified-database-specialty-exam-prep-dbs-c01

CDK & cdk8s

Build and Deploy .Net Core WebAPI Container to Amazon EKS using CDK & cdk8s

On Sep 4 2020, I wrote a blog post where we leveraged the development capabilities of the CDK for Kubernetes framework, also known as cdk8s, along with the AWS Cloud Development Kit (AWS CDK) framework to provision infrastructure through AWS CloudFormation.

cdk8s is an open-source software development framework for defining Kubernetes applications and reusable abstractions using familiar programming languages and rich object-oriented APIs. cdk8s apps synthesize into standard Kubernetes manifests which can be applied to any Kubernetes cluster. cdk8s lets you define applications using TypeScript, JavaScript, and Python. In this blog we will use Python.
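
To make the "synthesize" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the cdk8s library; the WebApiChart class, the image name and the port values are all hypothetical. It only shows the shape of the approach: ordinary objects and parameters go in, standard Kubernetes manifests come out.

```python
import json


class WebApiChart:
    """Hypothetical stand-in for a chart: constructor arguments in, manifests out."""

    def __init__(self, name, image, replicas=2, port=80):
        self.name = name
        self.image = image
        self.replicas = replicas
        self.port = port

    def synth(self):
        """Return the Kubernetes objects as plain dictionaries."""
        labels = {"app": self.name}
        deployment = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": self.name},
            "spec": {
                "replicas": self.replicas,
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        "containers": [
                            {
                                "name": self.name,
                                "image": self.image,
                                "ports": [{"containerPort": self.port}],
                            }
                        ]
                    },
                },
            },
        }
        service = {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": self.name},
            "spec": {
                "type": "LoadBalancer",
                "selector": labels,
                "ports": [{"port": 80, "targetPort": self.port}],
            },
        }
        return [deployment, service]


# The image name is a placeholder, not the one used in the blog post.
manifests = WebApiChart("webapi", "example/webapi:latest").synth()
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed documents have the same structure as the files a synth step produces.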

The AWS CDK is an open source software development framework to model and provision your cloud application resources using familiar programming languages, including TypeScript, JavaScript, Python, C# and Java.

To read more: https://aws.amazon.com/blogs/developer/build-and-deploy-net-core-webapi-container-to-amazon-eks-using-cdk-cdk8s/

AWS Database (Purpose Built)

AWS offers many purpose-built databases. Note that this is an immensely wide topic; I wrote this as my own refresher.

AWS RDS – Relational Database:
– data is actually relational, ACID (atomicity, consistency, isolation, durability) compliant
– referential integrity
– static and unchanging
– ubiquitous – easy and available in many flavors. can handle different types of workloads
– RDS read replicas are available for 6/6 DB engines
ex: a shopper reading their order history. Query from the read replica instead of the main DB
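
The read replica example can be sketched as a small router that sends reads to replicas and everything else to the primary. This is a minimal sketch: the ReplicaRouter class and the endpoint hostnames are hypothetical, and the "starts with SELECT" check is a naive assumption that a real application would replace with something sturdier.

```python
class ReplicaRouter:
    """Pick an endpoint per statement: replicas for reads, primary for writes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def endpoint_for(self, sql):
        # Naive classification: anything that is not a plain SELECT goes to the primary.
        is_read = sql.lstrip().lower().startswith("select")
        if not is_read or not self.replicas:
            return self.primary
        # Round-robin across the available replicas.
        endpoint = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return endpoint


# Hypothetical hostnames, for illustration only.
router = ReplicaRouter(
    "primary.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders WHERE shopper_id = 42"))
print(router.endpoint_for("INSERT INTO orders (shopper_id) VALUES (42)"))
```

Replicas are updated asynchronously, so a read that must see the shopper's very latest write should still go to the primary.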

Anti-pattern: it can run many kinds of loads, but it is not a good fit for JSON objects or when there is no well-defined schema

DynamoDB – NoSQL
– fully managed, multi region
– multi primary, NoSQL
– built-in security, backup and restore
– low latency key-based queries. Fast performance that handles throughput while maintaining consistency
– ex: product descriptions

for hot data -
    DAX caching - DAX returns cached items in a matter of microseconds
    reduces response times of eventually consistent read workloads from single-digit milliseconds to microseconds
    devs need not modify their app logic

    ElastiCache for Redis and ElastiCache for Memcached
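
The caching idea for hot data can be sketched as a read-through cache. This is a plain-Python illustration, not the DAX or ElastiCache client API; the ReadThroughCache class, the load_product function and the TTL value are assumptions made for the example.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the loader on a miss."""

    def __init__(self, load, ttl_seconds=300.0, clock=time.monotonic):
        self._load = load          # function that reads from the backing table
        self._ttl = ttl_seconds    # how long a cached item stays valid
        self._clock = clock
        self._items = {}           # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired: read from the backing store and remember the result.
        self.misses += 1
        value = self._load(key)
        self._items[key] = (value, now + self._ttl)
        return value


def load_product(product_id):
    """Hypothetical stand-in for a key-based table read."""
    return {"id": product_id, "desc": f"description of {product_id}"}


cache = ReadThroughCache(load_product)
cache.get("p-1")
cache.get("p-1")
print(cache.hits, cache.misses)
```

The caller only ever calls get, which is why the application logic does not need to know whether a value came from the cache or the table. Cached values can be slightly stale until the TTL expires, which fits eventually consistent reads.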

Amazon Redshift – Data Warehouse
for cold data – for analytics, a columnar database service helps
uses columnar storage technology to improve I/O efficiency, and parallelizes queries across multiple nodes for fast query performance
analytics, trend monitoring and gaming insights to know what’s going on in the system
ex: daily reports; allows pulling from a data lake
uses standard SQL.
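
Why columnar storage helps analytics can be shown with a toy example. This is a minimal sketch with made-up order data, not how Redshift stores blocks on disk: the same table is held once as rows and once as columns, and an aggregate over one column only has to touch that column's values in the second layout.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
    {"order_id": 4, "region": "US", "amount": 40.0},
]


def to_columns(records):
    """Pivot row records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


columns = to_columns(rows)

# Row layout: every full record is visited just to read one field.
row_total = sum(row["amount"] for row in rows)

# Column layout: only the "amount" values are read.
column_total = sum(columns["amount"])

print(row_total, column_total)
```

Both totals are the same; the difference is how much data has to be read to get there, which is where the I/O efficiency comes from.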

- AQUA - Advanced Query Accelerator, a cache that does a substantial share of data processing in place on the cache,
  enabling Redshift to run up to 10x faster

Other options/flavors cover needs such as
financial requirements, document storage requirements, or a mobile game specific community

Below are some more purpose built database services options available in AWS

Amazon QLDB – ledger database with a transparent, cryptographically verifiable transaction log. Immutable
for when you need to keep track of financial activity in an organization
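
What "cryptographically verifiable" means can be sketched with a simple hash chain. This is an illustration of the general idea only, not QLDB's actual journal format or API; the Ledger class and the payload fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


class Ledger:
    """Append-only log where each entry commits to everything before it."""

    def __init__(self):
        self.entries = []

    def append(self, payload):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append(
            {"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)}
        )

    def verify(self):
        """Recompute every hash; any edit to history breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


ledger = Ledger()
ledger.append({"account": "acct-1", "change": 100})
ledger.append({"account": "acct-1", "change": -30})
print(ledger.verify())
```

Changing an old payload after the fact makes verify return False, which is the property that matters when tracking financial activity.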

Amazon DocumentDB (with MongoDB compatibility) for storage of JSON – fault-tolerant, self-healing storage that supports automatic data scaling
allows customers to scale from 10GB all the way up to 64TB per database cluster

Amazon Neptune – Graph database option
supports Property Graph and W3C’s RDF along with their respective query languages
ex: Games – connectivity between each player
Supports Apache TinkerPop Gremlin and SPARQL
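
The player connectivity example boils down to a graph traversal. Here is a minimal sketch in plain Python rather than Gremlin or SPARQL; the player names and the hops_between helper are hypothetical. It counts how many friendship hops separate two players.

```python
from collections import deque


def add_friendship(graph, a, b):
    """Record an undirected edge between two players."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def hops_between(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        player, hops = queue.popleft()
        if player == goal:
            return hops
        for friend in graph[player]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


graph = {}
add_friendship(graph, "ann", "bob")
add_friendship(graph, "bob", "cat")
add_friendship(graph, "dan", "eve")
print(hops_between(graph, "ann", "cat"))
print(hops_between(graph, "ann", "eve"))
```

A graph database runs this kind of relationship query natively, without the chains of joins a relational schema would need.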

Amazon Keyspaces – AWS managed, Cassandra-compatible service
for when a customer needs a wide-column key store
automatically keeps three replicas that can be distributed across different AZs

Amazon Timestream – Time series database. For when you need to analyze billions/trillions of data points
ex: user activity log. ex: trivago
automates rollups, retention, tiering and compression of data
ex: readings arriving constantly from device sensors
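
A rollup is easy to picture with a few sensor readings. This is a minimal sketch assuming (timestamp in seconds, value) pairs and a one-minute bucket; the rollup function is hypothetical, and Timestream does this kind of work for you rather than through code like this.

```python
from collections import defaultdict


def rollup(points, bucket_seconds=60):
    """Average raw (timestamp, value) readings into fixed-size time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


# Made-up sensor readings: (seconds since start, temperature).
readings = [(0, 10.0), (30, 20.0), (60, 30.0), (119, 50.0)]
per_minute = rollup(readings)
print(per_minute)
```

Four raw points become two per-minute averages; at billions of points, keeping the rollups and tiering or expiring the raw data is what keeps queries and storage manageable.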


Data patterns

-read heavy
    lots of incoming data, with even more query processing on it. There is read contention because many records are
    queried over and over again
    Option - a materialized view, a database object that stores the precomputed result of a query
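
Here is a minimal sketch of the materialized view option using Python's built-in sqlite3 module. SQLite has no native materialized views, so the sketch emulates one with a summary table that is rebuilt on refresh; the table names and the refresh_customer_totals helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount)
    VALUES ('ann', 10.0), ('ann', 15.0), ('bob', 7.5);
    """
)


def refresh_customer_totals(connection):
    """Rebuild the precomputed summary (the emulated materialized view)."""
    with connection:
        connection.execute("DROP TABLE IF EXISTS customer_totals")
        connection.execute(
            "CREATE TABLE customer_totals AS "
            "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
        )


def read_totals(connection):
    """Cheap read: no aggregation happens at query time."""
    return dict(connection.execute("SELECT customer, total FROM customer_totals"))


refresh_customer_totals(conn)
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 2.5)")
stale = read_totals(conn)       # the view has not been refreshed yet
refresh_customer_totals(conn)
fresh = read_totals(conn)       # now includes the new order
print(stale, fresh)
```

The repeated reads hit the small precomputed table instead of re-aggregating every record, at the cost of results being only as current as the last refresh.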

-micro service
    avoid sharing data between micro services

- break monolith into small domains which could be microservice or a collection of microservices

- isolate bounded context - part of domain driven design - back your data based on functionality

- Event sourcing and CQRS (Command Query Responsibility Segregation)
    changing the parts of the app to define which part is responsible for reading vs writing
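
The split between reading and writing can be sketched in a few classes. This is a minimal in-memory illustration, with hypothetical names throughout: commands validate and append events, and a separate read model answers queries from a view built out of those events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "deposited" or "withdrawn"
    account: str
    amount: int


class EventStore:
    """Append-only log of events; subscribers are notified of each new one."""

    def __init__(self):
        self._events = []
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def append(self, event):
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)

    def events_for(self, account):
        return [e for e in self._events if e.account == account]


def signed_amount(event):
    return event.amount if event.kind == "deposited" else -event.amount


class AccountCommands:
    """Write side: validates a command, then records what happened."""

    def __init__(self, store):
        self._store = store

    def deposit(self, account, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._store.append(Event("deposited", account, amount))

    def withdraw(self, account, amount):
        # Current state is derived by replaying the account's events.
        current = sum(signed_amount(e) for e in self._store.events_for(account))
        if amount <= 0 or amount > current:
            raise ValueError("invalid amount or insufficient funds")
        self._store.append(Event("withdrawn", account, amount))


class BalanceView:
    """Read side: a query-friendly projection kept up to date from events."""

    def __init__(self):
        self._balances = {}

    def apply(self, event):
        self._balances[event.account] = (
            self._balances.get(event.account, 0) + signed_amount(event)
        )

    def balance(self, account):
        return self._balances.get(account, 0)


store = EventStore()
view = BalanceView()
store.subscribe(view.apply)
commands = AccountCommands(store)
commands.deposit("acct-1", 100)
commands.withdraw("acct-1", 30)
print(view.balance("acct-1"))
```

The view is updated synchronously here to keep the sketch short; in a distributed setup the projection usually catches up asynchronously, so reads are eventually consistent.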

- SAGA pattern - a sequence of events that happen in a distributed transaction
    orchestrate/choreograph using orchestrators like Step Functions to coordinate activities that commit or roll back across a transactional boundary in a distributed design
    only works when we can embrace eventual consistency
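
An orchestrated saga can be sketched as a list of steps, each paired with a compensating action. This is a plain-Python illustration, not a Step Functions state machine; run_saga and the order-processing steps are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, undo the already completed steps in reverse order
    and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def reserve_inventory():
    log.append("inventory reserved")


def release_inventory():
    log.append("inventory released")


def charge_payment():
    # Simulated failure in the second service.
    raise RuntimeError("card declined")


def refund_payment():
    log.append("payment refunded")


def ship_order():
    log.append("order shipped")


def cancel_shipment():
    log.append("shipment cancelled")


succeeded = run_saga(
    [
        (reserve_inventory, release_inventory),
        (charge_payment, refund_payment),
        (ship_order, cancel_shipment),
    ]
)
print(succeeded, log)
```

Between the reservation and its release, other services can observe the intermediate state, which is why the pattern only fits when eventual consistency is acceptable.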

AWS Database Migrations

Consider a source (on prem) –> RDS / EC2 migration.

Types of migration

  • like to like (homogeneous) – moving between database instances using the same db tech
  • like to unlike (heterogeneous) – migrating to a completely different database technology

Options available for migration

  • DMS – Database Migration Service
  • third party
  • partner
  • backup and restore

for like to like – DMS uses a replication instance and keeps the source available/online. The DB engines are compatible

for the like to unlike pattern – SCT (Schema Conversion Tool) + DMS can help
if the data structures are not the same, they need to be converted to a compatible type
ex: one-off migration, source to target, consolidating databases, dev and test
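
The "convert to a compatible type" step can be pictured as a lookup from source column types to target column types. This is a minimal sketch with a small, illustrative Oracle-to-PostgreSQL mapping; it is an assumption made for the example, not the rule set SCT applies, and it only handles simple type declarations.

```python
import re

# Illustrative mapping only; a real conversion covers far more types and cases.
TYPE_MAP = {
    "VARCHAR2": "VARCHAR",
    "NUMBER": "NUMERIC",
    "CLOB": "TEXT",
    "DATE": "TIMESTAMP",
}


def convert_type(source_type):
    """Map a simple source column type to a compatible target type."""
    match = re.fullmatch(r"\s*([A-Za-z0-9_]+)\s*(\(.*\))?\s*", source_type)
    if match is None:
        raise ValueError(f"cannot parse type: {source_type!r}")
    base = match.group(1).upper()
    args = match.group(2) or ""   # keep length/precision such as (10,2)
    if base not in TYPE_MAP:
        # Anything unmapped is flagged for manual review rather than guessed.
        raise ValueError(f"no mapping for {base}; needs manual review")
    return TYPE_MAP[base] + args


print(convert_type("VARCHAR2(100)"))
print(convert_type("number(10,2)"))
print(convert_type("DATE"))
```

Types without a mapping raise an error instead of being guessed, mirroring the fact that some parts of a heterogeneous migration need manual attention.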

Snowball Edge:
if the bandwidth or infrastructure is not available, Snowball Edge can be used to copy the data into S3, from where it can be migrated

AWS Lambda Using Dagger (Dependency Injection)

Dependency Injection

DI helps write a loosely coupled architecture. By moving the dependencies to the interface of the components, the code becomes more readable and the dependencies between objects become easier to manage. Additionally, DI makes testing easier by allowing different mock implementations.

For AWS Lambdas “Minimize the complexity of your dependencies. Prefer simpler frameworks that load quickly on execution context startup. For example, prefer simpler Java dependency injection (IoC) frameworks like Dagger or Guice, over more complex ones like Spring Framework.”

Please refer to this link for more details: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

1. This simple AWS Lambda makes use of Dagger to build the dependency injection
2. Provide commonly used objects (like Gson/utility classes/DynamoDB, S3, SQS helpers) through DI
3. A sample Utility Module class with Gson is injected into the DaggerHandler
4. The sample service just echoes the input request and returns a success message
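
The injection idea itself is language independent, so here is a minimal Python sketch of it. This is not the Java code from the repository and it does not use Dagger; the JsonSerializer, EchoService and build_service names are hypothetical. It shows a service that receives its serializer through the constructor, and one function that wires the real objects together.

```python
import json


class JsonSerializer:
    """Plays the role of the shared Gson-style utility object."""

    def to_json(self, obj):
        return json.dumps(obj, sort_keys=True)


class EchoService:
    """Receives its dependency instead of constructing it."""

    def __init__(self, serializer):
        self._serializer = serializer

    def handle(self, request):
        # Echo the request back together with a success status.
        return self._serializer.to_json({"status": "success", "request": request})


def build_service():
    """The single place that knows the concrete classes.

    Dagger generates this kind of wiring code at compile time.
    """
    return EchoService(JsonSerializer())


print(build_service().handle({"id": 1}))


class FakeSerializer:
    """Test double: swapped in without touching EchoService."""

    def to_json(self, obj):
        return "fake:" + obj["status"]


print(EchoService(FakeSerializer()).handle({"id": 1}))
```

Because the service never constructs its own serializer, a test can pass in the fake one, which is the easier-testing benefit described above.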

Git Code

https://github.com/shivaramani/aws-dagger-lambda

Testing

1. $ mvn clean package
2. Create a new AWS Lambda (in console or CLI),
    - Name: daggerSvc
    - Runtime: java8
    - Role: Default Lambda Role
3. Upload the jar.
4. Provide "com.example.InvocationHandlers::handleDaggerRequest" in "Handler" textbox (on the AWS Console)
5. "Save" and "Test"

Terraform

Terraform as “infrastructure as code” seems very capable, and its concept is very similar to that of AWS CloudFormation and AWS CDK.

What’s more interesting is that you can build all the AWS services through this!

Overview – Provision AWS infrastructure using Terraform (By HashiCorp) blog

Many web and mobile applications can make use of AWS services and infrastructure to log or ingest data from customer actions and behaviors on the websites or mobile apps, to provide recommendations for better user experience. There are several ‘infrastructure as code’ frameworks available today, to help customers define their infrastructure, such as the AWS CDK or Terraform by HashiCorp. In this blog, we will walk you through a use case of logging customer behavior data on a web application and will use Terraform to model the AWS infrastructure.

The data ingestion process is exposed with an API Gateway endpoint. Amazon API Gateway passes the incoming data to an AWS Lambda function, during which the system validates the request using a Lambda authorizer and pushes the data to an Amazon Kinesis Data Firehose. The solution leverages Firehose’s capability to convert the incoming data into a Parquet file (an open source file format for Hadoop) before pushing it to Amazon S3 using the AWS Glue catalog. Additionally, a transformational/consumer Lambda does additional processing by pushing it to Amazon DynamoDB.

Read more here: https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/

AWS Batch .NET CDK Blog!

Overview – AWS CDK .NET for Batch blog

This post provides a file processing implementation using Docker images and Amazon S3, AWS Lambda, Amazon DynamoDB, and AWS Batch. In this scenario, the user uploads a CSV file into an Amazon S3 bucket, which is processed by AWS Batch as a job. These jobs can be packaged as Docker containers and are executed using Amazon EC2 and Amazon ECS.

https://aws.amazon.com/blogs/developer/orchestrating-an-application-process-with-aws-batch-using-aws-cdk/

AWS Batch Using CloudFormation Blog!

Overview – AWS Batch blog

This post provides a file processing implementation using Docker images and Amazon S3, AWS Lambda, Amazon DynamoDB, and AWS Batch. In this scenario, the user uploads a CSV file into an Amazon S3 bucket, which is processed by AWS Batch as a job. These jobs can be packaged as Docker containers and are executed using Amazon EC2 and Amazon ECS.

https://aws.amazon.com/blogs/compute/orchestrating-an-application-process-with-aws-batch-using-aws-cloudformation/

AWS .NET Core Aurora CDK Blog!

Overview – AWS .NET Core Aurora CDK blog –

Many existing .NET Core applications can be containerized using Docker and AWS services like Amazon EC2, Amazon Elastic Container Service (ECS), AWS Systems Manager (SSM), and Amazon Aurora Database, providing a full-blown API application system. The application architecture is complemented by build & pipeline tools like AWS CodeCommit and AWS CodeBuild using AWS CloudFormation. At the end of this blog, we will create a simple Microsoft .NET Web API for a ToDo Application.

https://aws.amazon.com/blogs/developer/developing-a-microsoft-net-core-web-api-application-with-aurora-database-using-aws-cdk/

AWS CDK

Using CDK constructs, we have built the above infrastructure and integrated the solution with a Public Load Balancer. The output of this stack will give the API URLs for health check and API validation. As you notice, by defining the solution using CDK, you were able to:

  • Use object-oriented techniques to create a model of your system
  • Organize your project into logical modules
  • Use code completion within your IDE

Other major advantages of this CDK approach: as a developer or development team, you should be able to

  • Use logic (if statements, for-loops, etc) when defining your infrastructure
  • Define high level abstractions, share them, and publish them to your team, company, or community
  • Share and reuse your infrastructure as a library
  • Test your infrastructure code using industry-standard protocols
  • Use your existing code review workflow