AWS CDK Redshift Demo
Abstract
Table of Contents
🚀 AWS Redshift Overview
- AWS Redshift is a cloud-based petabyte-scale data warehouse service offered as one of Amazon’s ecosystem of data solutions.
- Based on PostgreSQL, the platform integrates with most third-party applications through its ODBC and JDBC drivers (see the connection sketch after this list).
- Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and by parallelizing queries across multiple nodes.
- Redshift cluster overview
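Because Redshift is PostgreSQL-compatible on the wire, standard PostgreSQL clients can talk to it directly. Below is a minimal, hypothetical sketch using the node-postgres (`pg`) package; the endpoint, database name, and credentials are placeholders, not values from this demo.

```typescript
import { Client } from 'pg';

// Hypothetical connection details -- substitute your own cluster endpoint and credentials.
const client = new Client({
  host: 'my-cluster.xxxxxxxx.ap-southeast-1.redshift.amazonaws.com', // placeholder endpoint
  port: 5439,            // default Redshift port
  database: 'default_db',
  user: 'admin',
  password: process.env.REDSHIFT_PASSWORD,
  ssl: { rejectUnauthorized: false } // enable TLS; verify the Redshift CA in a real setup
});

async function main() {
  await client.connect();
  const res = await client.query('SELECT current_database(), version();');
  console.log(res.rows[0]);
  await client.end();
}

main().catch(console.error);
```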
🚀 Redshift Cluster Stack
- The Redshift cluster lives inside a VPC, in a private subnet behind a security group, so we first create the VPC stack. To save cost, the maximum number of availability zones (`maxAzs`) is set to 1.
const vpc = new Vpc(this, `${prefix}-vpc`, {
  vpcName: `${prefix}-vpc`,
  maxAzs: 1
});
- We will use an S3 bucket to store the JSON/Parquet data and then load the data from the S3 bucket into the Redshift cluster, following the flow below. Redshift needs an IAM role that it can assume in order to read the data from the S3 bucket.
const s3 = new Bucket(this, `${prefix}-data-ingest`, {
  bucketName: `${prefix}-data-ingest`,
  encryption: BucketEncryption.S3_MANAGED,
  removalPolicy: RemovalPolicy.DESTROY,
  enforceSSL: true,
  blockPublicAccess: BlockPublicAccess.BLOCK_ALL
});

const role = new Role(this, `${prefix}-role`, {
  roleName: `${prefix}-role`,
  assumedBy: new ServicePrincipal('redshift.amazonaws.com')
});

s3.grantRead(role);
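The sample data files still have to land in this bucket before the COPY step later on. The original flow uploads them separately; as an optional sketch (not part of the original stacks), CDK's aws-s3-deployment module could push a local folder into the bucket at deploy time. The `data_sample/` path is an assumption based on the directory used in the PySpark step.

```typescript
import { BucketDeployment, Source } from 'aws-cdk-lib/aws-s3-deployment';

// Optional: copy local sample files into the ingest bucket at deploy time.
// The local path './data_sample' is an assumed location for the JSON/Parquet files.
new BucketDeployment(this, `${prefix}-data-upload`, {
  destinationBucket: s3,
  sources: [Source.asset('./data_sample')]
});
```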
- Create a multi-node Redshift cluster using the default `DC2_LARGE` node type. The `masterUser` is created with its password stored in Secrets Manager (the default option).

const cluster = new Cluster(this, `${prefix}-cluster-demo`, {
  clusterName: `${prefix}-demo`,
  vpc: vpc,
  masterUser: {
    masterUsername: 'admin'
  },
  numberOfNodes: 2,
  clusterType: ClusterType.MULTI_NODE,
  removalPolicy: RemovalPolicy.DESTROY,
  roles: [role]
});
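To make the generated credentials and endpoint easy to find after deployment, the stack could also export them. This is a small optional sketch (not in the original code), using the cluster's `clusterEndpoint` and the Secrets Manager secret that CDK creates for `masterUser`.

```typescript
import { CfnOutput } from 'aws-cdk-lib';

// Optional outputs: the cluster endpoint and the ARN of the auto-generated admin secret.
new CfnOutput(this, 'RedshiftEndpoint', {
  value: `${cluster.clusterEndpoint.hostname}:${cluster.clusterEndpoint.port}`
});

if (cluster.secret) {
  new CfnOutput(this, 'RedshiftSecretArn', { value: cluster.secret.secretArn });
}
```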
- Create an EC2 instance in the same VPC for working with the Redshift cluster through `psql`: a small instance type such as `t3.small`, an instance profile with only `AmazonSSMManagedInstanceCore` (so we can connect via SSM Session Manager), and a `user-data.sh` script that installs the PostgreSQL client at first launch. The instance joins the cluster's security group, and a self-referencing ingress rule allows traffic within that group to reach Redshift.

const clusterSg = cluster.connections.securityGroups[0];
clusterSg.addIngressRule(clusterSg, Port.allTcp(), "Allow internal access Redshift");

const ec2 = new Instance(this, `${prefix}-psql`, {
  instanceName: `${prefix}-psql`,
  vpc: vpc,
  securityGroup: clusterSg,
  instanceType: InstanceType.of(InstanceClass.T3, InstanceSize.SMALL),
  machineImage: new AmazonLinuxImage({ generation: AmazonLinuxGeneration.AMAZON_LINUX_2 }),
  role: new Role(this, `${prefix}-ec2-ssm`, {
    roleName: `${prefix}-ec2-ssm`,
    assumedBy: new ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [{ managedPolicyArn: 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore' }]
  })
});

const userData = readFileSync(resolve(__dirname, './user-data.sh'), 'utf8');
ec2.addUserData(userData);
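The contents of `user-data.sh` are not shown here; per the description above it simply installs the PostgreSQL client on first boot. As a rough sketch, the same thing could be expressed inline with CDK's user-data helper; the `amazon-linux-extras install postgresql10` command is an assumption for Amazon Linux 2, not the repository's actual script.

```typescript
// Hypothetical inline equivalent of user-data.sh: install the psql client on first boot.
// Assumes Amazon Linux 2, where the PostgreSQL client ships via amazon-linux-extras.
ec2.addUserData(
  'sudo amazon-linux-extras install -y postgresql10'
);
```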
🚀 Deploy the Stacks
- The infrastructure code for this project is ready, so we can now deploy the stacks. Optionally, use the `--concurrency` option to speed up deployment and `--require-approval never` to bypass confirmation when creating/updating/removing sensitive resources.
cdk deploy --concurrency 2 --require-approval never
- Check the Redshift cluster
🚀 Working with the Redshift Cluster
- Convert the JSON file to Parquet format using `pyspark`. Parquet overview: Parquet follows the columnar storage model and is available to any project in the Hadoop ecosystem. Unlike the traditional sequential (row-oriented) storage model, where data is written record by record, the columnar storage model stores the values of each column together. The sequential storage model has advantages for transaction processing, but it is not well suited to running analytical queries on big data.
- To run `pyspark`, install it with `pip install pyspark`. The tool also requires Java (OpenJDK).

⚡ $ cd data_sample/
⚡ $ pyspark
Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.0
/_/
>>> df = spark.read.json("amzn_reviews_en.json")
>>> print("Schema: {}".format(df.schema))
>>> df.show()
>>> df.write.parquet("amzn_reviews_en.parquet")
- Connect to the EC2 instance through SSM Session Manager, then use `psql` to connect to the Redshift cluster:

~ $ aws ssm start-session --target i-0be265f7c54177548 --region ap-southeast-1
[root@ip-10-0-147-8 bin]# psql -h sin-d1-redshift-demo.cnozo5w39dmk.ap-southeast-1.redshift.amazonaws.com -U admin -p 5439 -d default_db
Password for user admin:
psql (10.17, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.
default_db=#
default_db=# \c prodreview
prodreview=#
- Create two tables in the `prodreview` database: one for the JSON data and one for the Parquet data (the latter with LZO column encoding):

CREATE TABLE IF NOT EXISTS product_reviews_json(
review_id varchar(100) NOT NULL distkey sortkey,
product_id varchar(100) NOT NULL,
stars varchar(18) NOT NULL,
review_body varchar(10000) NOT NULL,
review_title varchar(1000) NOT NULL,
reviewer_id varchar(128) NOT NULL,
language varchar(20) NOT NULL,
product_category varchar(100) NOT NULL,
primary key(review_id)
);
CREATE TABLE IF NOT EXISTS product_reviews_parquet(
language varchar(20) NOT NULL ENCODE lzo,
product_category varchar(100) NOT NULL ENCODE lzo,
product_id varchar(100) NOT NULL ENCODE lzo,
review_body varchar(10000) NOT NULL ENCODE lzo,
review_id varchar(100) NOT NULL distkey sortkey ENCODE lzo,
review_title varchar(1000) NOT NULL ENCODE lzo,
reviewer_id varchar(128) NOT NULL ENCODE lzo,
stars varchar(18) NOT NULL ENCODE lzo,
primary key(review_id)
);
- Upload `amzn_reviews_en.json` and `amzn_reviews_en.parquet` to S3, then load them into the Redshift database.

amzn_reviews_en.json
prodreview=# copy product_reviews_json
prodreview-# FROM 's3://sin-d1-redshift-data-ingest/amzn_reviews_en.json'
prodreview-# IAM_ROLE 'arn:aws:iam::107858015234:role/sin-d1-redshift-role'
prodreview-# json 'auto ignorecase';
INFO: Load into table 'product_reviews_json' completed, 5000 record(s) loaded successfully.
COPY
prodreview=# SELECT COUNT(*) FROM product_reviews_json;
count
-------
5000
(1 row)
amzn_reviews_en.parquet
prodreview=# copy product_reviews_parquet
FROM 's3://sin-d1-redshift-data-ingest/amzn_reviews_en.parquet'
IAM_ROLE 'arn:aws:iam::107858015234:role/sin-d1-redshift-role'
format as parquet;
INFO: Load into table 'product_reviews_parquet' completed, 5000 record(s) loaded successfully.
COPY
prodreview=# SELECT COUNT(*) FROM product_reviews_parquet;
count
-------
5000
(1 row)
- Compare the load time of the two COPY commands by checking their durations in the stl_query system table:

select datediff(s, starttime, endtime) as duration, *
from stl_query
where query in (
    4231, /* query id of json copy */
    4521  /* query id of parquet copy */
);
- Run the same sample analytical query against both tables, for example counting ratings per star for the kitchen and grocery categories:

prodreview=# SELECT stars, COUNT(stars) total_ratings FROM product_reviews_json WHERE product_category='kitchen' or product_category='grocery'
prodreview-# GROUP BY stars;
stars | total_ratings
-------+---------------
5 | 92
2 | 78
4 | 63
1 | 72
3 | 66
(5 rows)
prodreview=# SELECT stars, COUNT(stars) total_ratings FROM product_reviews_parquet WHERE product_category='kitchen' or product_category='grocery'
GROUP BY stars;
stars | total_ratings
-------+---------------
3 | 66
5 | 92
2 | 78
4 | 63
1 | 72
(5 rows)
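Besides the interactive `psql` session, the same queries can be run programmatically. Below is a minimal sketch using the AWS SDK v3 Redshift Data API; this is an optional alternative, not part of the original demo, and the secret ARN is a placeholder (the cluster identifier matches this demo's cluster name).

```typescript
import {
  RedshiftDataClient,
  ExecuteStatementCommand,
  DescribeStatementCommand,
  GetStatementResultCommand
} from '@aws-sdk/client-redshift-data';

const client = new RedshiftDataClient({ region: 'ap-southeast-1' });

async function countReviews(): Promise<void> {
  // The secret ARN below is a placeholder for the admin secret created by the cluster stack.
  const exec = await client.send(new ExecuteStatementCommand({
    ClusterIdentifier: 'sin-d1-redshift-demo',
    Database: 'prodreview',
    SecretArn: 'arn:aws:secretsmanager:ap-southeast-1:111111111111:secret:redshift-admin', // placeholder
    Sql: 'SELECT COUNT(*) FROM product_reviews_parquet;'
  }));

  // Poll until the statement finishes, then fetch the result set.
  let status = 'SUBMITTED';
  while (status !== 'FINISHED' && status !== 'FAILED' && status !== 'ABORTED') {
    await new Promise((r) => setTimeout(r, 1000));
    const desc = await client.send(new DescribeStatementCommand({ Id: exec.Id! }));
    status = desc.Status ?? 'FAILED';
  }

  if (status === 'FINISHED') {
    const result = await client.send(new GetStatementResultCommand({ Id: exec.Id! }));
    console.log(result.Records);
  }
}

countReviews().catch(console.error);
```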
🚀 Conclusion
- Here we have demonstrated how to create a Redshift cluster and its ETL flow using CDK TypeScript. Through CDK stacks you can update the cluster, add more S3 buckets, or attach more roles to the Redshift cluster for separate ETL processes.
- If you want to destroy all the resources created by the stacks, simply run
cdk destroy --all