Publish Streaming Data into an AWS S3 Data Lake and Query It
Consume streaming data from AWS Kinesis, build a data lake in S3, and run SQL queries from Athena.
The goal is not just to show you our architecture, but to provide the actual CloudFormation to create the entire data lake in a matter of minutes.
The entire architecture is discussed in depth at BGI2020: https://youtu.be/LAs2msAPP4g?t=1334
Source Code:
If you understand the above architecture and want to jump right into the actual code, here is the link to the CloudFormation template. Simply choose any stack name and create the stack from the CloudFormation console or using the AWS CLI. This will create all the necessary resources, including a test Lambda function to publish events to Kinesis. Run the test Lambda from the console using this sample JSON, observe the events in the S3 buckets (within 60 seconds), and query them from Athena.
Overview:
- Running the test Lambda publishes events into Kinesis.
- Firehose consumes the Kinesis events, backs up each source event into a raw-event bucket (30-day retention), converts it to Parquet format based on the Glue schema, determines the partition based on the current time, and writes it to the final S3 bucket, our data lake.
- The Glue table holds all the partition information for the S3 data.
- The Glue crawler job runs every 30 minutes, looks for new documents in the S3 bucket, and creates/updates/deletes partition metadata.
- Run SQL queries in Athena, which uses the Glue table's partition metadata to query S3 efficiently.
Now let's go step by step and understand each resource in the CloudFormation template. Clicking on a heading will lead to the actual resource in the template.
Data Storage:
Resources related to storing data in S3 (a rough sketch follows the list below):
- KMS key to encrypt data written to the buckets.
- The key policy allows Firehose to use this key.
- S3 bucket: our actual data lake.
- Data stored in the bucket is encrypted with the KMS key.
- S3 bucket to store the raw JSON events.
- Documents are deleted after 30 days.
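Here is a minimal CloudFormation sketch of these storage resources. The logical names (DataLakeKmsKey, DataLakeBucket, RawEventsBucket) and the key-policy statements are my own illustrations, not copied from the actual template:

```yaml
# Illustrative fragment -- logical names and policy statements are assumptions, not the actual template.
DataLakeKmsKey:
  Type: AWS::KMS::Key
  Properties:
    Description: Encrypts data written to the data lake buckets
    KeyPolicy:
      Version: '2012-10-17'
      Statement:
        - Sid: AllowAccountAdministration
          Effect: Allow
          Principal:
            AWS: !Sub arn:aws:iam::${AWS::AccountId}:root
          Action: kms:*
          Resource: '*'
        - Sid: AllowFirehoseToUseTheKey
          Effect: Allow
          Principal:
            Service: firehose.amazonaws.com
          Action:
            - kms:Encrypt
            - kms:Decrypt
            - kms:GenerateDataKey
          Resource: '*'

DataLakeBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms
            KMSMasterKeyID: !Ref DataLakeKmsKey

RawEventsBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketEncryption:
      ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms
            KMSMasterKeyID: !Ref DataLakeKmsKey
    LifecycleConfiguration:
      Rules:
        - Id: ExpireRawEventsAfter30Days
          Status: Enabled
          ExpirationInDays: 30
```

The lifecycle rule on the raw bucket is what implements the 30-day retention mentioned above.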
Streaming:
- Kinesis stream with 1 shard (1 partition in Kafka terms).
- Data in Kinesis is encrypted with KMS.
- Test Lambda written in Node.js to publish events into Kinesis.
- Any code shorter than 4,096 characters can be embedded directly into CloudFormation, which is exactly what I did (see the sketch after this list).
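A rough sketch of what the stream and the embedded test Lambda could look like. The runtime version, handler code, and environment variable name are assumptions for illustration, not the actual template:

```yaml
# Illustrative fragment -- names, runtime, and handler code are assumptions.
EventStream:
  Type: AWS::Kinesis::Stream
  Properties:
    ShardCount: 1
    StreamEncryption:
      EncryptionType: KMS
      KeyId: !Ref StreamKmsKey              # hypothetical KMS key for the stream

TestPublisherLambda:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: nodejs12.x
    Handler: index.handler
    Role: !GetAtt LambdaExecutionRole.Arn   # execution role, see the next section
    Environment:
      Variables:
        STREAM_NAME: !Ref EventStream       # Ref returns the stream name
    Code:
      ZipFile: |
        // Publishes the incoming test event into the Kinesis stream
        const AWS = require('aws-sdk');
        const kinesis = new AWS.Kinesis();
        exports.handler = async (event) => {
          await kinesis.putRecord({
            StreamName: process.env.STREAM_NAME,
            PartitionKey: String(Date.now()),
            Data: JSON.stringify(event)
          }).promise();
          return { published: true };
        };
```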
Lambda Execution Role: All the necessary policies for the Lambda function are set here.
- The KMS key is needed to encrypt data written to Kinesis.
- The key policy includes access for the Lambda role to use the key (sketched below).
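A hedged sketch of such an execution role; the policy names, the managed policy, and the exact actions are my assumptions, and the real template may scope them differently:

```yaml
# Illustrative fragment -- the real template's policies and scoping may differ.
LambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    Policies:
      - PolicyName: publish-to-kinesis
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - kinesis:PutRecord
                - kinesis:PutRecords
              Resource: !GetAtt EventStream.Arn
            - Effect: Allow
              Action:
                - kms:GenerateDataKey
              Resource: !GetAtt StreamKmsKey.Arn   # hypothetical key resource
```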
Stream Delivery:
- The role used by Firehose needs several access policies.
- These include access to the Glue table, permission to upload to the S3 buckets, the ability to read the stream, decrypt with the KMS keys, etc.
- Documents are partitioned when written to S3. We can then put a partition range in the WHERE clause of Athena queries to avoid a full bucket scan and save a lot of money and time.
- If there are any exceptions during conversion, Firehose delivers these errors into a separate prefix (ErrorOutputPrefix).
- Data is buffered for 60 seconds or up to 64 MB. Anything smaller than 64 MB is not recommended for Parquet files (in fact, that is the minimum).
- Source events are backed up as-is into the raw bucket as JSON, and the converted events are sent to the actual bucket as Parquet.
- Firehose uses the Glue table schema during the conversion. If a conversion fails, say a timestamp is not in the right format, those events are sent to the error S3 prefix (see the sketch after this list).
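Here is a minimal sketch of the delivery stream. The prefixes, logical names, and partition keys are assumptions chosen to illustrate the points above, not the actual template values:

```yaml
# Illustrative fragment -- prefixes, buffer sizes, and role/bucket names are assumptions.
DeliveryStream:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: KinesisStreamAsSource
    KinesisStreamSourceConfiguration:
      KinesisStreamARN: !GetAtt EventStream.Arn
      RoleARN: !GetAtt FirehoseRole.Arn        # role with Glue/S3/Kinesis/KMS access
    ExtendedS3DestinationConfiguration:
      BucketARN: !GetAtt DataLakeBucket.Arn
      RoleARN: !GetAtt FirehoseRole.Arn
      # Partition by time; Athena WHERE clauses on these keys avoid full bucket scans.
      Prefix: 'data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'
      ErrorOutputPrefix: 'errors/!{firehose:error-output-type}/'
      BufferingHints:
        IntervalInSeconds: 60
        SizeInMBs: 64
      CompressionFormat: UNCOMPRESSED          # Parquet applies its own compression
      S3BackupMode: Enabled                    # keep the raw JSON source events
      S3BackupConfiguration:
        BucketARN: !GetAtt RawEventsBucket.Arn
        RoleARN: !GetAtt FirehoseRole.Arn
      DataFormatConversionConfiguration:
        Enabled: true
        InputFormatConfiguration:
          Deserializer:
            OpenXJsonSerDe: {}
        OutputFormatConfiguration:
          Serializer:
            ParquetSerDe: {}
        SchemaConfiguration:
          DatabaseName: !Ref GlueDatabase
          TableName: !Ref GlueTable
          RoleARN: !GetAtt FirehoseRole.Arn
          Region: !Ref AWS::Region
```

Note that with format conversion to Parquet enabled, CompressionFormat stays UNCOMPRESSED because Parquet handles compression internally.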
Glue Database: The database that holds the Glue table.
- The table format, 'parquet', is specified.
- The schema of the table is specified.
- Actually, creating the table is not mandatory. The crawler job will create it automatically if it doesn't exist, but the crawler was inferring the schema of some attributes, like timestamps, as strings. So I created the Glue table with our schema (see the sketch after this list).
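A sketch of the database and table definitions. The database/table names and the column list are placeholders; the real schema lives in the template:

```yaml
# Illustrative fragment -- names and columns are placeholders for the real schema.
GlueDatabase:
  Type: AWS::Glue::Database
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseInput:
      Name: events_db

GlueTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref GlueDatabase
    TableInput:
      Name: events
      TableType: EXTERNAL_TABLE
      Parameters:
        classification: parquet
      PartitionKeys:
        - { Name: year, Type: string }
        - { Name: month, Type: string }
        - { Name: day, Type: string }
      StorageDescriptor:
        Location: !Sub s3://${DataLakeBucket}/data/
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
        Columns:
          - { Name: event_id, Type: string }       # placeholder columns
          - { Name: event_time, Type: timestamp }  # declared explicitly so it isn't inferred as string
          - { Name: payload, Type: string }
```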
Crawler Role: The role used by the crawler job, which needs access to read the S3 bucket and use the KMS key to decrypt the S3 data.
- Behind the scenes, the crawler job runs a simple Spark job to crawl through our S3 bucket and update the Glue table's partition metadata.
- Scheduled to run every hour at minutes 15 and 45.
- There are several configuration options in the AWS documentation; I used what suits our use case (see the sketch after this list).
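A sketch of the crawler role and the crawler itself. The managed policy and the SchemaChangePolicy values are my assumptions of a typical setup, not necessarily what the template uses:

```yaml
# Illustrative fragment -- the real role is likely scoped more tightly.
CrawlerRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: glue.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
    Policies:
      - PolicyName: read-data-lake
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action: [ 's3:GetObject', 's3:ListBucket' ]
              Resource:
                - !GetAtt DataLakeBucket.Arn
                - !Sub '${DataLakeBucket.Arn}/*'
            - Effect: Allow
              Action: [ 'kms:Decrypt' ]
              Resource: !GetAtt DataLakeKmsKey.Arn

PartitionCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Role: !GetAtt CrawlerRole.Arn
    DatabaseName: !Ref GlueDatabase
    Targets:
      S3Targets:
        - Path: !Sub s3://${DataLakeBucket}/data/
    # Every hour at minutes 15 and 45, i.e. every 30 minutes
    Schedule:
      ScheduleExpression: cron(15,45 * * * ? *)
    SchemaChangePolicy:
      UpdateBehavior: UPDATE_IN_DATABASE
      DeleteBehavior: DELETE_FROM_DATABASE
```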
Once the CloudFormation stack is created, you should see all of the above resources.
Run the Lambda from the console using this sample JSON event, and you should see the data in the S3 bucket within a minute.
Wait for the crawler job to run, or run it manually from the console.
In Athena or Glue, we can see the actual schema of the table we defined in the template, along with its partitions.
Select the data source and table, and run queries from Athena!
I hope this is enough information to help you set up a simple data lake on S3 in no time. Please clap if you liked it and comment if you have any feedback.