Glue ETL Script Example for Loading Tulip Table Data
Article summary

Query Tulip Tables with a Glue ETL script to simplify moving data from Tulip to Redshift (or other data warehouses).

Purpose

This script provides a simple starting point for querying data in Tulip Tables and moving it to Redshift or other data warehouses.

High-level Architecture

This high-level architecture can be used to query data from Tulip's Tables API and then save the results to Redshift for further analytics and processing.


Example Script

The example script below shows how to query a single Tulip Table with Glue ETL (Python Shell) and then write the results to Redshift. NOTE: For scaled production use cases, writing to a temporary S3 bucket and then copying the bucket contents into Redshift is recommended instead. Additionally, the credentials are retrieved from AWS Secrets Manager.


import sys
import pandas as pd
import numpy as np
import requests
import json
import boto3
from botocore.exceptions import ClientError
from sqlalchemy import create_engine
import sqlalchemy as sa
from sqlalchemy.engine.url import URL
import psycopg2
from datetime import datetime
import logging
logger = logging.getLogger()

# ID of the Tulip Table to load
table_id = 'aKzvoscgHCyd2CRu3_DEFAULT'
def get_secret(secret_name, region_name):
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        # For a list of exceptions thrown, see
        # https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
        raise e
    return json.loads(get_secret_value_response['SecretString'])

# retrieve Redshift and Tulip API credentials from AWS Secrets Manager
redshift_credentials = get_secret(secret_name='tulip_redshift', region_name='us-east-1')
api_credentials = get_secret(secret_name='[INSTANCE].tulip.co-API-KEY', region_name='us-east-1')

# build SQL engine
url = URL.create(
    drivername='postgresql',
    host=redshift_credentials['host'],
    port=redshift_credentials['port'],
    database=redshift_credentials['dbname'],
    username=redshift_credentials['username'],
    password=redshift_credentials['password']
)
engine = sa.create_engine(url)

# Tulip Tables API authorization header and base URL
header = {'Authorization' : api_credentials['auth_header']}
base_url = 'https://william.tulip.co/api/v3'


# fetch the first page of records from the Tulip Tables API
offset = 0
function = f'/tables/{table_id}/records?limit=100&offset={offset}&includeTotalCount=false&filterAggregator=all'
r = requests.get(base_url+function, headers=header)
df = pd.DataFrame(r.json())
length = len(r.json())

# page through the remaining records, 100 at a time, until an empty page is returned
while length > 0:
    offset += 100
    function = f'/tables/{table_id}/records?limit=100&offset={offset}&includeTotalCount=false&filterAggregator=all'
    r = requests.get(base_url+function, headers=header)
    length = len(r.json())
    df_append = pd.DataFrame(r.json())
    df = pd.concat([df, df_append], axis=0)
    
# capture date-time stamp
now = datetime.now()
df['datetime_updated'] = now

# write to Redshift
df.to_sql('station_activity_from_glue', engine, schema='product_growth', index=False, if_exists='replace')

Scale Considerations

Consider using S3 as intermediate temporary storage, then copying the data from S3 into Redshift with a COPY command instead of writing it directly to Redshift. This can be more computationally efficient.
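As a rough sketch of that pattern, the write step could stage the DataFrame as a CSV in S3 and then issue a Redshift COPY. The bucket name, object key, and IAM role ARN below are placeholders, the target table is assumed to already exist with matching columns, and the df, engine, sa, and boto3 objects are reused from the example script above:

import io

# placeholder names; replace with your own bucket, key, and Redshift IAM role
S3_BUCKET = 'tulip-etl-staging'
S3_KEY = 'staging/station_activity.csv'
IAM_ROLE_ARN = 'arn:aws:iam::123456789012:role/redshift-copy-role'

# stage the DataFrame in S3 as a CSV object
buffer = io.StringIO()
df.to_csv(buffer, index=False)
boto3.client('s3').put_object(Bucket=S3_BUCKET, Key=S3_KEY, Body=buffer.getvalue())

# copy the staged file into Redshift (assumes the target table already exists)
copy_sql = f"""
    COPY product_growth.station_activity_from_glue
    FROM 's3://{S3_BUCKET}/{S3_KEY}'
    IAM_ROLE '{IAM_ROLE_ARN}'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""
with engine.begin() as conn:
    conn.execute(sa.text(copy_sql))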

Additionally, you can use table metadata from the Tables API to write all Tulip Tables to a data warehouse instead of loading them one at a time.
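For example, the Tables API can be used to enumerate tables and load each one in a loop. This sketch assumes a GET /tables endpoint that returns each table's id and label (confirm the exact response shape against the Tables API documentation), reuses the header, base_url, and engine objects from the script above, and uses an illustrative target-table naming scheme:

# enumerate all tables, then load each one into its own Redshift table
# NOTE: assumes GET /tables returns a list of objects with 'id' and 'label' fields
tables = requests.get(base_url + '/tables', headers=header).json()

for table in tables:
    records = []
    offset = 0
    while True:
        function = f"/tables/{table['id']}/records?limit=100&offset={offset}&includeTotalCount=false&filterAggregator=all"
        page = requests.get(base_url + function, headers=header).json()
        if not page:
            break
        records.extend(page)
        offset += 100

    if records:
        table_df = pd.DataFrame(records)
        table_df['datetime_updated'] = datetime.now()
        # illustrative naming scheme: derive the Redshift table name from the Tulip table label
        target = 'tulip_' + table['label'].lower().replace(' ', '_')
        table_df.to_sql(target, engine, schema='product_growth', index=False, if_exists='replace')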

Finally, this example script overwrites the entire table each time. A more efficient method would be to upsert only the rows that have been modified since the last update or query.
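As a sketch of an incremental approach, the load could look up the timestamp of the previous run in Redshift, fetch only records modified since then, and delete and re-insert just those rows. The filter query parameters and the _updatedAt field name below are hypothetical (check the Tables API documentation for the exact filter syntax), and pagination and the empty-table case are omitted for brevity:

# look up the high-water mark from the previous load
with engine.connect() as conn:
    last_run = conn.execute(sa.text(
        "SELECT MAX(datetime_updated) FROM product_growth.station_activity_from_glue"
    )).scalar()

# fetch only records changed since the last run
# NOTE: hypothetical filter syntax and field name; verify against the Tables API docs
function = (
    f'/tables/{table_id}/records?limit=100&offset=0'
    f'&filters.0.field=_updatedAt'
    f'&filters.0.functionType=greaterThan'
    f'&filters.0.arg={last_run.isoformat()}'
    f'&filterAggregator=all'
)
changed = pd.DataFrame(requests.get(base_url + function, headers=header).json())

if not changed.empty:
    changed['datetime_updated'] = datetime.now()
    # delete-then-insert upsert keyed on the record id column
    ids = ", ".join(f"'{i}'" for i in changed['id'])
    with engine.begin() as conn:
        conn.execute(sa.text(
            f"DELETE FROM product_growth.station_activity_from_glue WHERE id IN ({ids})"
        ))
    changed.to_sql('station_activity_from_glue', engine, schema='product_growth',
                   index=False, if_exists='append')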

Next Steps

For further reading, check out the AWS Well-Architected Framework. It is a great resource for understanding optimal methods for data flows and integrations.

