Airflow S3 Hook: Download a File

The primary method is S3Hook.download_file(). It pulls a file from an S3 bucket and stores it on the local file system of the machine where the Airflow worker is running.

Syntax and Parameters

```python
hook.download_file(
    key='path/to/my_file.csv',
    bucket_name='my-data-bucket',
    local_path='/tmp/',
)
```

- key: the S3 object key (the path within the bucket).
- bucket_name: the name of the S3 bucket.
- local_path: the local directory to download the file into; the method returns the full path of the downloaded file.
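One behavior worth knowing: in recent releases of the Amazon provider, download_file writes to an autogenerated file name (inside an autogenerated subdirectory) under local_path rather than reusing the S3 object's name. A hedged sketch, assuming a provider version that supports the preserve_file_name flag; older versions may not accept it:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id='aws_default')
# Assumes a recent apache-airflow-providers-amazon release; older provider
# versions may not accept preserve_file_name.
path = hook.download_file(
    key='raw/data.json',
    bucket_name='my-airflow-bucket',
    local_path='/tmp/',
    preserve_file_name=True,  # keep "data.json" instead of a generated name
)
```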
A complete DAG that wires the download into a PythonOperator:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def download_from_s3(key, bucket_name, local_path):
    # Initialize the hook with your AWS connection ID
    hook = S3Hook(aws_conn_id='aws_default')
    # download_file returns the full local path of the downloaded file
    downloaded_file_path = hook.download_file(
        key=key,
        bucket_name=bucket_name,
        local_path=local_path,
    )
    print(f"File downloaded to: {downloaded_file_path}")
    return downloaded_file_path


with DAG('s3_download_dag', start_date=datetime(2024, 1, 1), schedule=None) as dag:
    task_download = PythonOperator(
        task_id='download_s3_file',
        python_callable=download_from_s3,
        op_kwargs={
            'key': 'raw/data.json',
            'bucket_name': 'my-airflow-bucket',
            'local_path': '/tmp/',
        },
    )
```
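Because PythonOperator pushes its callable's return value to XCom, a downstream task can pick up the downloaded path without re-deriving it. A minimal sketch, assuming it sits inside the same with DAG(...) block as above (process_file and task_process are illustrative names):

```python
def process_file(ti):
    # Pull the local path returned by download_from_s3 from XCom
    path = ti.xcom_pull(task_ids='download_s3_file')
    print(f"Processing {path}")

task_process = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
)
task_download >> task_process
```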
Key Considerations and Best Practices

Before using the hook, you must define an AWS connection in the Airflow UI:

- Connection ID: aws_default (or your preferred name)
- Conn Type: Amazon Web Services
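Once the connection is defined, a quick way to verify it is one of the hook's lightweight methods; for example, S3Hook.check_for_key returns a boolean without downloading anything. A sketch reusing the bucket and key from the example above:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id='aws_default')
# Returns True if the object exists and the credentials can reach the bucket
exists = hook.check_for_key(key='raw/data.json', bucket_name='my-airflow-bucket')
print(exists)
```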
Always use IAM roles or Airflow connections instead of hardcoding AWS credentials in your DAG files, for better security.
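If the worker already has an IAM role attached (for example, an EC2 instance profile or an EKS/ECS task role), you can avoid storing credentials entirely: passing aws_conn_id=None makes the hook fall back to boto3's default credential chain. A minimal sketch:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# With no connection ID, the hook defers to boto3's default credential
# chain (environment variables, instance/task IAM role, etc.), so no
# secrets live in the DAG file or the connection store.
hook = S3Hook(aws_conn_id=None)
```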
Remember that files are downloaded to the worker's local storage. If you are processing very large files (several GB or more) in a containerized environment such as Kubernetes, make sure the worker's ephemeral storage is sufficient.
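For files too large to comfortably land on local disk, one alternative is to stream the object instead of downloading it: S3Hook.get_key returns a boto3 S3.Object whose body can be read incrementally. A sketch under that assumption (process_line is a hypothetical placeholder for your own logic):

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id='aws_default')
# get_key returns a boto3 S3.Object; read its body line by line
# instead of writing the whole file to the worker's disk.
obj = hook.get_key(key='raw/data.json', bucket_name='my-airflow-bucket')
for line in obj.get()['Body'].iter_lines():
    process_line(line)  # hypothetical placeholder
```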