
Data Engineering: Data Pipeline and API Challenge

Overview

This project implements a scalable data pipeline that ingests data from an SFTP location, processes it for analytical use, and exposes it through a secure API with date-based filtering and cursor-based pagination.

Background

The project is designed for an on-premises server environment and aims to:

  • Ingest data from an SFTP location.
  • Process and clean the data for consistency and structure.
  • Expose an API for external clients to access the processed data with support for date filtering and cursor-based pagination.

Requirements

1. Data Ingestion

  • Retrieve data files from an SFTP location.
  • Support multiple data formats (CSV and JSON).
  • Implement error handling for failed transfers and generate alerts.

2. Data Processing

  • Flatten and clean the ingested data for business consumption.
  • Apply data quality measures to ensure consistency.

3. API Development

  • Develop an API for external clients to access processed data.
  • Implement date filtering and cursor-based pagination.
  • Ensure basic security and rate limiting.

Deliverables

1. Working Data Pipeline

  • A script or automated process to ingest data from the SFTP location.
  • A robust pipeline capable of handling data inconsistencies.
  • Curated datasets ready for business consumption.

2. API with Documentation

  • A functional API enabling data retrieval with filtering and pagination.
  • Detailed API documentation covering authentication, endpoints, and rate limits.

Implementation

Data Ingestion

  • Uses Paramiko to connect securely to an SFTP server and retrieve files (see the sketch below).
  • Supports both CSV and JSON formats.
  • Logs all operations and raises alerts on failure.
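
As a concrete illustration, here is a minimal sketch of the ingestion step. The host, credentials, directory names, and alerting hook are placeholders for illustration, not the project's actual configuration.

import logging
from pathlib import Path

import paramiko

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")

# Hypothetical connection settings -- replace with the real SFTP details.
SFTP_HOST = "sftp.example.com"
SFTP_PORT = 22
SFTP_USER = "pipeline"
SFTP_PASSWORD = "change-me"
REMOTE_DIR = "/outbound"
LOCAL_DIR = Path("data/raw")

def fetch_files() -> list[Path]:
    """Download every CSV/JSON file from the remote directory."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    downloaded: list[Path] = []
    transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
    try:
        transport.connect(username=SFTP_USER, password=SFTP_PASSWORD)
        sftp = paramiko.SFTPClient.from_transport(transport)
        for name in sftp.listdir(REMOTE_DIR):
            if not name.lower().endswith((".csv", ".json")):
                continue  # only CSV and JSON are supported
            local_path = LOCAL_DIR / name
            sftp.get(f"{REMOTE_DIR}/{name}", str(local_path))
            logger.info("Downloaded %s", name)
            downloaded.append(local_path)
    except Exception:
        logger.exception("SFTP transfer failed")  # hook for alerting
        raise
    finally:
        transport.close()
    return downloaded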

Data Processing

  • Cleans and flattens ingested data using Pandas (see the sketch below).
  • Converts relevant fields to appropriate types (e.g., date parsing).
  • Stores curated datasets in a structured format for easy access.
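
A minimal sketch of the processing step is shown below; the column names (e.g. a date column) and output layout are assumptions for illustration rather than the project's actual schema.

import json
from pathlib import Path

import pandas as pd

def clean_file(path: Path, out_dir: Path = Path("data/curated")) -> Path:
    """Flatten, clean, and store a single ingested file."""
    if path.suffix.lower() == ".json":
        # Flatten nested JSON records into tabular columns.
        records = json.loads(path.read_text())
        df = pd.json_normalize(records)
    else:
        df = pd.read_csv(path)

    # Basic data-quality measures: drop exact duplicates and normalise headers.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Hypothetical date column; parse it so the API can filter on it later.
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{path.stem}_clean.csv"
    df.to_csv(out_path, index=False)
    return out_path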

API Development

  • Built with FastAPI for high performance (see the sketch below).
  • Implements date filtering and cursor-based pagination.
  • Secured with API key authentication.
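
The sketch below shows how the /data endpoint could combine API key authentication, date filtering, and cursor-based pagination in FastAPI. The file layout, key store, and response fields are illustrative assumptions, and rate limiting is omitted for brevity.

import pandas as pd
from fastapi import Depends, FastAPI, HTTPException, Query, Security
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")
VALID_KEYS = {"your_api_key"}  # hypothetical; load from configuration in practice

def require_api_key(key: str = Security(api_key_header)) -> str:
    if key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return key

@app.get("/data")
def get_data(
    file: str,
    start_date: str | None = None,
    end_date: str | None = None,
    cursor: int = 0,                 # row offset used as a simple cursor
    limit: int = Query(10, le=100),  # page size, capped
    _: str = Depends(require_api_key),
):
    # Hypothetical curated-data layout with a parsed "date" column.
    df = pd.read_csv(f"data/curated/{file}", parse_dates=["date"])
    if start_date:
        df = df[df["date"] >= pd.Timestamp(start_date)]
    if end_date:
        df = df[df["date"] <= pd.Timestamp(end_date)]
    page = df.iloc[cursor : cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(df) else None
    return {"items": page.to_dict(orient="records"), "next_cursor": next_cursor}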

API Usage

Endpoints

Endpoint   Method   Description
/files     GET      List available cleaned data files.
/data      GET      Retrieve filtered loan data with pagination.
/table     GET      Render data as an HTML table.

Authentication

All API requests require an API Key (X-API-Key) in the request header.

Example Request (Using curl)

curl -H "X-API-Key: your_api_key" "http://127.0.0.1:8000/data?file=data.csv&start_date=2016-01-01&end_date=2016-12-31&limit=10"
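
For clients that want to walk through every page, a small Python example is sketched below. The items and next_cursor response fields are assumptions based on the pagination scheme described above.

import requests

BASE_URL = "http://127.0.0.1:8000"
HEADERS = {"X-API-Key": "your_api_key"}

params = {
    "file": "data.csv",
    "start_date": "2016-01-01",
    "end_date": "2016-12-31",
    "limit": 10,
}

rows = []
cursor = None
while True:
    if cursor is not None:
        params["cursor"] = cursor
    resp = requests.get(f"{BASE_URL}/data", headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    rows.extend(payload["items"])        # response field names are assumptions
    cursor = payload.get("next_cursor")
    if cursor is None:                   # no more pages
        break

print(f"Fetched {len(rows)} rows")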

Assumptions

  • A simulated SFTP server is used for data ingestion.
  • Sample loan data is used to demonstrate functionality.
  • The API is hosted locally (127.0.0.1:8000).

Deployment

Clone the repository and install dependencies:

git clone <repository-url>
cd <project-folder>
pip install -r requirements.txt
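
For reference, the stack described in this README implies a requirements.txt along these lines (the repository's actual file is authoritative):

fastapi
uvicorn
paramiko
pandas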

Run the FastAPI server:

uvicorn main:app --host 0.0.0.0 --port 8000

Version Control

This project follows an iterative development approach using Git hosted on GitHub/GitLab/Bitbucket.

Conclusion

This project successfully implements a data pipeline and API to manage and serve processed data efficiently. The pipeline ensures data consistency and quality, while the API provides structured access with filtering and pagination capabilities.

Happy Coding! 🚀
