Project Priority: 🥇 MOST IMPORTANT
Version: 1.0 MVP
Status: Ready for Development
Build a lightweight but real data-quality and ETL framework that ingests messy warehouse data from multiple sources, validates it against business rules, and outputs SAP-EWM-ready flat files (or tables) with a simple quality score. The MVP should convincingly show how this prevents classic SAP implementation failures caused by bad master/transaction data.
Why This Matters: “Neglecting Data Quality and Underestimating Migration Challenges” is the #8 reason SAP implementations fail. “Poor Master Data Quality” is specifically listed as mistake #2 in EWM implementations. This project addresses the pain point that causes 80% of project overruns.
At least 3 independent sources:
- materials.csv – SKU master (ID, description, weight, volume, unit)
- locations.csv – bin/location master (warehouse, storage type, bin)
- transactions.csv – inbound/outbound warehouse movements
- Optional: external weather.csv to demonstrate external feed integration
Data Requirements:
- Minimum 1,000 materials
- Minimum 500 storage locations
- Minimum 5,000 transaction records
- Include realistic data quality issues: duplicates, missing values, invalid formats, referential integrity violations
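A hypothetical generator for a messy materials sample is sketched below; the column names, sizes, and kinds of injected issues are illustrative assumptions, not a fixed specification.

```python
# Hypothetical sample-data generator; column names and issue counts are assumptions.
from pathlib import Path
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
materials = pd.DataFrame({
    "MATERIAL_ID": [f"MAT{100000 + i}" for i in range(n)],
    "DESCRIPTION": [f"Material {i}" for i in range(n)],
    "WEIGHT_KG": rng.uniform(0.1, 50.0, n).round(3),
    "VOLUME_M3": rng.uniform(0.001, 2.0, n).round(4),
    "UNIT": rng.choice(["EA", "KG", "PAL"], n),
})

# Inject realistic quality issues: missing values, invalid negatives, duplicate IDs
materials.loc[rng.choice(n, 30, replace=False), "DESCRIPTION"] = None
materials.loc[rng.choice(n, 10, replace=False), "WEIGHT_KG"] *= -1
dupes = materials.sample(20, random_state=1).assign(DESCRIPTION="Duplicate entry")
materials = pd.concat([materials, dupes], ignore_index=True)

Path("sample_data").mkdir(exist_ok=True)
materials.to_csv("sample_data/materials.csv", index=False)
```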
Python scripts to:
- Load CSVs (simulate legacy WMS/ERP files)
- Handle encoding issues, missing columns, and schema changes (e.g., extra column)
- Log ingestion metadata (rows read, errors encountered, timestamps)
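A minimal sketch of one ingestion helper, assuming the expected columns come from config_sources.yaml; the encoding fallback and logged fields are illustrative choices.

```python
# Sketch of a tolerant CSV loader; expected_columns would come from the config file.
import logging
import time
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def load_csv(path: str, expected_columns: list[str]) -> pd.DataFrame:
    """Load a legacy CSV, tolerating encoding issues and schema drift."""
    start = time.time()
    try:
        df = pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        df = pd.read_csv(path, encoding="latin-1")  # common legacy-export fallback

    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    log.info("%s: %d rows, missing=%s, extra=%s, %.2fs",
             path, len(df), missing, extra, time.time() - start)
    return df
```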
Configuration:
- One configuration file (config_sources.yaml) describing:
  - File path
  - Expected schema
  - Load frequency (daily vs ad-hoc)
  - Data source type (master vs transaction)
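A possible shape for config_sources.yaml; the key names and schemas below are assumptions, shown only to make the structure concrete.

```yaml
# Hypothetical structure for config_sources.yaml; key names are assumptions.
sources:
  materials:
    path: sample_data/materials.csv
    type: master            # master vs transaction
    load_frequency: daily   # daily vs ad-hoc
    expected_schema:
      MATERIAL_ID: string
      DESCRIPTION: string
      WEIGHT_KG: float
      VOLUME_M3: float
      UNIT: string
  transactions:
    path: sample_data/transactions.csv
    type: transaction
    load_frequency: daily
    expected_schema:
      TRANSACTION_ID: string
      MATERIAL_ID: string
      STORAGE_BIN: string
      QUANTITY: float
      MOVEMENT_DATE: date
```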
Use Great Expectations as the main quality engine:
Schema Checks:
- Mandatory columns present (e.g., MATERIAL_ID, PLANT, STORAGE_BIN)
- Data types: numeric vs string vs date
- Column count matches the expected schema
Business Rules (Examples):
- Material IDs are unique and non-null
- Storage bins follow a pattern (e.g., A-01-01 format)
- No transaction references a non-existent material or bin
- Quantities are positive for inbound/outbound movements; zero is allowed only for adjustments
- Dates are within a valid range (no future dates for historical data)
- Weight/volume values are positive numbers
- Plant codes match the expected list
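A minimal sketch of a few of these rules, assuming the classic (pre-1.0) Great Expectations PandasDataset API; the column names, bin-pattern regex, and plant codes are illustrative assumptions.

```python
# Sketch using the classic great_expectations PandasDataset API (pre-1.0);
# column names, regex, and plant codes are assumptions.
import great_expectations as ge
import pandas as pd

materials = ge.from_pandas(pd.read_csv("sample_data/materials.csv"))
materials.expect_column_values_to_not_be_null("MATERIAL_ID")
materials.expect_column_values_to_be_unique("MATERIAL_ID")
materials.expect_column_values_to_be_between("WEIGHT_KG", min_value=0)

locations = ge.from_pandas(pd.read_csv("sample_data/locations.csv"))
locations.expect_column_values_to_match_regex("STORAGE_BIN", r"^[A-Z]-\d{2}-\d{2}$")
locations.expect_column_values_to_be_in_set("PLANT", ["1000", "2000"])

result = materials.validate()                       # per-table validation result
print(result.statistics["success_percent"])        # feeds the quality score
```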
Quality Score Calculation:
- Per table: % of rows passing all tests
- Overall: weighted score (Materials 40%, Locations 40%, Transactions 20%)
- Output: JSON report with detailed breakdown
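A small sketch of the weighted overall score using the weights stated above; the example per-table scores and the JSON layout are illustrative assumptions.

```python
# Sketch of the weighted overall score; example values and JSON layout are assumptions.
import json
from pathlib import Path

def overall_score(per_table: dict[str, float]) -> float:
    """per_table maps table name to % of rows passing all tests (0-100)."""
    weights = {"materials": 0.4, "locations": 0.4, "transactions": 0.2}
    return round(sum(per_table[t] * w for t, w in weights.items()), 2)

scores = {"materials": 92.5, "locations": 88.0, "transactions": 97.1}  # example values
report = {"per_table": scores, "overall": overall_score(scores)}

Path("reports").mkdir(exist_ok=True)
with open("reports/data_quality_report.json", "w") as fh:
    json.dump(report, fh, indent=2)
```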
Standardization:
- Date formats to ISO (YYYY-MM-DD)
- Units for weight/volume (e.g., all weights in KG, all volumes in cubic meters)
- Text normalization (trim whitespace, uppercase codes)
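A sketch of the standardization step on the materials table; the column names (CREATED_ON, WEIGHT, WEIGHT_UNIT, MATERIAL_ID) are assumptions for illustration.

```python
# Standardization sketch; column names are assumptions.
import pandas as pd

def standardize_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Dates to ISO (YYYY-MM-DD)
    df["CREATED_ON"] = pd.to_datetime(df["CREATED_ON"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Unit conversion: grams to kilograms where the unit column says G
    grams = df["WEIGHT_UNIT"].str.upper().eq("G")
    df.loc[grams, "WEIGHT"] = df.loc[grams, "WEIGHT"] / 1000
    df.loc[grams, "WEIGHT_UNIT"] = "KG"
    # Text normalization: trim whitespace, uppercase codes
    df["MATERIAL_ID"] = df["MATERIAL_ID"].str.strip().str.upper()
    return df
```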
Deduplication:
- Duplicate materials (same ID, different description) → keep latest or flag
- Duplicate locations → merge or flag
- Duplicate transactions → flag for review
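A sketch of the "keep latest or flag" rule for materials, assuming a CREATED_ON column exists to order versions.

```python
# Deduplication sketch: flag all duplicate IDs, then keep the latest record per ID.
import pandas as pd

def dedupe_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["IS_DUPLICATE"] = df.duplicated(subset="MATERIAL_ID", keep=False)  # flag for review
    # Keep the latest row per MATERIAL_ID (assumes a CREATED_ON column)
    return (df.sort_values("CREATED_ON")
              .drop_duplicates(subset="MATERIAL_ID", keep="last"))
```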
Data Enrichment:
- Add calculated fields (e.g., volume_per_unit, density)
- Flag records requiring manual review
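A sketch of the enrichment step; the density threshold used to trigger a manual-review flag is an illustrative assumption.

```python
# Enrichment sketch: derived fields plus a manual-review flag; threshold is illustrative.
import numpy as np
import pandas as pd

def enrich_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["DENSITY_KG_M3"] = df["WEIGHT_KG"] / df["VOLUME_M3"].replace(0, np.nan)
    df["NEEDS_REVIEW"] = df["DENSITY_KG_M3"].isna() | (df["DENSITY_KG_M3"] > 10_000)
    return df
```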
Output:
- Cleaned tables: materials_clean.csv, locations_clean.csv, transactions_clean.csv
- A single to_sap/ folder that mimics what you would hand to LSMW
- Validation report: data_quality_report.json
CLI Output:
- Summary per run:
  - Rows ingested, rows failed validation, quality score
  - Execution time
  - Top 5 failing rules & counts
  - List of records requiring manual review
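A sketch of the CLI summary command, assuming Click (listed under Interface below) and the JSON report written by the scoring step; the key names are assumptions.

```python
# CLI sketch using Click; report path and key names are assumptions.
import json
import click

@click.command()
@click.option("--report", default="reports/data_quality_report.json", show_default=True)
def summary(report: str) -> None:
    """Print the per-run summary from the latest quality report."""
    with open(report) as fh:
        data = json.load(fh)
    click.echo(f"Overall quality score: {data['overall']}%")
    for table, score in data["per_table"].items():
        click.echo(f"  {table:<14} {score}%")

if __name__ == "__main__":
    summary()
```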
HTML/Markdown Report:
- Overall quality scorecard
- Table-by-table status (OK / WARN / FAIL)
- Visual charts (bar charts for validation results)
- Recommendations for fixing common issues
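A sketch of the scorecard rendering with Jinja2 (listed under Interface below); the inline template, example scores, and OK/WARN/FAIL thresholds are illustrative assumptions.

```python
# Jinja2 rendering sketch; template, scores, and status thresholds are assumptions.
from pathlib import Path
from jinja2 import Template

TEMPLATE = Template("""
<h1>Data Quality Scorecard</h1>
<p>Overall score: {{ overall }}%</p>
<table>
  <tr><th>Table</th><th>Score</th><th>Status</th></tr>
  {% for row in rows %}
  <tr><td>{{ row.name }}</td><td>{{ row.score }}%</td><td>{{ row.status }}</td></tr>
  {% endfor %}
</table>
""")

scores = {"materials": 92.5, "locations": 88.0, "transactions": 97.1}
rows = [
    {"name": t, "score": s, "status": "OK" if s >= 95 else "WARN" if s >= 80 else "FAIL"}
    for t, s in scores.items()
]

Path("reports").mkdir(exist_ok=True)
with open("reports/data_quality_report.html", "w", encoding="utf-8") as fh:
    fh.write(TEMPLATE.render(overall=91.2, rows=rows))
```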
Optional (Nice-to-Have):
- Minimal Streamlit page with:
  - Overall score dashboard
  - Table-by-table status (OK / WARN / FAIL)
  - Interactive drill-down into failing records
Core:
- Python 3.9+
- Pandas / Polars (for data manipulation)
- Great Expectations (for validation framework)
- Pydantic (for data validation)
- PyYAML (for configuration)
SQL & Database:
- SQL (for data extraction and validation queries)
- Direct SQL queries mirroring SAP Open SQL patterns
- Understanding of EWM data structures (materials, storage bins, handling units)
- Optional: DuckDB for faster querying
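A sketch of a referential-integrity check expressed in SQL over the raw CSVs, using the optional DuckDB backend; file paths and the MATERIAL_ID column name are assumptions.

```python
# DuckDB sketch: find transactions that reference a material missing from the master.
import duckdb

orphans = duckdb.sql("""
    SELECT t.*
    FROM 'sample_data/transactions.csv' AS t
    LEFT JOIN 'sample_data/materials.csv' AS m USING (MATERIAL_ID)
    WHERE m.MATERIAL_ID IS NULL
""").df()
print(f"{len(orphans)} transactions reference a non-existent material")
```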
SAP-EWM Alignment:
- Data structures aligned with SAP EWM master data (materials, storage types, bins)
- Validation rules mirroring EWM business logic
- Output format compatible with LSMW (Legacy System Migration Workbench)
- Understanding of IDoc structures for future integration
Storage:
- Local CSV/Parquet files
- Optional: DuckDB for faster querying
Interface:
- CLI (using Click or argparse)
- HTML report generation (using Jinja2 templates)
- Optional: Streamlit v1 screen
Testing:
- pytest for unit tests
- Sample test datasets with known issues
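A sketch of one unit test against a tiny in-memory fixture, assuming the dedupe_materials helper sketched earlier lives in a hypothetical transformation/cleaning.py module.

```python
# pytest sketch; the import path transformation.cleaning is a hypothetical layout.
import pandas as pd
from transformation.cleaning import dedupe_materials

def test_dedupe_keeps_latest_record():
    df = pd.DataFrame({
        "MATERIAL_ID": ["MAT1", "MAT1", "MAT2"],
        "DESCRIPTION": ["old", "new", "other"],
        "CREATED_ON": ["2023-01-01", "2024-01-01", "2024-01-01"],
    })
    cleaned = dedupe_materials(df)
    assert len(cleaned) == 2
    assert cleaned.loc[cleaned["MATERIAL_ID"] == "MAT1", "DESCRIPTION"].item() == "new"
```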
The pipeline runs end-to-end with python pipeline_run.py and delivers:
- pipeline_run.py - Main orchestrator
- ingestion/ - Data loading modules
- validation/ - Great Expectations suite
- transformation/ - Data cleaning logic
- config_sources.yaml - Configuration file
- requirements.txt - Dependencies
- sample_data/materials.csv
- sample_data/locations.csv
- sample_data/transactions.csv
- sample_data/weather.csv (optional)
- to_sap/ folder with cleaned files
- reports/data_quality_report.html
- reports/data_quality_report.json

Week 1-2: Foundation
- Set up project structure
- Implement basic ingestion layer
- Create sample datasets with known issues
Week 3-4: Validation
- Implement Great Expectations suite
- Create business rules
- Build quality scoring logic
Week 4-5: Transformation
- Implement data cleaning logic
- Create SAP-ready output format
- Build reporting module
Week 6: Polish & Documentation
- Add CLI interface
- Create HTML reports
- Write documentation
- Prepare demo
How Python Skills Complement SAP Technologies:
While ABAP is the core language for SAP EWM customization (BAdIs, enhancements, custom programs), this Python-based data quality pipeline demonstrates complementary skills.
Key Differentiator: Most SAP consultants can configure EWM; fewer understand why implementations fail. This project demonstrates enterprise thinking about data governance and quality—critical for successful SAP projects.