PRD: Enterprise ETL & Data Quality Pipeline for SAP EWM (MVP)

Project Priority: 🥇 MOST IMPORTANT
Version: 1.0 MVP
Status: Ready for Development


1. Product Vision (MVP)

Build a lightweight but real data-quality and ETL framework that ingests messy warehouse data from multiple sources, validates it against business rules, and outputs SAP-EWM-ready flat files (or tables) with a simple quality score. The MVP should convincingly show how this prevents classic SAP implementation failures caused by bad master/transaction data.

Why This Matters: “Neglecting Data Quality and Underestimating Migration Challenges” is the #8 reason SAP implementations fail. “Poor Master Data Quality” is specifically listed as mistake #2 in EWM implementations. This project targets a pain point widely cited as a leading driver of project overruns.


2. Users & Jobs-to-be-Done

Primary User: SAP EWM Functional Consultant

Secondary User: Data/IT Owner (Legacy WMS/ERP)


3. MVP Scope (Must-Haves)

3.1 Data Sources (Simulated/Real)

At least 3 independent sources:

- materials.csv – SKU master (ID, description, weight, volume, unit)
- locations.csv – bin/location master (warehouse, storage type, bin)
- transactions.csv – inbound/outbound warehouse movements

Optional: External weather.csv to demonstrate external feed integration

Data Requirements:

- Minimum 1,000 materials
- Minimum 500 storage locations
- Minimum 5,000 transaction records
- Include realistic data quality issues: duplicates, missing values, invalid formats, referential integrity violations

3.2 Ingestion Layer (MVP)

Python scripts to:

- Load CSVs (simulate legacy WMS/ERP files)
- Handle encoding issues, missing columns, and schema changes (e.g., extra column)
- Log ingestion metadata (rows read, errors encountered, timestamps)
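A minimal sketch of the ingestion step using only the standard library; the encoding fallback list, logger name, and field names are illustrative, not prescribed:

```python
import csv
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def load_csv(path, encodings=("utf-8", "latin-1")):
    """Load a legacy CSV, trying fallback encodings, and log ingestion metadata."""
    started = datetime.now(timezone.utc).isoformat()
    last_error = None
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as fh:
                rows = list(csv.DictReader(fh))
            log.info("loaded %s: %d rows, encoding=%s, started=%s",
                     path, len(rows), enc, started)
            return rows
        except UnicodeDecodeError as exc:  # try the next fallback encoding
            last_error = exc
    raise ValueError(f"could not decode {path}: {last_error}")
```

`csv.DictReader` also makes schema drift visible: an extra column simply appears as an extra key, which the validation layer can then flag against the expected schema.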

Configuration:

- One configuration file (config_sources.yaml) describing:
  - File path
  - Expected schema
  - Load frequency (daily vs ad-hoc)
  - Data source type (master vs transaction)
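A config_sources.yaml entry could capture those fields along these lines (key names and paths are illustrative, not a fixed contract):

```yaml
sources:
  materials:
    path: data/raw/materials.csv
    type: master          # master vs transaction
    frequency: daily      # daily vs ad-hoc
    schema:
      MATERIAL_ID: string
      DESCRIPTION: string
      WEIGHT: float
      VOLUME: float
      UNIT: string
```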

3.3 Validation & Business Rules (Core)

Use Great Expectations as the main quality engine:

Schema Checks:

- Mandatory columns present (e.g., MATERIAL_ID, PLANT, STORAGE_BIN)
- Data types: numeric vs string vs date
- Column count matches expected schema

Business Rules (Examples):

- Material IDs are unique and non-null
- Storage bins follow a pattern (e.g., A-01-01 format)
- No transaction references a non-existent material or bin
- Quantities are positive for inbound/outbound; zero allowed only for adjustments
- Dates are within a valid range (no future dates for historical data)
- Weight/volume values are positive numbers
- Plant codes match an expected list
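Great Expectations is the planned engine; as a plain-Python illustration of the rule logic only (the bin regex, plant list, and column names are assumptions):

```python
import re

BIN_PATTERN = re.compile(r"^[A-Z]-\d{2}-\d{2}$")   # e.g. A-01-01
VALID_PLANTS = {"1000", "2000"}                     # illustrative plant list

def check_material(row, seen_ids):
    """Return a list of rule violations for one material row."""
    errors = []
    mid = row.get("MATERIAL_ID")
    if not mid:
        errors.append("MATERIAL_ID is null")
    elif mid in seen_ids:
        errors.append("duplicate MATERIAL_ID")
    else:
        seen_ids.add(mid)
    if float(row.get("WEIGHT", 0) or 0) <= 0:
        errors.append("non-positive WEIGHT")
    if row.get("PLANT") not in VALID_PLANTS:
        errors.append("unknown PLANT")
    return errors

def check_bin(bin_code):
    """True if the bin code matches the expected A-01-01 pattern."""
    return bool(BIN_PATTERN.match(bin_code or ""))
```

Each rule returns a named violation rather than a boolean, so the same output can feed both the quality score and the top-failing-rules summary.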

Quality Score Calculation:

- Per table: % of rows passing all tests
- Overall: weighted score (Materials 40%, Locations 40%, Transactions 20%)
- Output: JSON report with detailed breakdown
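The weighted score can be sketched directly from the definition above (the sample pass counts are invented for illustration):

```python
import json

WEIGHTS = {"materials": 0.4, "locations": 0.4, "transactions": 0.2}

def quality_report(results):
    """results maps table name -> (rows_passed, rows_total)."""
    per_table = {t: passed / total for t, (passed, total) in results.items()}
    overall = sum(per_table[t] * w for t, w in WEIGHTS.items())
    return {"per_table": per_table, "overall_score": round(overall, 4)}

report = quality_report({
    "materials": (950, 1000),
    "locations": (480, 500),
    "transactions": (4500, 5000),
})
print(json.dumps(report, indent=2))
```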

3.4 Transformation Layer (Minimal but Realistic)

Standardization:

- Date formats to ISO (YYYY-MM-DD)
- Units for weight/volume (e.g., all in KG, cubic meters)
- Text normalization (trim whitespace, uppercase codes)
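A sketch of the three standardization steps; the legacy date formats and unit factors are assumptions about what the source files contain:

```python
from datetime import datetime

DATE_FORMATS = ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d")   # formats assumed in legacy files

def to_iso_date(value):
    """Normalize any recognised legacy date format to ISO YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {value!r}")

def to_kg(value, unit):
    """Convert a weight to kilograms using a small factor table."""
    factors = {"KG": 1.0, "G": 0.001, "T": 1000.0}
    return float(value) * factors[unit.strip().upper()]

def normalize_code(code):
    """Trim whitespace and uppercase codes such as bins or plants."""
    return code.strip().upper()
```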

Deduplication:

- Duplicate materials (same ID, different description) → keep latest or flag
- Duplicate locations → merge or flag
- Duplicate transactions → flag for review
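The "keep latest or flag" strategy for materials might look like this (the CHANGED_ON column name is an assumption; ISO dates sort lexicographically, so string comparison suffices):

```python
def dedupe_latest(rows, key="MATERIAL_ID", date_field="CHANGED_ON"):
    """Keep the most recent row per key; return (clean_rows, flagged_keys)."""
    latest = {}
    flagged = []
    for row in rows:
        k = row[key]
        if k in latest:
            flagged.append(k)  # record the key for the manual-review report
            if row[date_field] > latest[k][date_field]:
                latest[k] = row
        else:
            latest[k] = row
    return list(latest.values()), flagged
```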

Data Enrichment:

- Add calculated fields (e.g., volume_per_unit, density)
- Flag records requiring manual review

Output:

- Cleaned tables: materials_clean.csv, locations_clean.csv, transactions_clean.csv
- A single to_sap/ folder that mimics what you would hand to LSMW
- Validation report: data_quality_report.json

3.5 Reporting/UI (MVP)

CLI Output:

- Summary per run:
  - Rows ingested, rows failed validation, quality score
  - Execution time
  - Top 5 failing rules & counts
- List of records requiring manual review
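The per-run summary above reduces to a small formatting function; the sample counts are invented, and `rule_hits` is an assumed intermediate (one entry per rule violation) produced by the validation layer:

```python
from collections import Counter

def run_summary(rows_in, rows_failed, rule_hits, seconds):
    """Format the per-run CLI summary, including the top failing rules."""
    lines = [
        f"Rows ingested:  {rows_in}",
        f"Rows failed:    {rows_failed}",
        f"Quality score:  {1 - rows_failed / rows_in:.1%}",
        f"Execution time: {seconds:.1f}s",
        "Top failing rules:",
    ]
    for rule, n in Counter(rule_hits).most_common(5):
        lines.append(f"  {rule}: {n}")
    return "\n".join(lines)

print(run_summary(5000, 250, ["missing_bin"] * 180 + ["bad_date"] * 70, 12.3))
```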

HTML/Markdown Report:

- Overall quality scorecard
- Table-by-table status (OK / WARN / FAIL)
- Visual charts (bar charts for validation results)
- Recommendations for fixing common issues
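Jinja2 is the planned templating engine; a dependency-free sketch of the scorecard page using the standard library's string.Template (markup and field names are illustrative):

```python
from string import Template

PAGE = Template("""<html><body>
<h1>Data Quality Scorecard</h1>
<p>Overall score: $overall</p>
<table>
$rows
</table>
</body></html>""")

def render_report(overall, table_status):
    """Render the scorecard; table_status maps table name -> OK/WARN/FAIL."""
    rows = "\n".join(
        f"<tr><td>{name}</td><td>{status}</td></tr>"
        for name, status in table_status.items()
    )
    return PAGE.substitute(overall=f"{overall:.1%}", rows=rows)
```

With Jinja2 the template would move into its own file, but the data contract (overall score plus a per-table status map) stays the same.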

Optional (Nice-to-Have):

- Minimal Streamlit page with:
  - Overall score dashboard
  - Table-by-table status (OK / WARN / FAIL)
  - Interactive drill-down into failing records


4. Out-of-Scope for MVP


5. Tech Stack (MVP)

Core:

- Python 3.9+
- Pandas / Polars (for data manipulation)
- Great Expectations (for validation framework)
- Pydantic (for data validation)
- PyYAML (for configuration)

SQL & Database:

- SQL (for data extraction and validation queries)
- Direct SQL queries mirroring SAP Open SQL patterns
- Understanding of EWM data structures (materials, storage bins, handling units)
- Optional: DuckDB for faster querying
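The referential-integrity rule ("no transaction references a non-existent material") is a natural SQL anti-join; this sketch uses the standard library's sqlite3 for self-containment, though DuckDB would run the same query (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE materials (material_id TEXT PRIMARY KEY);
CREATE TABLE transactions (tx_id INTEGER, material_id TEXT, qty REAL);
INSERT INTO materials VALUES ('M1'), ('M2');
INSERT INTO transactions VALUES (1, 'M1', 10), (2, 'M9', 5);
""")

# Orphan check: transactions whose material is missing from the master.
orphans = conn.execute("""
    SELECT t.tx_id, t.material_id
    FROM transactions t
    LEFT JOIN materials m ON m.material_id = t.material_id
    WHERE m.material_id IS NULL
""").fetchall()
print(orphans)
```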

SAP-EWM Alignment:

- Data structures aligned with SAP EWM master data (materials, storage types, bins)
- Validation rules mirroring EWM business logic
- Output format compatible with LSMW (Legacy System Migration Workbench)
- Understanding of IDoc structures for future integration

Storage:

- Local CSV/Parquet files

Interface:

- CLI (using Click or argparse)
- HTML report generation (using Jinja2 templates)
- Optional: Streamlit v1 screen

Testing:

- pytest for unit tests
- Sample test datasets with known issues
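A pytest-style test against a known-bad sample might look like this (the rule function, sample rows, and file name are hypothetical):

```python
# test_validation.py -- pytest collects and runs functions named test_*
def non_positive_qty(rows):
    """Return rows violating the positive-quantity rule."""
    return [r for r in rows if float(r["QTY"]) <= 0]

SAMPLE = [
    {"TX": "1", "QTY": "10"},
    {"TX": "2", "QTY": "-3"},   # deliberately bad record
]

def test_flags_non_positive_quantities():
    bad = non_positive_qty(SAMPLE)
    assert len(bad) == 1
    assert bad[0]["TX"] == "2"
```

Seeding the sample datasets with known, documented defects lets each validation rule be asserted against an expected failure count rather than tested only on clean data.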


6. Success Criteria (MVP)

Functional Requirements:

Performance Requirements:

Interview Readiness:


7. MVP Deliverables

  1. Source Code:
  2. Documentation:
  3. Sample Data:
  4. Outputs:

8. Development Timeline (MVP)

Week 1-2: Foundation

- Set up project structure
- Implement basic ingestion layer
- Create sample datasets with known issues

Week 3-4: Validation

- Implement Great Expectations suite
- Create business rules
- Build quality scoring logic

Week 4-5: Transformation

- Implement data cleaning logic
- Create SAP-ready output format
- Build reporting module

Week 6: Polish & Documentation

- Add CLI interface
- Create HTML reports
- Write documentation
- Prepare demo


9. SAP-EWM Technology Complementarity

How Python Skills Complement SAP Technologies:

While ABAP is the core language for SAP EWM customization (BAdIs, enhancements, custom programs), this Python-based data quality pipeline demonstrates complementary skills:

  1. SQL Proficiency:
  2. Data Structure Understanding:
  3. Integration Readiness:
  4. Analytics & Problem-Solving:

Key Differentiator: Most SAP consultants can configure EWM; fewer understand why implementations fail. This project demonstrates enterprise thinking about data governance and quality—critical for successful SAP projects.


10. Key Interview Talking Points

  1. Addresses Real Pain Point:
  2. Enterprise-Ready:
  3. SAP Technology Alignment:
  4. Scalable Architecture:
  5. Business Impact:

11. Future Enhancements (Post-MVP)