Project Priority: 🥇 MOST IMPORTANT
Version: 1.0 MVP
Status: Ready for Development
Build a lightweight but real data-quality and ETL framework that ingests messy warehouse data from multiple sources, validates it against business rules, and outputs SAP-EWM-ready flat files (or tables) with a simple quality score. The MVP should convincingly show how this prevents classic SAP implementation failures caused by bad master/transaction data.
Why This Matters: “Neglecting Data Quality and Underestimating Migration Challenges” is the #8 reason SAP implementations fail. “Poor Master Data Quality” is specifically listed as mistake #2 in EWM implementations. This project addresses the pain point that causes 80% of project overruns.
At least 3 independent sources:
- materials.csv – SKU master (ID, description, weight, volume, unit)
- locations.csv – bin/location master (warehouse, storage type, bin)
- transactions.csv – inbound/outbound warehouse movements
- Optional: external weather.csv to demonstrate external feed integration
Data Requirements:
- Minimum 1,000 materials
- Minimum 500 storage locations
- Minimum 5,000 transaction records
- Include realistic data quality issues: duplicates, missing values, invalid formats, referential integrity violations
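A hypothetical generator for a messy materials sample is sketched below; the column names, sizes, and kinds of injected issues are illustrative assumptions, not a fixed specification.

```python
# Hypothetical sample-data generator; column names and issue counts are assumptions.
from pathlib import Path
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
materials = pd.DataFrame({
    "MATERIAL_ID": [f"MAT{100000 + i}" for i in range(n)],
    "DESCRIPTION": [f"Material {i}" for i in range(n)],
    "WEIGHT_KG": rng.uniform(0.1, 50.0, n).round(3),
    "VOLUME_M3": rng.uniform(0.001, 2.0, n).round(4),
    "UNIT": rng.choice(["EA", "KG", "PAL"], n),
})

# Inject realistic quality issues: missing values, invalid negatives, duplicate IDs
materials.loc[rng.choice(n, 30, replace=False), "DESCRIPTION"] = None
materials.loc[rng.choice(n, 10, replace=False), "WEIGHT_KG"] *= -1
dupes = materials.sample(20, random_state=1).assign(DESCRIPTION="Duplicate entry")
materials = pd.concat([materials, dupes], ignore_index=True)

Path("sample_data").mkdir(exist_ok=True)
materials.to_csv("sample_data/materials.csv", index=False)
```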
Python scripts to:
- Load CSVs (simulate legacy WMS/ERP files)
- Handle encoding issues, missing columns, and schema changes (e.g., extra column)
- Log ingestion metadata (rows read, errors encountered, timestamps)
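A minimal sketch of one ingestion helper, assuming the expected columns come from config_sources.yaml; the encoding fallback and logged fields are illustrative choices.

```python
# Sketch of a tolerant CSV loader; expected_columns would come from the config file.
import logging
import time
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def load_csv(path: str, expected_columns: list[str]) -> pd.DataFrame:
    """Load a legacy CSV, tolerating encoding issues and schema drift."""
    start = time.time()
    try:
        df = pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        df = pd.read_csv(path, encoding="latin-1")  # common legacy-export fallback

    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    log.info("%s: %d rows, missing=%s, extra=%s, %.2fs",
             path, len(df), missing, extra, time.time() - start)
    return df
```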
Configuration:
- One configuration file (config_sources.yaml) describing:
  - File path
  - Expected schema
  - Load frequency (daily vs ad-hoc)
  - Data source type (master vs transaction)
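A possible shape for config_sources.yaml; the key names and schemas below are assumptions, shown only to make the structure concrete.

```yaml
# Hypothetical structure for config_sources.yaml; key names are assumptions.
sources:
  materials:
    path: sample_data/materials.csv
    type: master            # master vs transaction
    load_frequency: daily   # daily vs ad-hoc
    expected_schema:
      MATERIAL_ID: string
      DESCRIPTION: string
      WEIGHT_KG: float
      VOLUME_M3: float
      UNIT: string
  transactions:
    path: sample_data/transactions.csv
    type: transaction
    load_frequency: daily
    expected_schema:
      TRANSACTION_ID: string
      MATERIAL_ID: string
      STORAGE_BIN: string
      QUANTITY: float
      MOVEMENT_DATE: date
```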
Use Great Expectations as the main quality engine:
Schema Checks:
- Mandatory columns present (e.g., MATERIAL_ID, PLANT, STORAGE_BIN)
- Data types: numeric vs string vs date
- Column count matches the expected schema
Business Rules (Examples):
- Material IDs are unique and non-null
- Storage bins follow a pattern (e.g., A-01-01 format)
- No transaction references a non-existent material or bin
- Quantities are positive for inbound/outbound movements; zero is allowed only for adjustments
- Dates are within a valid range (no future dates for historical data)
- Weight/volume values are positive numbers
- Plant codes match the expected list
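A minimal sketch of a few of these rules, assuming the classic (pre-1.0) Great Expectations PandasDataset API; the column names, bin-pattern regex, and plant codes are illustrative assumptions.

```python
# Sketch using the classic great_expectations PandasDataset API (pre-1.0);
# column names, regex, and plant codes are assumptions.
import great_expectations as ge
import pandas as pd

materials = ge.from_pandas(pd.read_csv("sample_data/materials.csv"))
materials.expect_column_values_to_not_be_null("MATERIAL_ID")
materials.expect_column_values_to_be_unique("MATERIAL_ID")
materials.expect_column_values_to_be_between("WEIGHT_KG", min_value=0)

locations = ge.from_pandas(pd.read_csv("sample_data/locations.csv"))
locations.expect_column_values_to_match_regex("STORAGE_BIN", r"^[A-Z]-\d{2}-\d{2}$")
locations.expect_column_values_to_be_in_set("PLANT", ["1000", "2000"])

result = materials.validate()                       # per-table validation result
print(result.statistics["success_percent"])        # feeds the quality score
```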
Quality Score Calculation:
- Per table: % of rows passing all tests
- Overall: weighted score (Materials 40%, Locations 40%, Transactions 20%)
- Output: JSON report with detailed breakdown
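A small sketch of the weighted overall score using the weights stated above; the example per-table scores and the JSON layout are illustrative assumptions.

```python
# Sketch of the weighted overall score; example values and JSON layout are assumptions.
import json
from pathlib import Path

def overall_score(per_table: dict[str, float]) -> float:
    """per_table maps table name to % of rows passing all tests (0-100)."""
    weights = {"materials": 0.4, "locations": 0.4, "transactions": 0.2}
    return round(sum(per_table[t] * w for t, w in weights.items()), 2)

scores = {"materials": 92.5, "locations": 88.0, "transactions": 97.1}  # example values
report = {"per_table": scores, "overall": overall_score(scores)}

Path("reports").mkdir(exist_ok=True)
with open("reports/data_quality_report.json", "w") as fh:
    json.dump(report, fh, indent=2)
```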
Standardization:
- Date formats to ISO (YYYY-MM-DD)
- Units for weight/volume (e.g., all weights in KG, all volumes in cubic meters)
- Text normalization (trim whitespace, uppercase codes)
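A sketch of the standardization step on the materials table; the column names (CREATED_ON, WEIGHT, WEIGHT_UNIT, MATERIAL_ID) are assumptions for illustration.

```python
# Standardization sketch; column names are assumptions.
import pandas as pd

def standardize_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Dates to ISO (YYYY-MM-DD)
    df["CREATED_ON"] = pd.to_datetime(df["CREATED_ON"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Unit conversion: grams to kilograms where the unit column says G
    grams = df["WEIGHT_UNIT"].str.upper().eq("G")
    df.loc[grams, "WEIGHT"] = df.loc[grams, "WEIGHT"] / 1000
    df.loc[grams, "WEIGHT_UNIT"] = "KG"
    # Text normalization: trim whitespace, uppercase codes
    df["MATERIAL_ID"] = df["MATERIAL_ID"].str.strip().str.upper()
    return df
```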
Deduplication:
- Duplicate materials (same ID, different description) → keep latest or flag
- Duplicate locations → merge or flag
- Duplicate transactions → flag for review
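A sketch of the "keep latest or flag" rule for materials, assuming a CREATED_ON column exists to order versions.

```python
# Deduplication sketch: flag all duplicate IDs, then keep the latest record per ID.
import pandas as pd

def dedupe_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["IS_DUPLICATE"] = df.duplicated(subset="MATERIAL_ID", keep=False)  # flag for review
    # Keep the latest row per MATERIAL_ID (assumes a CREATED_ON column)
    return (df.sort_values("CREATED_ON")
              .drop_duplicates(subset="MATERIAL_ID", keep="last"))
```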
Data Enrichment:
- Add calculated fields (e.g., volume_per_unit, density)
- Flag records requiring manual review
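A sketch of the enrichment step; the density threshold used to trigger a manual-review flag is an illustrative assumption.

```python
# Enrichment sketch: derived fields plus a manual-review flag; threshold is illustrative.
import numpy as np
import pandas as pd

def enrich_materials(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["DENSITY_KG_M3"] = df["WEIGHT_KG"] / df["VOLUME_M3"].replace(0, np.nan)
    df["NEEDS_REVIEW"] = df["DENSITY_KG_M3"].isna() | (df["DENSITY_KG_M3"] > 10_000)
    return df
```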
Output:
- Cleaned tables: materials_clean.csv, locations_clean.csv, transactions_clean.csv
- A single to_sap/ folder that mimics what you would hand to LSMW
- Validation report: data_quality_report.json
CLI Output:
- Summary per run:
  - Rows ingested, rows failed validation, quality score
  - Execution time
  - Top 5 failing rules & counts
  - List of records requiring manual review
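A sketch of the CLI summary command, assuming Click (listed under Interface below) and the JSON report written by the scoring step; the key names are assumptions.

```python
# CLI sketch using Click; report path and key names are assumptions.
import json
import click

@click.command()
@click.option("--report", default="reports/data_quality_report.json", show_default=True)
def summary(report: str) -> None:
    """Print the per-run summary from the latest quality report."""
    with open(report) as fh:
        data = json.load(fh)
    click.echo(f"Overall quality score: {data['overall']}%")
    for table, score in data["per_table"].items():
        click.echo(f"  {table:<14} {score}%")

if __name__ == "__main__":
    summary()
```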
HTML/Markdown Report:
- Overall quality scorecard
- Table-by-table status (OK / WARN / FAIL)
- Visual charts (bar charts for validation results)
- Recommendations for fixing common issues
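A sketch of the scorecard rendering with Jinja2 (listed under Interface below); the inline template, example scores, and OK/WARN/FAIL thresholds are illustrative assumptions.

```python
# Jinja2 rendering sketch; template, scores, and status thresholds are assumptions.
from pathlib import Path
from jinja2 import Template

TEMPLATE = Template("""
<h1>Data Quality Scorecard</h1>
<p>Overall score: {{ overall }}%</p>
<table>
  <tr><th>Table</th><th>Score</th><th>Status</th></tr>
  {% for row in rows %}
  <tr><td>{{ row.name }}</td><td>{{ row.score }}%</td><td>{{ row.status }}</td></tr>
  {% endfor %}
</table>
""")

scores = {"materials": 92.5, "locations": 88.0, "transactions": 97.1}
rows = [
    {"name": t, "score": s, "status": "OK" if s >= 95 else "WARN" if s >= 80 else "FAIL"}
    for t, s in scores.items()
]

Path("reports").mkdir(exist_ok=True)
with open("reports/data_quality_report.html", "w", encoding="utf-8") as fh:
    fh.write(TEMPLATE.render(overall=91.2, rows=rows))
```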
Optional (Nice-to-Have):
- Minimal Streamlit page with:
  - Overall score dashboard
  - Table-by-table status (OK / WARN / FAIL)
  - Interactive drill-down into failing records
Core:
- Python 3.9+
- Pandas / Polars (for data manipulation)
- Great Expectations (for validation framework)
- Pydantic (for data validation)
- PyYAML (for configuration)
SQL & Database:
- SQL (for data extraction and validation queries)
- Direct SQL queries mirroring SAP Open SQL patterns
- Understanding of EWM data structures (materials, storage bins, handling units)
- Optional: DuckDB for faster querying
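A sketch of a referential-integrity check expressed in SQL over the raw CSVs, using the optional DuckDB backend; file paths and the MATERIAL_ID column name are assumptions.

```python
# DuckDB sketch: find transactions that reference a material missing from the master.
import duckdb

orphans = duckdb.sql("""
    SELECT t.*
    FROM 'sample_data/transactions.csv' AS t
    LEFT JOIN 'sample_data/materials.csv' AS m USING (MATERIAL_ID)
    WHERE m.MATERIAL_ID IS NULL
""").df()
print(f"{len(orphans)} transactions reference a non-existent material")
```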
SAP-EWM Alignment:
- Data structures aligned with SAP EWM master data (materials, storage types, bins)
- Validation rules mirroring EWM business logic
- Output format compatible with LSMW (Legacy System Migration Workbench)
- Understanding of IDoc structures for future integration
Storage:
- Local CSV/Parquet files
- Optional: DuckDB for faster querying
Interface:
- CLI (using Click or argparse)
- HTML report generation (using Jinja2 templates)
- Optional: Streamlit v1 screen
Testing:
- pytest for unit tests
- Sample test datasets with known issues
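A sketch of one unit test against a tiny in-memory fixture, assuming the dedupe_materials helper sketched earlier lives in a hypothetical transformation/cleaning.py module.

```python
# pytest sketch; the import path transformation.cleaning is a hypothetical layout.
import pandas as pd
from transformation.cleaning import dedupe_materials

def test_dedupe_keeps_latest_record():
    df = pd.DataFrame({
        "MATERIAL_ID": ["MAT1", "MAT1", "MAT2"],
        "DESCRIPTION": ["old", "new", "other"],
        "CREATED_ON": ["2023-01-01", "2024-01-01", "2024-01-01"],
    })
    cleaned = dedupe_materials(df)
    assert len(cleaned) == 2
    assert cleaned.loc[cleaned["MATERIAL_ID"] == "MAT1", "DESCRIPTION"].item() == "new"
```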
The pipeline runs end-to-end with python pipeline_run.py and delivers:
- pipeline_run.py - Main orchestrator
- ingestion/ - Data loading modules
- validation/ - Great Expectations suite
- transformation/ - Data cleaning logic
- config_sources.yaml - Configuration file
- requirements.txt - Dependencies
- sample_data/materials.csv
- sample_data/locations.csv
- sample_data/transactions.csv
- sample_data/weather.csv (optional)
- to_sap/ folder with cleaned files
- reports/data_quality_report.html
- reports/data_quality_report.json

Week 1-2: Foundation
- Set up project structure
- Implement basic ingestion layer
- Create sample datasets with known issues
Week 3-4: Validation
- Implement Great Expectations suite
- Create business rules
- Build quality scoring logic
Week 4-5: Transformation
- Implement data cleaning logic
- Create SAP-ready output format
- Build reporting module
Week 6: Polish & Documentation
- Add CLI interface
- Create HTML reports
- Write documentation
- Prepare demo
How Python Skills Complement SAP Technologies:
While ABAP is the core language for SAP EWM customization (BAdIs, enhancements, custom programs), this Python-based data quality pipeline demonstrates complementary skills.
Key Differentiator: Most SAP consultants can configure EWM; fewer understand why implementations fail. This project demonstrates enterprise thinking about data governance and quality—critical for successful SAP projects.