Lesson ProgressPhase 4 of 6
Phase 4Independent Practice
Independent Practice: Data Cleaning and Analysis

Build the cleaned dataset with verification checkpoints

Phase 4: Workbook Sprint

Clean the Real Café Data

Download the raw POS data, apply each cleaning tool, and create your analysis-ready dataset.

Starting Point

Download the raw café weekend sales data. This data has realistic problems:

/resources/unit04-pbl-cafe-weekend-ops-g1.csv
  • Inconsistent spacing in product names
  • Currency symbols in price column
  • Some duplicate transactions
  • Missing values in the category column
  • Dates in mixed formats

Goal: Produce a clean dataset ready for statistical analysis.

Build Sequence
  1. Open the CSV in Excel (or import if using Google Sheets)
  2. Copy the raw sheet → Rename "Raw Data" and "Clean Data"
  3. Clean in order:
    • Select Product column → Data → Text to Columns (delimited) if needed
    • Use Find & Replace: Find "$", Replace with nothing
    • Add a helper column for TRIMmed product names
    • Select all columns → Data → Remove Duplicates → Note the count
    • Filter and identify blank categories → Fill or flag
  4. Calculate before/after row counts for your audit trail
  5. Add documentation in a new sheet: what you cleaned, how many rows affected
Reference: What a Clean Dataset Looks Like

After cleaning, your data should have:

  • No leading/trailing spaces in any text column
  • Prices as numbers (no $ symbols)
  • No duplicate rows
  • Consistent date format (Excel date type, not text)
  • No blank categories (filled or marked as "Unknown")
  • Product names with proper capitalization (e.g., "Latte", not "latte" or "LATTE")
Verification Checkpoints
After Remove Duplicates: Row count reduced (note the difference)
All price values are numeric (no $ signs)
TRIM formula or cleaned values in product name column
No blank cells in Category column (filled or "Unknown")
Documentation sheet created with cleaning summary
Definition of Done

Your cleaned dataset is complete when:

  • All text columns have consistent spacing (no " Latte ")
  • Prices are numbers, not text with $ signs
  • Duplicate rows removed (before/after counts documented)
  • Category column has no blanks
  • You can sort by any column without errors
  • You can run basic calculations (SUM, AVERAGE) on numeric columns
  • A documentation sheet explains what was cleaned