Clean the Real Café Data

Download the raw POS data, apply each cleaning tool, and create your analysis-ready dataset.

Starting Point

Download the raw café weekend sales data. This data has realistic problems:

Goal: Produce a clean dataset ready for statistical analysis.

Build Sequence

Open the CSV in Excel (or import if using Google Sheets)
Copy the raw sheet → Rename "Raw Data" and "Clean Data"
Clean in order:
- Select Product column → Data → Text to Columns (delimited) if needed
- Use Find & Replace: Find "$", Replace with nothing
- Add a helper column for TRIMmed product names
- Select all columns → Data → Remove Duplicates → Note the count
- Filter and identify blank categories → Fill or flag
Calculate before/after row counts for your audit trail
Add documentation in a new sheet: what you cleaned, how many rows affected

Reference: What a Clean Dataset Looks Like

After cleaning, your data should have:

No leading/trailing spaces in any text column
Prices as numbers (no $ symbols)
No duplicate rows
Consistent date format (Excel date type, not text)
No blank categories (filled or marked as "Unknown")
Product names with proper capitalization (e.g., "Latte", not "latte" or "LATTE")

Verification Checkpoints

After Remove Duplicates: Row count reduced (note the difference)

All price values are numeric (no $ signs)

TRIM formula or cleaned values in product name column

No blank cells in Category column (filled or "Unknown")

Documentation sheet created with cleaning summary

Definition of Done

Your cleaned dataset is complete when:

Phase Navigation