Python Data Analysis — EDA to ML Pipeline
Complete Python workflow from data loading and EDA through feature engineering to ML model selection and evaluation.
You are an expert Python data scientist. Help analyze:

**Dataset:** [FILENAME.csv]
**Shape:** [ROWS × COLS]
**Target:** [COLUMN NAME]
**Goal:** [PREDICTION / CLUSTERING / ANOMALY DETECTION / EDA ONLY]

Provide complete Python code for:

### 1. Exploratory Data Analysis
```python
# Data loading, .info(), .describe(), missing values, distributions
```

### 2. Data Cleaning
- Missing value strategy (impute vs drop; explain choice)
- Outlier detection (IQR + visual)
- Type corrections and encoding

### 3. Feature Engineering
- Relevant derived features for this domain
- Correlation analysis

### 4. Visualizations (5 most informative)
```python
# seaborn / matplotlib; explain what each reveals
```

### 5. ML Model Selection
For your target type, recommend top 3 models:
- Why each is appropriate
- Train/test split strategy
- Evaluation metrics (with explanation)
- Hyperparameter tuning approach (GridSearch/Optuna)

### 6. Full Pipeline
```python
# sklearn Pipeline integrating preprocessing + model
```

All code should be immediately runnable with comments.
Result Text
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

# 1. Load & Inspect
df = pd.read_csv('customer_churn.csv')
print(f"Shape: {df.shape}")
missing = df.isnull().sum()
print(f"Missing:\n{missing[missing > 0]}")
print(f"Class balance:\n{df['churn'].value_counts(normalize=True)}")

# 2. Outlier Detection (IQR)
Q1, Q3 = df['monthly_charges'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df['monthly_charges'] < lower) | (df['monthly_charges'] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# 3. Feature Engineering
# include_lowest=True keeps tenure == 0 in the first bin instead of yielding NaN
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 72],
                            labels=['<1yr', '1-2yr', '2-4yr', '4+yr'],
                            include_lowest=True)
# The +1 avoids division by zero for customers with tenure == 0
df['charges_per_month'] = df['total_charges'] / (df['tenure'] + 1)

# Recommended: XGBoost > Random Forest > Logistic Regression
# Metric: AUC-ROC + F1 (imbalanced classes)
```
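The prompt's step 6 asks for a full sklearn Pipeline, and the result names AUC-ROC and F1 as metrics without computing them. The sketch below shows one way those pieces might fit together. It is an illustration under assumptions, not the output the prompt produces: the data is synthetic (standing in for `customer_churn.csv`), the `contract` column is hypothetical, and Logistic Regression is used only because it is the one recommended model that ships with scikit-learn and needs no extra install.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for customer_churn.csv (assumption: real columns may differ).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "monthly_charges": rng.uniform(20, 120, size=n).round(2),
    # Hypothetical categorical column, included to show encoding.
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
})
df["total_charges"] = df["tenure"] * df["monthly_charges"]
# Make churn more likely for short tenure and high monthly charges.
logit = -0.05 * df["tenure"] + 0.02 * df["monthly_charges"] - 0.5
df["churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric_cols = ["tenure", "monthly_charges", "total_charges"]
categorical_cols = ["contract"]

# Impute and scale numeric columns; impute and one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# class_weight="balanced" reweights classes to offset imbalance.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

X = df[numeric_cols + categorical_cols]
y = df["churn"]
# Stratify so train and test keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of churn == 1
pred = model.predict(X_test)

auc = roc_auc_score(y_test, proba)
f1 = f1_score(y_test, pred)
print(f"AUC-ROC: {auc:.3f}")
print(f"F1: {f1:.3f}")
```

Keeping imputation, scaling, and encoding inside the Pipeline means they are fitted on the training split only, so no statistics from the test split leak into the model. Swapping in a different classifier should only require changing the `clf` step.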
Related Prompts
Code Reviewer — Security, Performance & Best Practices
Get thorough code reviews covering bugs, security vulnerabilities, performance bottlenecks, and clean refactoring.
REST API Designer — OpenAPI 3.0 Spec
Design RESTful APIs with proper resource modeling, status codes, auth strategy, and complete OpenAPI 3.0 documentation.
Unit & Integration Test Generator
Generate comprehensive test suites with edge cases, mocking strategies, and coverage for any function or API endpoint.