
Python Data Analysis — EDA to ML Pipeline

Complete Python workflow from data loading and EDA through feature engineering to ML model selection and evaluation.

@promptall · Feb 25, 2026

You are an expert Python data scientist. Help me analyze the following dataset:

**Dataset:** [FILENAME.csv]
**Shape:** [ROWS × COLS]
**Target:** [COLUMN NAME]
**Goal:** [PREDICTION / CLUSTERING / ANOMALY DETECTION / EDA ONLY]

Provide complete Python code for:

### 1. Exploratory Data Analysis
```python
# Data loading, .info(), .describe(), missing values, distributions
```

### 2. Data Cleaning
- Missing value strategy (impute vs drop — explain choice)
- Outlier detection (IQR + visual)
- Type corrections and encoding

### 3. Feature Engineering
- Relevant derived features for this domain
- Correlation analysis

### 4. Visualizations (5 most informative)
```python
# seaborn / matplotlib — explain what each reveals
```

### 5. ML Model Selection
For the target type, recommend the top 3 models:
- Why each is appropriate
- Train/test split strategy
- Evaluation metrics (with explanation)
- Hyperparameter tuning approach (GridSearch/Optuna)

### 6. Full Pipeline
```python
# sklearn Pipeline integrating preprocessing + model
```

All code should be immediately runnable and commented.

Result Text

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

# 1. Load & Inspect
df = pd.read_csv('customer_churn.csv')
print(f"Shape: {df.shape}")
print(f"Missing:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
print(f"Class balance:\n{df['churn'].value_counts(normalize=True)}")

# 2. Outlier Detection (IQR)
Q1, Q3 = df['monthly_charges'].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df['monthly_charges'] < Q1 - 1.5*IQR) | (df['monthly_charges'] > Q3 + 1.5*IQR)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# 3. Feature Engineering
# include_lowest=True keeps tenure == 0 in the first bin instead of turning it into NaN
df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 72],
                            labels=['<1yr', '1-2yr', '2-4yr', '4+yr'],
                            include_lowest=True)
df['charges_per_month'] = df['total_charges'] / (df['tenure'] + 1)

# Recommended: XGBoost > Random Forest > Logistic Regression
# Metric: AUC-ROC + F1 (imbalanced classes)
```
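The result above imports `Pipeline`, `StandardScaler`, and `train_test_split` but stops before using them. The following is a minimal sketch of how the "Full Pipeline" step might look, not the prompt author's output. It makes several assumptions: the data is synthetic (generated by a hypothetical `make_synthetic_churn` helper, since `customer_churn.csv` is not available here), the `contract` column is invented for illustration, and logistic regression stands in for the recommended XGBoost so that the example needs only scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_synthetic_churn(n_rows=500, seed=0):
    """Hypothetical stand-in for customer_churn.csv (column names are assumptions)."""
    rng = np.random.default_rng(seed)
    tenure = rng.integers(0, 73, size=n_rows)
    monthly = rng.uniform(20, 120, size=n_rows)
    contract = rng.choice(["monthly", "yearly"], size=n_rows)
    # Churn is made more likely for short tenure, high charges, monthly contracts.
    logit = -0.05 * tenure + 0.02 * monthly + 1.0 * (contract == "monthly") - 1.5
    churn = (rng.random(n_rows) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({
        "tenure": tenure,
        "monthly_charges": monthly,
        "contract": contract,
        "churn": churn,
    })


def build_pipeline(numeric_cols, categorical_cols):
    """Preprocessing and model in one estimator, so the test split never leaks into fitting."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    # class_weight="balanced" is one simple way to account for imbalanced classes.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    return Pipeline([("preprocess", preprocess), ("model", model)])


def evaluate(df, target="churn"):
    """Stratified hold-out evaluation reporting AUC-ROC and F1."""
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )
    pipe = build_pipeline(["tenure", "monthly_charges"], ["contract"])
    pipe.fit(X_train, y_train)
    proba = pipe.predict_proba(X_test)[:, 1]
    preds = pipe.predict(X_test)
    return {"auc_roc": roc_auc_score(y_test, proba), "f1": f1_score(y_test, preds)}


scores = evaluate(make_synthetic_churn())
print(scores)
```

To try this on real data, replace `make_synthetic_churn()` with `pd.read_csv('customer_churn.csv')` and adjust the column lists passed to `build_pipeline`. Swapping in a tree-based model only requires changing the `"model"` step.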
