PaEmotion - 소비 유형 예측 모델 설계 과정

1. 프로젝트 개요, 데이터 추출

목표 : 유저별 소비 유형 예측
- 데이터 정리, 유형 정리

2. 시행 착오들

단순한 피처
피처 보안, but 단순하고 작은 데이터셋

3. 최종 - 데이터 전처리 및 피처 구성

(1) 데이터 전처리

def preprocess(df):
    # 피처1 - 로그 변환한 소비 금액 -> log_spendCost
    df['log_spendCost'] = np.log1p(df['spendCost'])

    # budgets, actuals 문자열 → 리스트로 변환
    df['budgets'] = df['budgets'].apply(ast.literal_eval)
    df['actuals'] = df['actuals'].apply(ast.literal_eval)

    # 카테고리별 예산, 실제 소비 추출
    df['budget_for_category'] = df.apply(lambda row: row['budgets'][row['spendCategoryId'] - 1], axis=1)
    df['actual_for_category'] = df.apply(lambda row: row['actuals'][row['spendCategoryId'] - 1], axis=1)

    # 피처2 - 예산 대비 소비 비율 -> over_budget_ratio
    df['over_budget_ratio'] = df['spendCost'] / df['budget_for_category']

    # 피처3 - 유저별 감정 소비 비중 (max) -> max_emotion_ratio
    emotion_counts = df.groupby(['userId', 'emotionCategoryId']).size().unstack(fill_value=0)
    emotion_ratios = emotion_counts.div(emotion_counts.sum(axis=1), axis=0)
    df = df.merge(
        emotion_ratios.max(axis=1).rename('max_emotion_ratio').reset_index(),
        on='userId', how='left'
    )

    # 피처4 - 감정 엔트로피 계산 (diversity) -> emotion_entropy
    def calc_entropy(row):
        probs = row / row.sum()
        return entropy(probs, base=2)
    emotion_entropy = emotion_counts.apply(calc_entropy, axis=1).rename('emotion_entropy').reset_index()
    df = df.merge(emotion_entropy, on='userId', how='left')

    # 피처5 - 모임(9), 선물(11) 소비 비중 합산 -> meeting_gift_ratio
    category_counts = df.groupby(['userId', 'spendCategoryId']).size().unstack(fill_value=0)
    category_ratios = category_counts.div(category_counts.sum(axis=1), axis=0)
    category_ratios['meeting_gift_ratio'] = category_ratios.get(9, 0) + category_ratios.get(11, 0)
    df = df.merge(category_ratios['meeting_gift_ratio'].reset_index(), on='userId', how='left')

    # 피처6 - 소비 금액 표준편차 / 평균 비율 -> std_over_mean
    user_stats = df.groupby('userId')['spendCost'].agg(['mean', 'std']).reset_index()
    user_stats['std_over_mean'] = user_stats['std'] / user_stats['mean']
    df = df.merge(user_stats[['userId', 'std_over_mean']], on='userId', how='left')

    # 피처7 - 유저별 가장 많이 소비한 카테고리 비중 (편중성) -> max_cateogry_ratio
    cat_counts = df.groupby(['userId', 'spendCategoryId']).size().unstack(fill_value=0)
    cat_ratios = cat_counts.div(cat_counts.sum(axis=1), axis=0)
    df = df.merge(
        cat_ratios.max(axis=1).rename('max_category_ratio').reset_index(),
        on='userId', how='left'
    )

    return df

(2) 피처 구성

피처 이름	설명	도입 이유
log_spendCost	소비 금액의 로그 변환 값	큰 수 차이를 줄여서 안정적으로 학습하기 위해
over_budget_ratio	예산 대비 실제 소비 비율	예산보다 많이 썼는지 정도를 나타냄
max_emotion_ratio	유저별 가장 많이 사용하는 감정 유형 비중 (편중성)	유저의 감정 편중 정도
emotion_entropy	유저의 감정 분포 다양성	감정 소비가 얼마나 골고루 분포되었나
meeting_gift_ratio	모임(9), 선물(11) 소비 비중 합산	사교적 소비 경향
std_over_mean	소비 금액의 변동성 비율 (소비 금액 표준편차 / 평균 비율)	소비가 일정한지, 들쭉날쭉한지
max_category_ratio	유저별 가장 많이 소비한 카테고리 비중 (편중성)	유저의 소비 편중 정도
spendCategoryId	소비 카테고리 아이디	기본 피처
emotionCategoryId	감정 카테고리 아이디	기본 피처

4. 훈련 결과

코드

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, accuracy_score
# 3-1. 데이터 불러오기
train_df = pd.read_csv('/content/drive/MyDrive/PaEmotion/train_types_final_1.csv')
test_df = pd.read_csv('/content/drive/MyDrive/PaEmotion/train_types_final_2.csv')

# 3-2. 전처리 함수 호출
train_df = preprocess(train_df)
test_df = preprocess(test_df)

# 3-3. 사용할 피처명 리스트
features = ['emotionCategoryId', 'spendCategoryId', 'log_spendCost', 'over_budget_ratio',
            'max_emotion_ratio', 'emotion_entropy', 'meeting_gift_ratio', 'std_over_mean', 'max_category_ratio']

# 3-4. X, y 분리
X_train = train_df[features]
y_train = train_df['spendType']

X_test = test_df[features]
y_test = test_df['spendType']

# 3-5. 모델 학습
model = RandomForestClassifier(
    n_estimators=300,         # 더 많은 트리로 안정성↑
    max_depth=5,             # 너무 깊지 않게 과적합 방지
    min_samples_split=10,     # 가지 분기 조건 강화
    class_weight='balanced',  # 클래스 불균형 보정 (특히 5번!)
    random_state=42
)
model.fit(X_train, y_train)

# 3-6. 예측 및 평가
y_pred = model.predict(X_test)

from sklearn.metrics import classification_report, accuracy_score

# train 데이터 예측값
y_train_pred = model.predict(X_train)

print("Train classification report:")
print(classification_report(y_train, y_train_pred))
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))

print("\\nTest classification report:")
print(classification_report(y_test, y_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))

학습 결과

## 학습 데이터 분류 결과
Train classification report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00       600
           2       0.88      0.90      0.89       600
           3       1.00      1.00      1.00       600
           4       1.00      1.00      1.00       600
           5       0.90      0.88      0.89       600
           6       1.00      1.00      1.00       600

    accuracy                           0.96      3600
   macro avg       0.96      0.96      0.96      3600
weighted avg       0.96      0.96      0.96      3600

Train Accuracy: 0.9627777777777777

테스트 결과

## 테스트 데이터 분류 결과
Test classification report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00       600
           2       0.79      0.98      0.87       600
           3       1.00      0.92      0.96       600
           4       1.00      1.00      1.00       600
           5       0.98      0.82      0.89       600
           6       1.00      1.00      1.00       600

    accuracy                           0.95      3600
   macro avg       0.96      0.95      0.95      3600
weighted avg       0.96      0.95      0.95      3600

Test Accuracy: 0.9530555555555555

5. 결론

정확도 95.3%로 높은 예측 성능과 안정적인 일반화 능력을 보여주었다.

특히 유형 1, 4, 6은 대부분의 테스트셋에서 정확하게 분류되었으며, 상대적으로 분류가 어려운 유형 2와 5에 대해서도 후처리 룰 적용 없이도 일정 수준 이상의 성능을 달성하였다. 이는 주어진 피처들이 충분한 설명력을 가지고 있으며, 데이터의 구성 또한 모델이 일반화할 수 있을 만큼의 다양성을 포함하고 있음을 의미한다.

과적합 우려가 있었으나, 학습 데이터와 테스트 데이터를 완전히 분류하여 사용하여 데이터 누수 가능성은 존재하지 않는다. 또한 모델이 두 데이터에서 비슷한 성능을 출력하는 것을 보아 과적합의 가능성은 낮다.

최종적으로 본 모델은 실제 서비스 적용이 가능한 수준의 성능을 확보하였으며, 다양한 테스트셋에 대한 성능 안정성도 확인되었기 때문에, 추후 후속 단계에 활용할 수 있을 것으로 판단된다.