K-최근접 이웃 알고리즘

개요[편집 / 원본 편집]

K-최근접 이웃 알고리즘이란 특정공간내에서 입력과 제일 근접한 K개의 요소를 찾아, 더 많이 일치하는 것으로 분류하는 알고리즘이다. 지도 학습(Supervised Learning)의 한 종류로 레이블이 있는 데이터를 사용하여 분류 작업을 한다.

데이터로부터 거리가 가까운 K개의 다른 데이터의 레이블을 참조하여 분류한다. 주로 거리를 측정할 때 유클리디안 거리 계산법을 사용하여 거리를 측정하는데, 벡터의 크기가 커지면 계산이 복잡해진다.

패턴 인식에서 Classification(분류) 혹은 Regression(회귀)에 사용되는 Non-Parametric Statistics(비모수 통계) 방식 알고리즘이다. ~~K-방역~~

비모수 통계[편집 / 원본 편집]

모수에 대한 가정을 전제로 하지 않고, 모집단의 형태에 관계없이 주어진 데이터에서 직접 확률을 계산하여 통계학적 검정을 수행하는 분석법이다. 표본수가 적고, 모집단의 분포를 모르는 경우에 적용할 수 있는 통계 기법이다.

원리[편집 / 원본 편집]

새로 입력된 데이터와 기존 데이터 사이의 거리를 계산하고, 새로 입력된 데이터는 가장 가까운 거리에 있는 k개의 데이터들 중 대다수의 특성을 따라간다.

이때 새로운 데이터와 기존 데이터 사이의 거리를 계산하는 방법은 다음과 같다.

Euclidean Distance (유클리드 거리) : 직교좌표계에서 두 점 사이의 거리
Hamming Distance (해밍 거리, 중첩 거리) : 같은 길이의 두 문자열에서 같은 위치에서 서로 다른 Symbol들의 개수

특징[편집 / 원본 편집]

데이터 기반 분석 방법이다.
- 어떤 점의 특성을 알고 싶다면 가까이 있는 점으로부터 특성 파악이 가능하다.
데이터의 분포를 신경쓰지 않는다.
회귀문제, 분류문제 등에 적용 가능하다.
- 회귀 문제
  - kNN의 값을 평균 내어 값 예측
- 분류 문제
  - kNN 중에 가장 많은 항목 선택

가까운 이웃이 여럿일 경우에는 거리에 따라 가중치를 주거나, 단독 1등이 나올 때까지 k를 하나씩 줄인다.
데이터가 균일하지 않을 경우, cut-off를 조절하여 사전 데이터가 적은 애들한테 가중치를 준다.

K값 선택하기[편집 / 원본 편집]

K의 값을 선택하는 것은 데이터마다 다르게 접근해야 한다. K가 높거나 낮을때 발생할수 있는 문제는 다음과 같다.

K가 낮다 → 적은 이웃 수로 판단한다 → 불안정한 결과 ~ 오버피칭
K가 높다 → 많은 이웃 수로 판단한다 → 지나친 일반화 ~ 언더피팅

따라서 가장 좋은 성능을 내는 값으로 선택해야 한다. K의 값을 1부터 증가시켜가며 각 점들에 대해 KNN으로 분류해보고 오류 계산 가장 오류가 적은 K값을 선택해야 한다.

데이터 간 거리 구하기[편집 / 원본 편집]

데이터 간 거리를 구하는 방식은 다음과 같다.

Eucledian Distance ( L2 )
Manhattan Distance ( L1 )
Cosine Distance
Hamming Distance ( 곱집합)
자카드거리 ( 집합 )
그외
- Standardized Euclidean Distance
- Mahalanobis Distance
- Correlation Distance

장단점[편집 / 원본 편집]

장점[편집 / 원본 편집]

알고리즘이 간단하여 구현하기 쉽고, 수치 기반 데이터 분류 작업에서 성능이 좋다. 또한 학습 단계가 필요가 없고, information loss가 없다.

단점[편집 / 원본 편집]

사전 계산을 할 수 없기 때문에 분류 속도가 느리기에 학습 데이터의 양이 많으면 분류 속도가 느려진다. 또한 차원의 크기가 커지면 계산량이 많아진다. 차원의 저주

모델 생성이 없어서 특징과 클래스간 관계 이해가 제한적이다. 또한 K거리와 갯수를 정해야 하고, 데이터 정규화를 진행해야한다.

예시 코드[편집 / 원본 편집]

C++[편집 / 원본 편집]

// C++ program to find groups of unknown
// Points using K nearest neighbour algorithm.
#include <algorithm>
#include <math>

using namespace std;
  
struct Point
{
    int val;     // Group of point
    double x, y;     // Co-ordinate of point
    double distance; // Distance from test point
};
  
// Used to sort an array of points by increasing
// order of distance
bool comparison(Point a, Point b)
{
    return (a.distance < b.distance);
}
  
// This function finds classification of point p using
// k nearest neighbour algorithm. It assumes only two
// groups and returns 0 if p belongs to group 0, else
// 1 (belongs to group 1).
int classifyAPoint(Point arr[], int n, int k, Point p)
{
    // Fill distances of all points from p
    for (int i = 0; i < n; i++)
        arr[i].distance =
            sqrt((arr[i].x - p.x) * (arr[i].x - p.x) +
                 (arr[i].y - p.y) * (arr[i].y - p.y));
  
    // Sort the Points by distance from p
    sort(arr, arr+n, comparison);
  
    // Now consider the first k elements and only
    // two groups
    int freq1 = 0;     // Frequency of group 0
    int freq2 = 0;     // Frequency of group 1
    for (int i = 0; i < k; i++)
    {
        if (arr[i].val == 0)
            freq1++;
        else if (arr[i].val == 1)
            freq2++;
    }
  
    return (freq1 > freq2 ? 0 : 1);
}

파이썬[편집 / 원본 편집]

# Survivor Prediction of Titanic using k-NN Algorithm (Scikit-Learn)

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
titanic_bunch = sns.load_dataset('titanic')
titanic_bunch = titanic_bunch.dropna(axis=0) # Drop the rows including NaN 
titanic_bunch['sex'] = titanic_bunch['sex'].replace({'female':0, 'male':1}) # Numericalization of Column 'sex'
titanic_bunch['embarked'] = titanic_bunch['embarked'].replace({'C':0, 'S':1, 'Q':2}) # Numericalization of Column 'embarked'
numeric_column = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']

# Data separation
X = titanic_bunch.loc[:, numeric_column]
y = titanic_bunch.loc[:, 'survived']

# Model Evaluatiuon (Finding the optimum k)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=100
)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)

k_range = range(1, 11)
train_score = []
test_score = []

for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    train_score.append(model.score(x_train, y_train))
    test_score.append(model.score(x_test, y_test))

# Visualization
plt.plot(k_range, train_score, label='Train Accuracy')
plt.plot(k_range, test_score, label='Test Accuracy')
plt.xticks(k_range)
plt.title('Find Best K-Value in titanic')
plt.legend()
plt.grid()
plt.show()

# Scoring
print('Training Score:', model.score(x_train, y_train))
print('Evaluation Score:', model.score(x_test, y_test))

사용한 자료[편집 / 원본 편집]

CC BY로 가져옴