Statistics 7: Correlation Analysis

Statistics 2023. 5. 29. 22:11

1. Correlation analysis

Correlation analysis is a method for examining the relationship between two variables.

import seaborn as sns

# Load the example tips dataset that seaborn provides.
tips = sns.load_dataset('tips')

corr = tips[['total_bill', 'tip']].corr()
print(corr)

            total_bill       tip
total_bill    1.000000  0.675734
tip           0.675734  1.000000

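After loading seaborn's tips dataset, the corr function prints the correlation coefficients. The coefficient between total_bill and tip is about 0.68, a fairly strong positive linear relationship: higher bills tend to come with larger tips. Note that 0.68 measures the strength of the relationship, not the amount by which the tip rises per unit of the bill.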

sns.regplot(x='total_bill',y='tip',data=tips)

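regplot draws a scatter plot together with a fitted regression line, here showing the tip against the total bill. The points follow a roughly linear, upward trend, so the two variables are positively correlated.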
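The difference between the slope of a fitted line and the correlation coefficient can be checked numerically. The sketch below uses made-up bill and tip amounts (not the tips dataset) and np.polyfit as a stand-in for a least-squares line fit; it assumes regplot is used with its default linear fit.

```python
import numpy as np

# Hypothetical bill and tip amounts, for illustration only.
bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.0, 3.5, 3.0, 4.5, 5.0])

# Correlation coefficient: strength of the linear relationship (unitless).
r = np.corrcoef(bill, tip)[0, 1]

# Least-squares line: slope is the change in tip per unit of bill.
slope, intercept = np.polyfit(bill, tip, 1)

# The two are linked by the ratio of the sample standard deviations.
slope_from_r = r * tip.std(ddof=1) / bill.std(ddof=1)

print('r:', r)
print('slope:', slope)
print('slope from r:', slope_from_r)
```

The slope and the coefficient generally differ, and they coincide only when both variables have the same standard deviation.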

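2. Pearson correlation coefficient

The Pearson correlation coefficient ranges from -1 to 1. Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.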

corr = tips['total_bill'].corr(tips['tip'], method='pearson')

print('pearson correlation coefficient: ', corr)

pearson correlation coefficient:  0.6757341092113645

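Setting method to 'pearson' computes the Pearson correlation coefficient, which is about 0.676 here. This matches the earlier result because pearson is the default method of corr.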
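To see what the coefficient measures, it can be computed directly from its definition. The following is a minimal sketch on hypothetical sample values (not the tips data), compared against np.corrcoef.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from its mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of deviation products divided by the
# square root of the product of the summed squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in result for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]

print('manual:', r_manual)
print('numpy: ', r_numpy)
```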

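3. Covariance

Covariance describes how two variables vary together. It is the average of the products of the deviations of x and y from their means: it is positive when the variables move in the same direction and negative when they move in opposite directions.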

import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 6, 7]
# Rows and columns of the result follow the order x, y.
cov_matrix = np.cov(x, y)

print(cov_matrix)

[[2.5  3.25]
 [3.25 4.3 ]]

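np.cov returns the covariance matrix: 2.5 is the variance of x, 4.3 is the variance of y, and the off-diagonal value 3.25 is the covariance. By default np.cov divides by n - 1 (the sample covariance) rather than n.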
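The same covariance can be reproduced by hand, which also makes the choice of divisor visible. This sketch reuses the x and y lists above; the variable names are only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 6, 7])

# Product of the deviations from the mean, one value per pair.
products = (x - x.mean()) * (y - y.mean())

# Sample covariance divides by n - 1 (what np.cov does by default).
sample_cov = products.sum() / (len(x) - 1)

# Population covariance divides by n (np.cov with ddof=0).
population_cov = products.sum() / len(x)

print('sample:    ', sample_cov, np.cov(x, y)[0, 1])
print('population:', population_cov, np.cov(x, y, ddof=0)[0, 1])
```

The sample value comes out at about 3.25, matching the matrix printed above, while dividing by n gives a smaller value of about 2.6.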

covariance = tips['total_bill'].cov(tips['tip'])
print('covariance between total_bill and tip:', covariance)

covariance between total_bill and tip: 8.323501629224854

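The covariance between total_bill and tip in the tips data is about 8.32. Unlike the correlation coefficient, this value depends on the units of both variables, so its magnitude alone does not indicate the strength of the relationship.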
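The unit dependence of covariance can be illustrated with a small experiment. The sketch below uses hypothetical values and rescales one variable by 100 (for example, converting a currency unit); the covariance changes by the same factor while the correlation stays the same.

```python
import numpy as np

# Hypothetical values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 6.0, 7.0])

cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Express x in a unit that is 100 times smaller.
x_scaled = x * 100
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

# Correlation is covariance divided by both sample standard deviations.
corr_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print('covariance: ', cov_xy, '->', cov_scaled)
print('correlation:', corr_xy, '->', corr_scaled)
print('correlation from covariance:', corr_from_cov)
```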

4. Correlation

Relationships between variables can be linear or nonlinear: a linear relationship appears as a straight line and a nonlinear one as a curve.

x_linear = np.linspace(0,10,100)
y_linear = 2*x_linear +1
x_nonlinear = np.linspace(-10,10,100)
y_nonlinear = x_nonlinear ** 2

The linspace function generates evenly spaced values over the given interval.

import matplotlib.pyplot as plt

plt.scatter(x_linear, y_linear)
plt.title('linear relationship')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

plt.scatter(x_nonlinear, y_nonlinear)
plt.title('nonlinear relationship')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
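The two datasets also show a limitation of the Pearson coefficient: it only captures linear relationships. As a rough check, the sketch below computes the coefficient for both datasets defined above; the names r_linear and r_nonlinear are introduced here for illustration.

```python
import numpy as np

x_linear = np.linspace(0, 10, 100)
y_linear = 2 * x_linear + 1
x_nonlinear = np.linspace(-10, 10, 100)
y_nonlinear = x_nonlinear ** 2

# Perfect straight line: the coefficient is essentially 1.
r_linear = np.corrcoef(x_linear, y_linear)[0, 1]

# Symmetric parabola: y depends entirely on x, yet the
# coefficient is essentially 0 because the relation is not linear.
r_nonlinear = np.corrcoef(x_nonlinear, y_nonlinear)[0, 1]

print('linear:   ', r_linear)
print('nonlinear:', r_nonlinear)
```

A coefficient near zero therefore does not mean that the variables are unrelated, which is why plotting the data alongside the coefficient is useful.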

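5. Multiple correlation analysis

Multiple correlation analysis examines the correlation between two or more independent variables and a dependent variable. Here it is done by computing the pairwise correlation matrix of several columns.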

cols = ['survived','pclass','age','fare']

df= sns.load_dataset('titanic')[cols].dropna()

corr = df.corr()

print(corr)

          survived    pclass       age      fare
survived  1.000000 -0.359653 -0.077221  0.268189
pclass   -0.359653  1.000000 -0.369226 -0.554182
age      -0.077221 -0.369226  1.000000  0.096067
fare      0.268189 -0.554182  0.096067  1.000000
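One way to read such a matrix is to rank the other variables by the strength of their correlation with a chosen target. The sketch below rebuilds the matrix from the values printed above (so it runs without downloading the dataset) and ranks the columns against survived; treating survived as the target is an assumption made for illustration.

```python
import pandas as pd

labels = ['survived', 'pclass', 'age', 'fare']

# Correlation matrix copied from the printed output above.
corr = pd.DataFrame(
    [
        [1.000000, -0.359653, -0.077221, 0.268189],
        [-0.359653, 1.000000, -0.369226, -0.554182],
        [-0.077221, -0.369226, 1.000000, 0.096067],
        [0.268189, -0.554182, 0.096067, 1.000000],
    ],
    index=labels,
    columns=labels,
)

# Drop the self-correlation, then sort by absolute strength.
ranked = corr['survived'].drop('survived').abs().sort_values(ascending=False)
print(ranked)
```

With these values, pclass has the strongest (negative) association with survived, followed by fare, while age is close to zero.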

6. Time series correlation analysis

import random
import pandas as pd

dates = pd.date_range(start='2021-01-04', periods=100, freq='D')
samsung_prices = [random.randint(80000, 90000) for _ in range(100)]
samsung_data = {'Date':dates, '005930.KS':samsung_prices}
samsung_df = pd.DataFrame(samsung_data)
samsung_df.set_index('Date',inplace=True)

lg_prices = [random.randint(140000, 160000) for _ in range(100)]
lg_data = {'Date':dates, '066570.KS':lg_prices}
lg_df = pd.DataFrame(lg_data)
lg_df.set_index('Date',inplace=True)

df = pd.concat([samsung_df,lg_df],axis=1)
df = df.loc[:,['005930.KS','066570.KS']]
df.columns = ['Samsung','LG']

df.to_csv('../data/stock_price.csv')

The Samsung and LG prices are drawn at random from their own fixed ranges and saved to a CSV file.

df = pd.read_csv('../data/stock_price.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)

returns = df.pct_change()
print(returns)

corr_matrix = returns.corr()
print(corr_matrix)

          Samsung        LG
Samsung  1.000000  0.013868
LG       0.013868  1.000000

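After loading the CSV file and setting Date as the index, pct_change computes the fractional change between consecutive rows (multiply by 100 for a percentage). The correlation matrix of these returns is then printed.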
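What pct_change computes is easiest to see on a tiny series. The following sketch uses three hypothetical prices and compares the result with the equivalent manual calculation using shift.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
prices = pd.Series([100.0, 110.0, 99.0])

# Fractional change relative to the previous row; the first row is NaN.
returns = prices.pct_change()

# Equivalent manual form: current / previous - 1.
manual = prices / prices.shift(1) - 1

print(returns)
print(manual)
```

The first value is NaN because there is no previous row, and the remaining values are roughly 0.10 and -0.10, meaning a 10% rise followed by a 10% fall.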

sns.heatmap(corr_matrix, annot=True,cmap='coolwarm')
plt.title('stock returns correlation')
plt.show()

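Drawing the correlation matrix as a heatmap shows a coefficient of about 0.014, so the two return series are essentially uncorrelated. This is expected because the prices were generated independently at random; the exact value changes on every run unless a seed is set.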
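Because the prices are random, the coefficient differs between runs. If a repeatable result is wanted, one option is a seeded generator. The sketch below swaps random.randint for NumPy's default_rng, which is a substitution and not the method used above; the function name simulated_return_corr and the seed value are hypothetical.

```python
import numpy as np
import pandas as pd


def simulated_return_corr(seed):
    """Correlation of returns for two independently generated price series."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    prices = pd.DataFrame({
        # The upper bound of integers() is exclusive, hence the + 1.
        'Samsung': rng.integers(80000, 90001, size=100),
        'LG': rng.integers(140000, 160001, size=100),
    })
    returns = prices.pct_change().dropna()
    return returns['Samsung'].corr(returns['LG'])


first = simulated_return_corr(seed=0)
second = simulated_return_corr(seed=0)
print(first, second)  # identical, because the seed is the same
```

With independent series the coefficient should stay fairly close to zero for any seed, though it will rarely be exactly zero with only 100 rows.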

math_sc = [random.randint(0, 100) for _ in range(300)]
english_sc = [random.randint(0, 100) for _ in range(300)]
korean_sc = [random.randint(0, 100) for _ in range(300)]
grade = np.repeat([1, 2, 3], 100)

score_data={'grade':grade,'math':math_sc,'english':english_sc,'korean':korean_sc}
score_df = pd.DataFrame(score_data)


score_df.to_csv('../data/student_scores.csv')

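Math, English, and Korean scores are generated at random from 0 to 100, and np.repeat repeats each of the grades 1, 2, and 3 one hundred times. The result is combined into a DataFrame and saved to a CSV file.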
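The behavior of np.repeat is worth checking on a small input, since it is easy to confuse with np.tile. A short sketch, with the repeat count reduced to 2 for readability:

```python
import numpy as np

# repeat: each element is repeated in place.
repeated = np.repeat([1, 2, 3], 2)

# tile: the whole sequence is repeated.
tiled = np.tile([1, 2, 3], 2)

print(repeated)  # [1 1 2 2 3 3]
print(tiled)     # [1 2 3 1 2 3]
```

With np.repeat the rows for each grade stay contiguous, so the first 100 rows belong to grade 1, the next 100 to grade 2, and so on.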

df = pd.read_csv('../data/student_scores.csv', index_col = 0)

corr_matrix = df.corr()

sns.heatmap(corr_matrix, annot=True,cmap='coolwarm')
plt.title('student scores correlation')
plt.show()

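index_col=0 makes the first column of the file (the index written by to_csv) the DataFrame index instead of an extra data column. The correlation matrix is then drawn as a heatmap, which shows that the values are essentially uncorrelated with one another.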
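The effect of index_col=0 can be seen by writing a small DataFrame and reading it back both ways. This sketch uses an in-memory buffer instead of a file on disk, and the scores are hypothetical.

```python
import io
import pandas as pd

# Hypothetical scores, for illustration only.
df = pd.DataFrame({'math': [90, 75], 'english': [60, 85]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
df.to_csv(buffer)

# Without index_col, the saved index comes back as a data column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With index_col=0, the first column becomes the index again.
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)

print(list(plain.columns))    # ['Unnamed: 0', 'math', 'english']
print(list(indexed.columns))  # ['math', 'english']
```

Without index_col=0, the stray 'Unnamed: 0' column would also show up as an extra row and column in the correlation matrix.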

def hist(scores, name):
    # Draw a histogram of one subject's scores.
    plt.hist(scores, bins=10, alpha=0.8)
    plt.xlabel('score')
    plt.ylabel('frequency')
    plt.title('{} scores distribution'.format(name))
    plt.show()

# Call the function once per subject, outside the function body.
for subject in ['math', 'korean', 'english']:
    hist(df[subject], subject)

A function that draws a histogram is defined and then used to plot the score distributions for math, Korean, and English.
