1. Example data

This session uses the data frame mydf3 created in the previous session. The commands below, written last time, read in the data, preprocess it, and compute the subscale scores.

library(readxl)
library(dplyr)

mydf = as.data.frame(read_excel(path = '/cloud/project/mydata.xlsx'))

names(mydf) = c('agree', 'age', 'sex', 'edu', 'marital',
                  paste0('bpns', 1:18),
                  paste0('ctq', 1:10)) 

mydf3 = mydf %>% filter(bpns14<=5 & ctq7<=4 & ctq9<=4 & ctq10<=4) %>% 
  filter(rowSums(is.na(.))==0) %>%
  mutate(bpns1r = 6 - bpns1,
         bpns2r = 6 - bpns2,
         bpns3r = 6 - bpns3,
         bpns6r = 6 - bpns6,
         bpns14r = 6 - bpns14,
         ctq6r = 5 - ctq6,
         ctq7r = 5 - ctq7,
         ctq8r = 5 - ctq8,
         ctq9r = 5 - ctq9,
         ctq10r = 5 - ctq10) %>% 
  mutate(autonomy = rowSums(select(.,bpns1r,bpns2r,bpns3r,bpns4,bpns5,bpns6r)),
         competence = rowSums(select(.,bpns7:bpns12)),
         related = rowSums(select(.,bpns13,bpns14r,bpns15:bpns18)),
         abuse = rowSums(select(.,ctq1:ctq5)),
         neglect = rowSums(select(.,ctq6r:ctq10r)))

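2. Regression analysis

With R's lm() function you can fit general linear models, a family that includes simple regression, multiple regression, the t-test, and analysis of variance (ANOVA).
This session covers only the case where the dependent variable is continuous. When the dependent variable is not normally distributed, for example binary data such as 'pass/fail' or count data such as 'number of drinking occasions in the past month', the glm() function can be used instead.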
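
As a rough illustration of the glm() alternative mentioned above, the sketch below fits a logistic regression to a binary outcome and a Poisson regression to a count outcome. The data are simulated, and every variable name (hours, pass, drinks) is hypothetical and not part of mydf3.

```r
# Simulated data; all variable names here are hypothetical.
set.seed(1)
n = 200
hours = runif(n, 0, 10)

# Binary outcome (1 = pass, 0 = fail): logistic regression
pass = rbinom(n, size = 1, prob = plogis(-2 + 0.5 * hours))
fit_logit = glm(pass ~ hours, family = binomial)

# Count outcome: Poisson regression
drinks = rpois(n, lambda = exp(0.2 + 0.1 * hours))
fit_pois = glm(drinks ~ hours, family = poisson)

summary(fit_logit)
summary(fit_pois)
```

The formula syntax is the same as in lm(); only the family argument changes to match the distribution of the dependent variable.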

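2-1. Using the `lm()` function

Suppose we want to run a simple regression on mydf3 with childhood emotional neglect (neglect) as the independent variable and autonomy (autonomy) as the dependent variable. The lm() function produces the result as follows.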

m1 = lm(formula = autonomy ~ neglect, data = mydf3)
summary(m1)

# Call:
#   lm(formula = autonomy ~ neglect, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.6511  -2.4828  -0.0373   2.4875  11.1212 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 24.42346    0.49533  49.307  < 2e-16 ***
# neglect     -0.27724    0.04766  -5.817 1.51e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.827 on 305 degrees of freedom
# Multiple R-squared:  0.09988,	Adjusted R-squared:  0.09693 
# F-statistic: 33.84 on 1 and 305 DF,  p-value: 1.51e-08

To run an analysis with lm(), you basically need to supply the formula and data arguments.
- The formula argument specifies the model to fit.
- The data argument specifies the data to analyze, typically as a data frame.
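
To make the roles of the two arguments concrete, here is a minimal sketch on simulated data (the names d, x, and y are hypothetical). It also checks two textbook identities for simple regression: the slope equals cov(x, y) / var(x), and R-squared equals the squared correlation.

```r
# Simulated data; d, x, and y are hypothetical names.
set.seed(1)
x = rnorm(100)
y = 2 + 0.5 * x + rnorm(100)
d = data.frame(x = x, y = y)

f = y ~ x                      # a formula is an ordinary R object
m = lm(formula = f, data = d)  # variables in f are looked up in d

coef(m)                        # intercept and slope

# In simple regression the slope is cov(x, y) / var(x)
slope_by_hand = cov(d$x, d$y) / var(d$x)
all.equal(unname(coef(m)["x"]), slope_by_hand)

# ... and R-squared is the squared correlation
r2 = summary(m)$r.squared
all.equal(r2, cor(d$x, d$y)^2)
```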

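<aside> 📎 How to write the formula argument in lm()

As the command above shows, the formula puts the dependent variable first and the independent variables after it, with ~ separating the two sides.
To run a multiple regression that enters emotional abuse (abuse) and age (age) as independent variables in addition to emotional neglect (neglect), add them to the model with the + operator, as follows.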

m2 = lm(formula = autonomy ~ neglect + abuse + age, data = mydf3)
summary(m2)

# Call:
#   lm(formula = autonomy ~ neglect + abuse + age, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.9453  -2.4717  -0.0349   2.5277  10.8665 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 25.49669    1.38454  18.415  < 2e-16 ***
# neglect     -0.25775    0.05395  -4.778 2.77e-06 ***
# abuse       -0.05737    0.08027  -0.715    0.475    
# age         -0.02710    0.04045  -0.670    0.503    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.834 on 303 degrees of freedom
# Multiple R-squared:  0.1024,	Adjusted R-squared:  0.09354 
# F-statistic: 11.53 on 3 and 303 DF,  p-value: 3.556e-07

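To enter every variable in the data frame except the dependent variable as an independent variable, use the . symbol.
- For example, create a data frame mydf4 that contains only the four variables autonomy, neglect, abuse, and age. A multiple regression on mydf4 with autonomy as the dependent variable and all remaining variables as independent variables can then be run with the following commands.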

mydf4 = mydf3 %>% select(autonomy, neglect, abuse, age)

m3 = lm(formula = autonomy ~ ., data = mydf4)
summary(m3)

# Call:
#   lm(formula = autonomy ~ ., data = mydf4)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.9453  -2.4717  -0.0349   2.5277  10.8665 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 25.49669    1.38454  18.415  < 2e-16 ***
# neglect     -0.25775    0.05395  -4.778 2.77e-06 ***
# abuse       -0.05737    0.08027  -0.715    0.475    
# age         -0.02710    0.04045  -0.670    0.503    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.834 on 303 degrees of freedom
# Multiple R-squared:  0.1024,	Adjusted R-squared:  0.09354 
# F-statistic: 11.53 on 3 and 303 DF,  p-value: 3.556e-07
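
The output above matches m2 exactly, because . simply expands to the remaining columns of the data frame. A small sketch on simulated data (hypothetical names d, y, x1, x2, x3) checks the same equivalence:

```r
# Simulated data; all names are hypothetical.
set.seed(2)
d = data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

m_dot      = lm(y ~ ., data = d)             # every column except y
m_explicit = lm(y ~ x1 + x2 + x3, data = d)  # the same model written out

names(coef(m_dot))
all.equal(coef(m_dot), coef(m_explicit))
```

Because . picks up every other column, it is safest on a data frame that has been trimmed to the model variables, as mydf4 was.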

To enter all variables except the dependent variable but leave out particular independent variables, use the - operator.
- For example, to exclude the age variable from the model, use the following command.

m4 = lm(formula = autonomy ~ . - age , data = mydf4)
summary(m4)

# Call:
#   lm(formula = autonomy ~ . - age, data = mydf4)
# 
# Residuals:
#   Min      1Q  Median      3Q     Max 
# -10.645  -2.420  -0.055   2.417  10.816 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 24.66980    0.62662  39.369  < 2e-16 ***
# neglect     -0.26151    0.05361  -4.878 1.73e-06 ***
# abuse       -0.05122    0.07967  -0.643    0.521    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.83 on 304 degrees of freedom
# Multiple R-squared:  0.1011,	Adjusted R-squared:  0.09518 
# F-statistic:  17.1 on 2 and 304 DF,  p-value: 9.21e-08
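
As a quick check of what - does, the sketch below (simulated data, hypothetical names) confirms that `. - x3` fits the same model as listing the remaining predictors by hand.

```r
# Simulated data; all names are hypothetical.
set.seed(2)
d = data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

m_minus    = lm(y ~ . - x3, data = d)   # all columns except y, then drop x3
m_explicit = lm(y ~ x1 + x2, data = d)

names(coef(m_minus))
all.equal(coef(m_minus), coef(m_explicit))
```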

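To add an interaction term to the model, use the : operator.
- For example, to fit a model with autonomy as the dependent variable, neglect and abuse as independent variables, and the interaction between the two, use the following command.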

m5 = lm(formula = autonomy ~ neglect + abuse + neglect:abuse, data = mydf3)
summary(m5)

# Call:
#   lm(formula = autonomy ~ neglect + abuse + neglect:abuse, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.5692  -2.4235  -0.0107   2.4872  11.0013 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   25.61800    1.56131  16.408  < 2e-16 ***
# neglect       -0.33694    0.12577  -2.679  0.00779 ** 
# abuse         -0.19374    0.22923  -0.845  0.39867    
# neglect:abuse  0.01088    0.01641   0.663  0.50772    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.834 on 303 degrees of freedom
# Multiple R-squared:  0.1024,	Adjusted R-squared:  0.09351 
# F-statistic: 11.52 on 3 and 303 DF,  p-value: 3.571e-07
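
For two numeric predictors, the term created by : is just their product. The following sketch, on simulated data with hypothetical names, compares a model using x1:x2 with one that uses a precomputed product column.

```r
# Simulated data; all names are hypothetical.
set.seed(3)
d = data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y = 1 + 0.5 * d$x1 - 0.3 * d$x2 + 0.2 * d$x1 * d$x2 + rnorm(80)

m_colon = lm(y ~ x1 + x2 + x1:x2, data = d)

d$x1x2 = d$x1 * d$x2                      # product computed by hand
m_prod = lm(y ~ x1 + x2 + x1x2, data = d)

# Same estimates; only the coefficient names differ
all.equal(unname(coef(m_colon)), unname(coef(m_prod)))
```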

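When there are several independent variables and you want all of the interactions among them, use the * operator.
- For example, to enter the main effects of neglect, abuse, and age together with all two-way interactions and the three-way interaction, use the following command.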

m6 = lm(formula = autonomy ~ neglect * abuse * age, data = mydf3)
summary(m6)

# Call:
#   lm(formula = autonomy ~ neglect * abuse * age, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.6159  -2.4345   0.0037   2.5148  10.9317 
# 
# Coefficients:
#                   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       32.500684   9.218282   3.526 0.000489 ***
# neglect           -1.163237   0.955644  -1.217 0.224477    
# abuse             -1.010648   1.230678  -0.821 0.412179    
# age               -0.208272   0.298421  -0.698 0.485774    
# neglect:abuse      0.112349   0.100643   1.116 0.265186    
# neglect:age        0.025652   0.030302   0.847 0.397924    
# abuse:age          0.024035   0.040790   0.589 0.556159    
# neglect:abuse:age -0.003109   0.003208  -0.969 0.333385    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.845 on 299 degrees of freedom
# Multiple R-squared:  0.1092,	Adjusted R-squared:  0.08833 
# F-statistic: 5.235 on 7 and 299 DF,  p-value: 1.191e-05
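
To see which terms * generates without fitting anything, the formula can be expanded with terms(). The sketch below uses placeholder variable names (y, a, b, c) and also shows the ^ operator, which limits the interactions to a given order.

```r
# Placeholder variable names; no data are needed to expand a formula.
f_full = y ~ a * b * c
labels_full = attr(terms(f_full), "term.labels")
labels_full
# "a" "b" "c" "a:b" "a:c" "b:c" "a:b:c"

# (a + b + c)^2 keeps main effects and two-way interactions only
f_two = y ~ (a + b + c)^2
labels_two = attr(terms(f_two), "term.labels")
labels_two
# "a" "b" "c" "a:b" "a:c" "b:c"
```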

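To use a sum, difference, product, ratio, square, or other computed value of variables as an independent variable, the I() function lets you run the analysis without creating a new variable first.
- For example, suppose you want the sum of neglect and abuse as the independent variable. Writing ~ neglect + abuse in the formula means "enter neglect and abuse as two separate independent variables", because + is a formula operator here.
- Writing ~ I(neglect + abuse) instead means "evaluate the expression inside I(), that is, compute the sum of neglect and abuse, and enter the result as a single independent variable".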

m7 = lm(formula = autonomy ~ I(neglect + abuse), data = mydf3)
summary(m7)

# Call:
#   lm(formula = autonomy ~ I(neglect + abuse), data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.6535  -2.5483   0.0571   2.4255  10.5569 
# 
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)    
# (Intercept)        24.96914    0.60739   41.11  < 2e-16 ***
# I(neglect + abuse) -0.18420    0.03331   -5.53 6.88e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.845 on 305 degrees of freedom
# Multiple R-squared:  0.09113,	Adjusted R-squared:  0.08815 
# F-statistic: 30.58 on 1 and 305 DF,  p-value: 6.885e-08
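
The difference between the two formulas can be checked directly. In this sketch on simulated data (hypothetical names), I(x1 + x2) gives the same fit as a precomputed sum column, while x1 + x2 without I() estimates two separate slopes.

```r
# Simulated data; all names are hypothetical.
set.seed(4)
d = data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y = 1 + 0.4 * (d$x1 + d$x2) + rnorm(60)

m_I = lm(y ~ I(x1 + x2), data = d)   # one predictor: the sum

d$total = d$x1 + d$x2                # the same sum, computed beforehand
m_pre = lm(y ~ total, data = d)

m_two = lm(y ~ x1 + x2, data = d)    # two predictors: + is a formula operator

all.equal(unname(coef(m_I)), unname(coef(m_pre)))
length(coef(m_I))    # 2: intercept + one slope
length(coef(m_two))  # 3: intercept + two slopes
```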

</aside>

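2-2. Categorical independent variables

A categorical independent variable whose values are coded as character strings can be entered into the model as is, with no special handling.
- For example, a simple regression on mydf3 with sex as the independent variable and relatedness (related) as the dependent variable can be run with the following command.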

m8 = lm(formula = related ~ sex, data = mydf3)
summary(m8)

# Call:
#   lm(formula = related ~ sex, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.0732  -2.0732  -0.0732   1.9268   7.1569 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  22.8431     0.3265  69.974  < 2e-16 ***
# sex여성       1.2300     0.3995   3.079  0.00227 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.297 on 305 degrees of freedom
# Multiple R-squared:  0.03015,	Adjusted R-squared:  0.02697 
# F-statistic:  9.48 on 1 and 305 DF,  p-value: 0.002266

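A variable coded as character strings, such as sex, is automatically converted to a factor (that is, a categorical variable) and dummy coded.
- Here the dummy coding is as shown below, so the male group (`남성`) serves as the reference group.

  Category	Dummy variable 1
  `남성` (male)	0
  `여성` (female)	1
- In this case the intercept (22.8431) is the mean related score of the reference group (male), and the coefficient of sex여성 (1.2300) indicates that the mean of the female group is 1.2300 points higher than that of the reference group.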
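
The dummy coding that lm() builds behind the scenes can be inspected with model.matrix(). The sketch below uses a small made-up data set (the names d, group, score and the labels ctrl/trt are hypothetical) and confirms that the intercept is the reference-group mean and the slope is the difference between group means.

```r
# Made-up data; all names and labels are hypothetical.
d = data.frame(group = rep(c("ctrl", "trt"), each = 5),
               score = c(10, 12, 11, 9, 13, 15, 14, 16, 13, 17))

# Design matrix: a column of 1s plus a 0/1 dummy for the non-reference level
X = model.matrix(~ group, data = d)
colnames(X)          # "(Intercept)" "grouptrt"
X[, "grouptrt"]      # 0 for ctrl rows, 1 for trt rows

m = lm(score ~ group, data = d)
means = tapply(d$score, d$group, mean)   # ctrl = 11, trt = 15

coef(m)              # (Intercept) = 11, grouptrt = 4
```

The character column is converted to a factor whose levels are sorted, so the first level in sort order (here ctrl) becomes the reference group.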
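To make the female group the reference group, re-specify the levels of the sex variable with the factor() function, as follows.
- The levels argument of factor() takes a vector of the factor's level values, and the level given as the first element becomes the reference group.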

mydf3$sex = factor(x = mydf3$sex, levels = c('여성', '남성'))

m8 = lm(formula = related ~ sex, data = mydf3)
summary(m8)

# Call:
#   lm(formula = related ~ sex, data = mydf3)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -10.0732  -2.0732  -0.0732   1.9268   7.1569 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  24.0732     0.2303 104.542  < 2e-16 ***
# sex남성      -1.2300     0.3995  -3.079  0.00227 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.297 on 305 degrees of freedom
# Multiple R-squared:  0.03015,	Adjusted R-squared:  0.02697 
# F-statistic:  9.48 on 1 and 305 DF,  p-value: 0.002266
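
As an alternative to listing every level in factor(), base R's relevel() moves a single level to the front. The sketch below uses hypothetical ASCII labels ("m", "f") in place of the Korean values in mydf3 and shows that both approaches produce the same level order.

```r
# Hypothetical labels standing in for the values of sex in mydf3.
sex = c("m", "f", "f", "m", "f")

levels(factor(sex))                          # default: sorted, "f" "m"

f1 = factor(sex, levels = c("m", "f"))       # list the levels explicitly
f2 = relevel(factor(sex), ref = "m")         # or move one level to the front

levels(f1)                                   # "m" "f"
identical(levels(f1), levels(f2))            # TRUE
```

With only two groups the two approaches are interchangeable; relevel() is mainly convenient when a factor has many levels and only the reference group needs to change.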
