J'ai les vecteurs X et Y simples suivants:
> X
[1] 1.000 0.063 0.031 0.012 0.005 0.000
> Y
[1] 1.000 1.000 1.000 0.961 0.884 0.000
>
> plot(X,Y)
Je veux faire une régression en utilisant le journal de X. Pour éviter d'obtenir le journal (0), j'essaie de mettre +1 ou +0,1 ou +0,00001 ou +0,000000000000001:
> summary(lm(Y~log(X)))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'x'
> summary(lm(Y~log(1+X)))
Call:
lm(formula = Y ~ log(1 + X))
Residuals:
1 2 3 4 5 6
-0.03429 0.22189 0.23428 0.20282 0.12864 -0.75334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7533 0.1976 3.812 0.0189 *
log(1 + X) 0.4053 0.6949 0.583 0.5910
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4273 on 4 degrees of freedom
Multiple R-squared: 0.07838, Adjusted R-squared: -0.152
F-statistic: 0.3402 on 1 and 4 DF, p-value: 0.591
> summary(lm(Y~log(0.1+X)))
Call:
lm(formula = Y ~ log(0.1 + X))
Residuals:
1 2 3 4 5 6
-0.08099 0.20207 0.23447 0.21870 0.15126 -0.72550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0669 0.3941 2.707 0.0537 .
log(0.1 + X) 0.1482 0.2030 0.730 0.5058
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4182 on 4 degrees of freedom
Multiple R-squared: 0.1176, Adjusted R-squared: -0.103
F-statistic: 0.5331 on 1 and 4 DF, p-value: 0.5058
> summary(lm(Y~log(0.00001+X)))
Call:
lm(formula = Y ~ log(1e-05 + X))
Residuals:
1 2 3 4 5 6
-0.24072 0.02087 0.08796 0.13872 0.14445 -0.15128
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.24072 0.12046 10.300 0.000501 ***
log(1e-05 + X) 0.09463 0.02087 4.534 0.010547 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1797 on 4 degrees of freedom
Multiple R-squared: 0.8371, Adjusted R-squared: 0.7964
F-statistic: 20.56 on 1 and 4 DF, p-value: 0.01055
>
> summary(lm(Y~log(0.000000000000001+X)))
Call:
lm(formula = Y ~ log(1e-15 + X))
Residuals:
1 2 3 4 5 6
-0.065506 0.019244 0.040983 0.031077 -0.019085 -0.006714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.06551 0.02202 48.38 1.09e-06 ***
log(1e-15 + X) 0.03066 0.00152 20.17 3.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04392 on 4 degrees of freedom
Multiple R-squared: 0.9903, Adjusted R-squared: 0.9878
F-statistic: 406.9 on 1 and 4 DF, p-value: 3.565e-05
La sortie est différente dans tous les cas. Quelle est la valeur correcte à mettre pour éviter le log (0) en régression? Quelle est la bonne méthode pour de telles situations.
Edit: mon objectif principal est d'améliorer la prédiction du modèle de régression en ajoutant un terme logarithmique, c'est-à-dire: lm (Y ~ X + log (X))