# AnalyticBridge

A Data Science Central Community

# Data preparation for factor analysis

Hello --

I am a graduate student who is just setting off on her career in analytics and I am delighted there is a community like this one. Now, my first question!

I have a series of variables on which I would like to do a factor analysis and they have to do with respondents' feelings about a decision. Because of the way a question was asked, sometimes saying "true" means 'I feel great!' and sometimes it means 'I feel terrible.'

Do I need to standardize these before doing the factor analysis?

If so, are there any pitfalls to avoid?

Thanks kindly,
Sarah

Tags: analysis, factor


### Replies to This Discussion

Sarah, from the way I see you explain this problem, here's my thought.

Problem Statement:
You've got a set of Responder Survey Data. Every question has 5 levels of response - from 'BEST' a.k.a 'I feel great' to 'WORST' a.k.a 'I feel terrible' - meaning 'A' through 'E' or '1' through '5' ordinal set of data points.

Some questions - 'A' or '1' or 'True' = 'Feeling Great' & 'E' or '5' or 'False' = 'Feeling Terrible'
Remaining - Just the opposite of above.

My Approach:
For a factor analysis, or for that matter any kind of analysis, you aren't required to change the way these variables were coded unless you find the results difficult to interpret. Effectively, whether you recode or not, you're going to get the same result.

So what changes then?

Just the signs of your coefficients! For a recoded variable, what was positive earlier changes to negative now, and vice versa.
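To make the sign-flip point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the responses are simulated, the item names (`at_peace`, `unforgiven`, `relieved`) are hypothetical, and loadings on the first principal component of the correlation matrix stand in for factor loadings (it is not a full factor analysis with rotation).

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the simulated survey is reproducible


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def corr_matrix(items):
    """Correlation matrix of a list of item response lists."""
    return [[pearson(a, b) for b in items] for a in items]


def first_component_loadings(corr, iterations=500):
    """Loadings on the first principal component, found by power iteration."""
    n = len(corr)
    v = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * x for x in w) ** 0.5
        v = [x / eigenvalue for x in w]
    if v[0] < 0:  # sign convention: the first item loads positively
        v = [-x for x in v]
    return [x * eigenvalue ** 0.5 for x in v]


def answers_true(signal):
    """Simulated True(1)/False(0) answers driven by a latent signal plus noise."""
    return [1 if s + random.gauss(0, 0.7) > 0 else 0 for s in signal]


# Hypothetical latent "feeling about the decision" for 500 respondents.
mood = [random.gauss(0, 1) for _ in range(500)]
at_peace = answers_true(mood)                    # True = feeling good
unforgiven = answers_true([-m for m in mood])    # True = feeling bad
relieved = answers_true(mood)                    # True = feeling good

raw_items = [at_peace, unforgiven, relieved]
recoded_items = [at_peace, [1 - a for a in unforgiven], relieved]

raw_loadings = first_component_loadings(corr_matrix(raw_items))
recoded_loadings = first_component_loadings(corr_matrix(recoded_items))

print("raw:    ", [round(x, 3) for x in raw_loadings])
print("recoded:", [round(x, 3) for x in recoded_loadings])
```

In this sketch the loading of the recoded item keeps its magnitude and changes sign, while the other loadings are unchanged.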

Now, if you have a different question in mind other than that discussed here - things could be different! Hope this helps.
Hi Arun (& others) --

Let me add some details; I think you've got it but I'm not sure as to the implication for my coefficients.

Question 1: Do you feel at peace with your decision? True or False
Question 2: Do you think God will never forgive you? True or False.

So with question 1, true means feeling good about decision while for question 2, true means you (might) feel pretty bad about it.

And there are other questions.

I do a factor analysis, following the steps outlined here: http://dss.princeton.edu/training/Factor.pdf

I get two factors which include a mixture of true=feeling good and true=feeling bad questions. I include those factors in a regression analysis. Now I want to interpret the regression coefficient but I'm not sure how to interpret a one unit change in factor 1...are you feeling better or worse about your decision?

Thanks again,
Sarah
When I was into marketing research, I did what Jeff mentioned. Flip the scales so that the numbers convey the same message throughout.

Regards,
Datalligence
For interpretation of an FA, make sure the scales are consistent (for example, on a Likert scale, lower should always mean worse and higher should always mean better; if the actual survey did not do this for all questions, flip them prior to the FA).
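Flipping a scale is simple arithmetic: the new value is (lowest code + highest code) minus the old value. A minimal sketch in Python, where the function name `reverse_code` and the example responses are made up for illustration:

```python
def reverse_code(responses, low=1, high=5):
    """Flip a scale so the lowest code becomes the highest and vice versa."""
    flipped = []
    for r in responses:
        if not low <= r <= high:
            raise ValueError(f"response {r} is outside the {low}-{high} scale")
        flipped.append(low + high - r)
    return flipped


# A 1-5 Likert item where 1 meant "best": after flipping, 5 means "best".
likert_flipped = reverse_code([1, 2, 5, 4, 3])

# A True/False item coded 1/0: flipping swaps the two codes.
binary_flipped = reverse_code([1, 0, 1], low=0, high=1)

print(likert_flipped)  # [5, 4, 1, 2, 3]
print(binary_flipped)  # [0, 1, 0]
```

Apply this only to the items whose wording runs in the opposite direction, and do it before computing correlations or running the FA.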

I have learned that factor analysis assumes interval level data.
See http://faculty.chass.ncsu.edu/garson/PA765/factor.htm#assume

Correspondence analysis is better suited for categorical data.

It all depends on the task: if the factors are being loaded into a predictive model, it might be *useful* if not textbook accurate.
Sarah, to answer your question in short (I think you will already have thought of this!): go ahead and standardize what 'True' means across all questions. That way, interpretation becomes easier.

Like I said, it's only if you need to interpret that you might need to change it, not otherwise.

So, when you use your factors from the FA in a regression, I think it's important that you first understand how the variables load on your factors. You'll probably need a rotated set of factors so that you can say one 'type' of variable loads on factor 1 while another 'type' loads on factor 2. Then, if factors 1 and 2 occur in the equation y = a + b1(f1) + b2(f2) + e, you can use the loadings to interpret the corresponding effects of the individual variables!
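Here is a rough sketch in Python of why the loadings matter when reading a regression coefficient. It is a deliberately simplified one-factor version of the equation above (y = a + b(f) + e), with simulated factor scores and a hypothetical outcome; none of these numbers come from Sarah's data.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the simulated data are reproducible


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical factor scores, oriented so that higher = feeling better.
factor_scores = [random.gauss(0, 1) for _ in range(200)]
# Hypothetical outcome that rises with the factor.
outcome = [2.0 + 1.5 * f + random.gauss(0, 1) for f in factor_scores]

a, b = fit_line(factor_scores, outcome)

# The same factor oriented the other way (higher = feeling worse),
# which is what you get if the loadings all carry the opposite sign.
a_flipped, b_flipped = fit_line([-f for f in factor_scores], outcome)

print("higher = better:", round(a, 3), round(b, 3))
print("higher = worse: ", round(a_flipped, 3), round(b_flipped, 3))
```

The intercept and the fitted values are the same in both fits; only the sign of the slope changes. So a one-unit increase in a factor means "feeling better" only if the variables loading positively on it are coded so that higher means better, which is why checking the loadings comes first.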

P.S. Why don't you try both - standardizing and not standardizing, and see for yourself what difference it makes! I'd be interested to know what kind of difference it can make other than what I think it can!

Jeff, a BIG thanks for the link. I never knew CA was a substitute for PCA when the variables are categorical! One thing I'm not sure about is whether all variables need to be on similar scales. Some could have only 3 levels of response, others 4 or 5 levels, and another set could be binary (yes/no). Finally, what matters is how well the variance is explained by every variable that's going to be factored.
Also, I think I made a mistake when I said ordinal levels - it's actually interval levels! The data is a measure of 'degree of something' - from low to high! Thanks for the correction!
I have finally gotten around to standardizing what true and false meant. It changed the signs on the factor loads and subsequently on those factors in the regression equation. This makes perfect sense.

Thank you all for your help! I'm thrilled to be a part of the AnalyticBridge community and maybe one day I'll have the expertise to answer someone else's questions.
I knew that's all the effect it would have! Good to know that from you again! :)
Sure you will one day!!!

Cheers!
Hai Arun,
I am doing my project work on measuring the level of implementation of HSE systems. The questionnaire used to conduct the survey consists of binary responses (Yes/No); I am enclosing the questionnaire. My supervisor advised me to do a factor analysis to find the interdependencies. Could you kindly tell me whether I can do a factor analysis in this case using SPSS?

The questions need to be tightened up considerably. For many of them, giving the answer "No" would be the equivalent of saying "I am incompetent at my job". How many respondents are going to be honest enough to confess that?

The wording of most questions needs to be made more empirical. What does "Are there effective arrangements for reviewing the health and safety policy at least once a year?" mean? The critical word in that question is "effective". How would you measure whether "arrangements" are "effective"? Why not actually analyze the data on the number and seriousness of incidents, the minutes of the committees or personnel who are supposed to implement these "arrangements", the amount of time devoted to various safety activities, the number of unannounced inspections, or other items that are harder to fake than the answers to these very general and woolly questions?

If you don't know much about stats, I would suggest that you postpone doing factor analysis until you have learned enough to understand what the method tells you, and what the possibilities of bias or misinterpretation are.

Incidentally, is there any good reason for coding Yes=1, No=0 for all the questions? I would have thought that some of the questions should be weighted more heavily than others. For that reason I would expect to see weights of 23, 47.6, 0.8, 5 million or whatever. I would suggest that you read up on magnitude ratio scaling. Try Sellin and Wolfgang and/or Holmes and Rahe.

Laurie