Friday, July 17, 2015

Converting data from wide to long (PROC TRANSPOSE)

Longitudinal data (also called panel data/time-series cross section data) are getting increasing popular. How can one handle them effectively? This post deals with one crucial aspect: converting data from "long" format to "wide" format. If you aren't familiar with long/wide, take a look at this page.

First, let's generate a dataset from scratch:

/* create a sample data set named test */
data test;
input china russia usa; /* specifies variable names */
datalines;
1 2 3
4 5 6
7 8 9
;
run;

You should have this table.




But we aren't done with data generation yet. Let's continue creating more variables.

proc means noprint; *generates means, does not print output;
var _numeric_; /*specifies all numeric variables in the current DATA step, which means, all variables*/
output out=new(drop=_type_ _freq_) mean= median= std= q1= q3= / autoname;
/* drop = _type_ _freq_ means you drop the two variables so named*/
/* additional variables generated are mean, median, std, q1, q3*/
run;

/* Transposes the dataset */
proc transpose data=new out=out;
run;

Now we have artificially created all data.



However, there are two issues:
(1) it is in "long" format. We want it in "wide" format.
(2) ideally, we don't want a single variable to take on values such as "china_Mean". We'd prefer to have two variables, the first taking on "China", "Russia", "USA", and the second taking "mean", "median", etc.

Let's deal with issue 2 first. We use the scan function and indicate (with '_') that _ is the separator. _name_ means we'll take any word, and 1 means the first instance, 2 means the second instance.

/* Separates the variables */

data separated;
set out;
varname=scan(_name_,1,'_');
stat=scan(_name_,2,'_');
drop _name_;
run;

proc sort data=separated;
by varname;
run;

The last three lines sort the data.

Next, we need to transpose the data.

proc transpose data=separated out=transposed(drop=_name_);
by varname;
id stat;
var col1;
run;

proc print data=transposed;
title 'Transposed Data!';
run;

And you can admire the finished product (for a little while, before going to learn more SAS).


Click here to download all the code of today's post.

Monday, July 13, 2015

PROC SQL

One of cool things about SAS is that, unlike R and Stata, you can use SQL commands as part of the base package.

I'll assume you have the auto dataset loaded.

Since you've loaded the auto dataset, we can skip the data step and go directly to "proc sql":

Task #1: select all observations

proc sql;
select * from auto;
run;

Task #2: select observations where price of car is above $8000

proc sql;
select * from auto where price > 8000;
run;

Task #3: find the average price of foreign and domestic cars

proc sql;
select avg(price) as avgprice, foreign from auto group by foreign;
run;

Task #4: order observations according to price

proc sql;
select * from auto order by price;
run;

Task #5: find observations where the model contains AMC

proc sql;
select * from auto where make contains 'AMC';

run;

Task #6: find the number of cars which are foreign and domestic

proc sql;
select count(make) as numberofcars, foreign from auto group by foreign;
run;

Task #7: find the number of cars which are above $8000 and less than $8000

proc sql;
select count(make) as numberofcars, (price > 8000) as highprice from auto group by highprice;

run;

Task #8: create a table to store data (and subsequently print it)
proc sql;
create table exampletable as select * from auto where price > 8000;
run;

proc print data = exampletable;

run;

Introduction to SAS for Stata users

You've learned lots of Stata. However, you need to learn SAS, possibly because it can handle large datasets efficiently, or possibly there are lots of SAS jobs (a search of LinkedIn at time of writing reveals there are 7339 SAS jobs). If this description sounds like you, don't fret! This blogpost is made just for you.

I assume that you have SAS installed and loaded. You should be seeing a screen like this:


In this tutorial, we will be learning five basic tasks:

  1. Opening a dataset
  2. Creating a "do-file"
  3. Generate summary statistics
  4. Generate correlation matrices
  5. List observations fulfilling certain criteria
Task #1: Opening a dataset

Download the auto dataset. Click the relevant buttons to import a CSV file. 

Click on File > Import Data


Choose the relevant commands to import the file. You can name the dataset as you wish. 

Assuming you've successfully imported the dataset, there will be new output in the log window, which should end with

NOTE: WORK.AUTO data set was successfully created.
NOTE: The data set WORK.AUTO has 74 observations and 12 variables.

Task #2: Creating a do-file

When you open Stata, you have to press Ctrl+9 to create a do-file. In SAS, a do-file window is automatically generated for you. See the Editor window (the bottom right)? That's the do-file, and you can start typing in it.

Task #3: Generate summary statistics 


Now, it's time to start learning some actual commands. Type the commands into the Editor window. Some pointers:

  • Comments start with * (these are in green)
  • Commands are in blue
    • These start with proc
    • And end with run
  • Don't forget semi-colons. The program will not run properly without it


* Stata's equivalent of tabulate;
proc freq data = auto;
tables rep78 foreign;
run;

* two way tabulation;
proc freq data = auto;
tables rep78*foreign;
*if you want percentages only;
* tables rep78*foreign / norow nocol nofreq;
run;
* Stata's equivalent of summarize;
proc means; * data = auto is optional, since you've already loaded auto;
var mpg rep78;
run;
* If you want even more detailed summary statistics;
proc univariate; * data = auto is optional, since you've already loaded auto;
var mpg rep78;
run;


Task #4: Generate correlation matrices 

* Stata's equivalent of corr;
proc corr;
run;
* just between certain variables;
proc corr;
var price weight;
run;
*suppress significance;
proc corr noprob;
run;

Task #5: List observations fulfilling certain criteria

* Stata's equivalent of list;
proc print data=auto;
run;

* to list observations fulfilling certain criteria;
proc print data=auto;
where price > 10000 & rep78; * excludes all observations which rep78 is missing, or rep78 == 0;
run;