SAS and PROC SQL

Structured Query Language (SQL) is a standardized, widely used language that retrieves and updates data in relational tables and databases.

The SQL procedure is SAS’ implementation of Structured Query Language. PROC SQL is part of Base SAS software, and you can use it with any SAS data set (table). Often, PROC SQL can be an alternative to other SAS procedures or the DATA step. You can use SAS language elements such as global statements, data set options, functions, informats, and formats with PROC SQL just as you can with other SAS procedures.

PROC SQL  is used in analytics to :

  • Retrieve data from database tables or views (Oracle or SQL Server)
  • Combine SAS datasets from tables or views (MERGE)
  • Create datasets and indexes
  • Compute statistics and Generate reports

Data extraction with PROC SQL 

 We often use PROC SQL to extract data from various warehouses. Below is an example reproduced from a previous chapter.

PROC SQL;

Connect To ORACLE(User=901115644 Password=ypasswd Buffsize=10000 Path=BDCIT1 Preserve_Comments );

CREATE TABLE acct_status AS SELECT * FROM Connection To ORACLE

(SELECT current_account_nbr AS account_number,  external_status_reason_code AS

ext_rcode,external_status AS estatus,

billing_cycle_day AS billing_cycle_day

FROM

ACCOUNT_DIM

WHERE CLIENT_ID='BROOK BROS'

AND

nvl(EXTERNAL_STATUS_REASON_CODE,'0') <>'98');

Disconnect From ORACLE;

Quit;

In the above example PROC SQL use a connect string  “Connect To ORACLE(User=901115644 Password=ypasswd Buffsize=10000 Path=BDCIT1 Preserve_Comments )” to identify the oracle database (BDCITI) and use the user name and passwords specified  to read data from it. Further,  CREATE TABLE statement creates a SAS Dataset from the output of SELECT statement. 

When querying a data warehouse, PROC SQL automatically converts the field formats in to SAS formats. For example, a date field in Oracle will be converted into SAS date and an Oracle Varchar field will be converted into character.

SAS Data steps with PROC SQL

PROC SQL can perform some of the operations that are provided by the DATA step and the PRINT, SORT, and SUMMARY procedures. 

Let us create a dataset and then we will see how PROC SQL works like a DATA step.

DATA account_perf;

INPUT client $ account fico_seg $  current_os tot_payment  ;

cards;

Cmart 1002 401-500 300  100

Cmart 1003 501-600 200  150

Cmart 1004 601-700 1200 180

Cmart 1005 701-800 800  190

Cmart 1006 801-900 450  200

GIA     1007 401-500 560  210

GIA     1008 501-600 450  180

GIA     1009 601-700 900  145

GIA     1110 701-800 300  148

;

run;

 

PROC SQL;

title "Summary of O/S and Payments by Client";

select client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

from account_perf

group by client

order by tot_os descending ;

quit;

The above program block shows how a PROC SQL is substituting a PROC PRINT, PROC SORT and PROC SUMMARY.

Line 2: Assigns a title to the output of SELECT statement that follows

Line 3: Select statement with group function SUM used to summarize the data. GROUP BY clause is used to compute the SUM for each distinct group in the database. ORDER BY is used to sort the output and DESCENDING keyword is to control the sort order. It is also possible to SORT by multiple variables.

A PROC SQL statement ends with ‘QUIT\;’ and  it terminates the procedure. Output is always printed to the screen (like PROC PRINT) and it’s also possible to create a SAS dataset from the output. To create a dataset from the output the above program can be modified as follows:

PROC SQL;

title "Summary of O/S and Payments by Client";

create table summary as select client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

from account_perf

group by client

order by tot_os descending ;

quit;

Line 3: Note the CREATE TABLE <table name> AS statement.

 

PROC SQL and SELECT statement

We have seen above how SELECT statements are used in PROC SQL. Most of data retrieval and data combining are done using select statement. We will see some sample select statements and how it is used conditionally to work with data. In the example above the simple SELECT statement is shown below.

SELECT client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

FROM account_perf

The SELECT statement must contain a SELECT clause and a FROM clause, both of which are required in a PROC SQL query. Other clauses added to SLECT statements to restrict the data retrieval or conditional processing . Those clauses are WHERE, ORDER BY, GROUP BY and HAVING.

The WHERE clause restrict the data that you retrieve by specifying a condition that each row of the table must satisfy. In our example below, clients are restricted to GIA and Cmart only.

SELECT client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

FROM account_perf

WHERE client in ('GIA', 'CMART')

ORDER BY sorts the output in ascending or descending order as specified.  In our example below total outstanding in sorted in a descending order.

SELECT client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

FROM account_perf

WHERE client in ('GIA', 'CMART')

ORDER BY tot_os descending ;

GROUP BY computes the statistics for each category of values in the specified variable. A summary or group function like average or SUM in SELECT statement is followed by GROUP BY clause to instruct SAS that the statistics should be computed for each group of data. Let us look at our modified example to see how total outstanding and total payments are computed client wise.

PROC SQL;

SELECT client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

FROM account_perf

GROUP BY  client;

quit;

The HAVING clause works with the GROUP BY clause to restrict the groups in a query’s results based on a given condition. PROC SQL applies the HAVING condition after grouping the data and applying aggregate functions. For example, the following query restricts the groups to include only the client GIA.

PROC SQL;

SELECT client, sum(current_os) as tot_os , sum(tot_payment)as tot_payment

FROM account_perf

GROUP BY  client

HAVING Client='GIA';

quit;

 

Data retrieval Methods using SELECT

Using the dataset in above example, we will demonstrate how to retrieve data from a single table and how to create SAS datasets from resultant output.

 

      I.      SELECT – All columns in a Table

 

PROC SQL;

SELECT * FROM account_perf;

quit;

      II.      SELECT- Specific columns in a Table

PROC SQL;

SELECT  client,current_os  FROM account_perf;

quit;

    III.      SELECT-How to create dataset from SELECT statements

As we have seen from the previous examples, just add a CREATE TABLE <Table Name> AS before the SELECT statements. So the above example so modified would look like as follows

PROC SQL;

CREATE TABLE Sample_Dataset AS

SELECT  client,current_os  FROM account_perf;

quit;

  IV.      SELECT- Eliminating duplicate rows

PROC SQL;

SELECT  DISTINCT client  FROM account_perf;

quit;

  V.      SELECT-Computing values

PROC SQL;

SELECT  client,(current_os/1000)as OS_in_1000  FROM account_perf;

quit;

Note that row level computing can be done using the formula or multiple columns.

  VI.      SELECT-Assigning Column Alias and formatting it.

We have seen earlier in our examples that a new column name is formed with ‘AS’ statement in SELECT statement. Its also possible to specify the format of that variable in PROC SQL. Let us have a look at how OS_in_1000 variable is formed.

PROC SQL;

SELECT  client,(current_os/1000)as OS_in_1000 format =4.2  FROM account_perf;

quit;

  VII.      SELECT – Conditional Assignment using CASE

Using CASE statement for conditional processing is a powerful feature of SAS Data step and PROC SQL. Here is an example where in SELECT statement CASE statement is used to create a new field Risk_Category based on certain conditions.

PROC SQL;

SELECT  client,current_os,

CASE

WHEN current_os <= 300 THEN 'Low Risk'

WHEN current_os <= 800 THEN 'Med Risk'

ELSE 'High Risk'

END AS Risk_category

from account_perf;

quit;

 

Note that unlike DATA step CASE constructs, in PROC SQL each line does not end with a semi column. Also ‘AS’ key word is logically follows after the END of the loop.

VIII.      SELECT-Specifying COLUMN attributes

You can specify the following column attributes, which determine how SAS data is displayed:

FORMAT=

INFORMAT=

LABEL=

LENGTH=

If you do not specify these attributes, then PROC SQL uses attributes that are already saved in the table or, if no attributes are saved, then it uses the default attributes. Let us have look at an example:

PROC SQL;

SELECT  client,current_os format =4.2 label ='Current Outstanding'  FROM account_perf;

quit;

IX.      SELECT – Using Sub queries

SUB Queries and Queries inside a Query. Instances where the WHERE clause evaluates the output of another SELECT statement, the second SLECT statement is known as a SUB Query. Here is an example:

PROC SQL;

SELECT  client,current_os 

FROM account_perf

WHERE client IN (SELECT   distinct client  FROM account_perf

where tot_payment >180);

quit;

Though not a real life scenario, the above example demonstrates how a Sub-Query is used. The sub-query returns a list of clients that had at least one payment more than $180 and their current outstanding is listed for all accounts. In real life, esp when we work with multiple tables, sub queries are very useful to frame the right WHERE clauses for data retrieval.  Now let us look at some conditional operators used in a WHERE clause of a SELECT Statement. Exercise for you is to frame a query using these operators. Consult some online help for syntax help.

 

Operator

 

Definition

 

ANY

Specifies that at least one of a set of values obtained from a sub query must satisfy a given condition

ALL

Specifies that all of the values obtained from a Sub query must satisfy a given condition

BETWEEN-AND

Tests for values within an inclusive range

CONTAINS

 

Tests for values that contain a specified string

EXISTS

 

Tests for the existence of a set of values obtained

From a sub query

IN

Tests for values that match one of a list of values

IS NULL or IS MISSING

Tests for missing values

LIKE

Tests for values that match a specified pattern

X.      SELECT- GROUP FUNCTIONS

There are a lot of group functions we can use in SELECT Statement of a PROC SQL. When you use an aggregate function, PROC SQL applies the function to the entire table, unless you use a GROUP BY clause. Here is an example:

PROC SQL;

SELECT client, avg(current_os) as avg_os , avg(tot_payment)as avg_payment

FROM account_perf

GROUP BY  client;

quit;

 

PROC SQL;

SELECT client, avg(current_os) as avg_os , avg(tot_payment)as avg_payment

FROM account_perf;

quit;

When you execute these program blocks, the first PROC SQL computes the averages for each client group and the second PROC SQL computes them for the entire table.   Having seen how a group function is used, below given is a list of group functions you can use in a PROC SQL statement. Note that all of these are substitutes for a PROC SUMMMARY or PROC MEANS statistics.

 

Function

Definition

AVG, MEAN

Mean or average of values

 

COUNT, FREQ, N

Number of nonmissing values

 

CV

Coefficient of variation (percent)

 

MAX

Largest value

MIN

Smallest value

NMISS

Number of missing values

RANGE

Range of values

 

STD

Standard deviation

STDERR

Standard error of the mean

SUM

Sum of values

VAR

Variance

 

 

XI.      SELECT – Joining the tables

 When we work with multiple SAS tables we often will have to join them for data processing. We commonly use SAS DATA step and MERGE method to achieve this task. PROC SQL can be used as a simpler substitute for the MERGE data step method. Let us create another dataset so that we can demonstrate the example.

DATA account_perf2;

INPUT  account fico_seg $  prev_os prev_payment  ;

cards;

1002  401-500 400  100

1003  501-600 300  150

1004  601-700 400 180

1005  701-800 900  190

2009  601-700 600  145

1007  401-500 660  210

1008  501-600 750  180

2006  801-900 550  200

2009  601-700 600  145

2110  701-800 400  148

;run;

Now to join these tables for all common accounts (equi-join), we use DATA step and MERGE statement as follows. Note that in the data step, dataset should be sorted before we MERGE them.

proc sort data=account_perf out=account_perf;

by account;

run;

proc sort data=account_perf2 out =account_perf2;

by account;

run;

Data merged1;

merge  account_perf(in=a) account_perf2(in=b);

by account;

if a=b;

run;

Now the same results can be achieved using PEOC SQL as follows.

Proc SQL;

create table merged2 as Select a.*, b.*  from  account_perf a,

account_perf2 b

where

a.account=b.account;

quit;

The above example demonstrates that how PROC SQL can simplify the coding.  Not only that we could avoid the data sort, now we can make use of the powerful WHERE clause to exactly tell SAS various conditions of merging. With the use of Sub-queries and conditional processing (like IN, NOT IN , LIKE , CONTAIN)  in WHERE clause,  we can achieve any combination of data merging.

PROC SQL in SAS also provides direct merging of multiple tables with RIGHT JOIN, LEFT JOIN and FULL JOIN keywords for various Outer joins. Readers are requested to explore them as well.

Copyright free public information. All trademarks,service marks, logos and names are properties of their respective owners.