Sunday, 22 April 2018

Data Analysis Tips


1.    Introduction


Quantitative data analysis consists of the activities associated with organising data into information, in the form of tables and graphs, which can assist with understanding or deciding something.
The tips given here are especially for regular data analysis, where the reports generated should be consistent and comparable over a period of time.
The tips are most closely aligned with MS Excel, but are also generic.

2.    Report Title


2.1.  Use consistent titles for reports.  Suggested format:
“{Topic} {Time Period} as at {Date}”
For example:
Health Faculty Enrolments 2015 04 01 to 2015 06 30 as at 2015 07 02”
2.2.  State the information source in the title information (header, footer and / or title page).
2.3.  Further optional information, which can be stated in sub-titles, headers, footers or title pages, includes but is not limited to:
2.3.1.     report author;
2.3.2.     data selection criteria;
2.3.3.     target audience;
2.3.4.     limitations of usage of the data analysis; and
2.3.5.     copyright statements.

3.    Calculations


3.1.  Do all calculations using formulas – never manually (no results of manual calculations should be typed into data analyses).  Ensure that the formulas are the correct ones and are reading the correct source data.
3.2.  Estimate expected values, and / or use cross-checks of totals.  Investigate the causes of anomalies until all of them have been eliminated or explained.

4.    Cross-tabulations


4.1.  Cross-tabulations are known as “pivot tables” in MS Excel.
4.2.  Ensure that the rows and columns of each cross-tabulation are appropriate to represent what is being analysed.
4.3.  Ensure that all of the relevant rows and columns of the raw data are included in the source data for the cross-tabulation. 

5.    Graphs


Did you know?  In MS Excel, the F11 key instantaneously produces a basic graph, which can be customised using the Design function.

(Place the cursor anywhere in the table of the graph’s source information, which can be a pivot table or other table, and press F11.)

5.1.  Use the graph Design function (and Layout function if needed) to ensure that:
5.1.1.     the graph type is appropriate to the type of data being analysed;
5.1.2.     all of the relevant rows and columns of the graph’s source data are included in the graph;
5.1.3.     The graph has:
·       a title;
·       an appropriate scale; and
·       a legend or a data table.

5.2.  If you include data labels, these must also be done via the Design / Layout function – labels should never be typed in manually.

5.3 Exclude totals from the graph. (Including them would make for a very skew-looking graph, as everything else would be much smaller than the total.) If you want to include something that gives a sense of the overall picture, then include the average value rather than the total.

Friday, 20 April 2018

Data Standards


1.    Data Standard 1: Data collection purpose

The purpose of collecting the data / information[1] and how it is expected to be utilised must be clearly stated.

1.1.  Examine the purpose of collecting the data / information, and how it is expected to be utilised.  Typical questions:  Which of these (one or more) will you want to do?
·          Count how often things appear or happen;
·          Extract data into a report (and do further calculations in the report file);
·          Track the progress of a process;
·          Keep a register of something;
·          Use mail merge (inside a letter or report, or print mailing labels);
·          Print out (or view on-screen) lists of phone numbers to call;
·          Copy and paste groups of e-mail addresses into e-mail messages;
·          Extract information from a “knowledge base”, to help answer enquiries; and / or
·          Do calculations for invoicing.

1.2.  Write the purpose statement for collecting the information.
1.3.  Name the folder or system according to the purpose of the information.

2.    Data Standard 2: Data file structure

File structures should contain each field only once – there must be no duplicate information.

Note: If a new system is being designed in order to collect the data, the activities concerning the development, implementation and maintenance of that system are outside the scope of the data standards.  However, the system must facilitate compliance with the data standards:

2.1.  Set up a file structure that ensures that – wherever possible – each item is entered only once, and the single entry is referred to, over and over.
2.2.  Ensure that the validations and data descriptions facilitate compliance with the data standards.  Functionality such as the following assists with compliance:
·          MS Excel: drop-downs / lookup tables and VLookup;
·          MS Access and other database packages: related tables (i.e. relational database).


3.    Data Standard 3: Data integrity and validation

Data and information should be of the highest integrity.

·       Free text should be used as little as possible.
·       Lookup tables and predetermined lists should be a standard feature of databases.

·       Typing of information should be done once, and thereafter correct use of copy and paste method should be employed.

When capturing, editing and checking data:


3.1.  Use free text as little as possible, and lookup tables or predefined lists as much as possible.  Wherever a national set of values exists, use it.  (Examples: Stats SA’s “Health and Functioning” definitions; SAQA’s NQF Levels.)
3.2.  Wherever possible, use copy-and-paste instead of retyping.
3.3.  Format the inputs uniformly (e.g. phone numbers) – an “input mask” should be used to ensure this.
3.4.  Do as little as possible manually, and as much as possible via the tools and utilities available to you.  For example, use formulas inside documents (including MS Word) rather than typing in what has been found using a calculator or by counting; use pivot tables and graphs.
3.5.  Use data validations as much as possible.  (Spelling and grammar checks in MS Word are also validations.)
3.6.  Try to see the patterns in things.
3.7.  Try to see where things are the same as each other.
3.8.  Look for where there are risks of inaccuracies, and find ways to prevent these inaccuracies.

4.    Data Standard 4: File naming

Files should be consistently named.

4.1.  Format file names uniformly (e.g. “Qualifications offered as at yyyy mm dd.xlsx”: each time there is an update to the file, the name always starts with “Qualifications offered as at ” and the date, which is always part of the file name, is formatted as yyyy mm dd). 
4.2.  Where there are several versions of the same file, add “v{number}”, e.g. “v2”, before the date.

5.    Data Standard 5: Procedures

Procedures for handling data and information must be written and maintained, and adherence to the procedures must be monitored.

5.1.  Develop, maintain and adhere to procedures for handling data and information.
5.2.  Ensure that the procedures are included in the organisation’s system for cataloguing procedures (such as a Quality Management System or a Business Continuity System).
5.3.  Monitor that the data standards exist and are being met:
·          Data standards exist if a list of allowed values can be demonstrated.

Depending on the nature and use of the data, the parameters can be set narrowly or broadly.

·          Data validations are in use if:

o   The system has been programmed to use validations when people try to enter or edit data; and / or
o   Data or information is manually checked, and corrected if it does not meet the data standards.

In some circumstances, it is not possible to account for everything via validations, for instance of one does not wish to block everything from entering an information system.  In such cases, exception reports can be produced and acted upon, after capturing or loading the data.

·          Data standards are being met if:

o   The system contains only allowed values;
o   Aggregations and analyses make sense; and
o   There is consistency in the format and values of those data elements that have more flexibility allowed.

·          Sample table that can be used to assess whether data standards exist and are being met:

System
Data Standards Exist
(Y/N)
Data Validations are in use
(Y/N)
Data Standards are being met
(Rating as %)
















[1] Data are raw facts, such as one person’s test score.  Information is the product after data have been organised (aggregated or analysed).  Information assists with understanding or deciding something, such as the class average of the test scores assisting with decisions concerning the moderation of results.

Data Quality


Most people have a story to tell about what can go wrong with data quality and data management.

The following are just a few examples:

·       Being assigned the incorrect gender on your ID document;
·       Being denied credit due to a mix-up with someone else’s blacklisted information;
·       Money paid to the wrong creditor due to incorrect capturing of an account number;
·       Misquoted youth unemployment statistics due to incorrect capturing of birth dates;
·       Impossible transformation statistics due to incorrect capturing of “population groups”;
·       Colleges not being able to verify that students qualified there because of incorrectly stated National ID numbers (either at the college or by the enquirer); and
·       Lecturers sorting only one column of a spreadsheet and students receiving each others’ results.

These are all cases where there was no fraud or identity theft intended, yet the consequences are equally problematic.

Most people also believe that they are not capable of making such mistakes, yet mistakes are being made all the time.  All it takes is a momentary lapse in concentration.  However, simple methods can and should be put in place to prevent mistakes.  The type of method depends on the type of data[1] being captured.

The most common methods are as follows:

1.   For data elements that must always be the same (over and over again), such as gender, population group (sometimes referred to as “equity”), province, qualification type, NQF Level, achievement status: lookup tables must be used.  In databases and spreadsheets, these appear as drop-down menus, and only the data elements already present in each menu are accepted for each data field.[2]

2.    For data elements that are not always the same but only have certain allowed values, totals or combinations: validations must be used.  For example, it must not be possible to assign times that add up to more than 24 hours in a day; dates must all be formatted the same as each other; it must not be possible to capture a Bachelor’s degree with NQF Level 4.

3.   For data elements that are truly free-form, such as names, addresses, qualification titles: it is not possible to put in complete validations, although even here there can be some validations, for instance e-mail addresses must contain “@” and must have no spaces; certain characters are allowed or disallowed; it may be necessary to convert special characters to plain characters, e.g. é to e.  In general, for free-form data elements, an additional check must be used, the most effective being “eyeballing” by someone other than the person who has done the capturing. 

Items 1 and 2 should be provided by whoever develops and maintains the database or spreadsheet.  It is helpful if the lookup tables and validations incorporate whatever rules are laid down by the load specifications of any other system that the data must feed into.  (If not, then a more complex mapping process has to be utilised when transmitting data from one system to another.)

Item 3 is the responsibility of each individual who captures the data.  A data standards document should be made available to all data capturers.  It should go into more detail about the “how” of data quality.



[1] Data are raw facts, such as one person’s test score.  Information is the product after data have been organised (aggregated or analysed).  Information assists with understanding or deciding something, such as the class average of the test scores assisting with decisions concerning the moderation of results.

[2] A data field is a place where one stores data, such as a column in a database / spreadsheet, or a specific place (field) on a data entry form or web form.

Shortcut keys on the computer (Windows)

Shortcut Keys on the Computer ©  Yvonne Shapiro 2019 Shortcut keys in Windows The Windows (“Vlaggie”) key is shown as “W-” i...