Presented by
Michael Kano, ACDA
Data Analytics Consultant, Arbutus Analytics
Tracking Down
Outliers
About Jim Kaplan, CIA, CFE
 President and Founder of AuditNet®,
the global resource for auditors
(available on iOS, Android and
Windows devices)
 Auditor, Web Site Guru,
 Internet for Auditors Pioneer
 IIA Bradford Cadmus Memorial Award
Recipient
 Local Government Auditor’s Lifetime
Award
 Author of “The Auditor’s Guide to
Internet Resources” 2nd Edition
ABOUT AUDITNET® LLC
• AuditNet®, the global resource for auditors, serves the global audit
community as the primary resource for Web-based auditing content. As the first online
audit portal, AuditNet® has been at the forefront of websites dedicated to promoting the
use of audit technology.
• Available on the Web, iPad, iPhone, Windows and Android devices and
features:
• Over 3,200 Reusable Templates, Audit Programs, Questionnaires, and
Control Matrices – unlimited downloads for subscribers.
• Webinars focusing on fraud, data analytics, IT audit, and internal audit
with free CPE for subscribers and site license users.
• Audit guides, manuals, and books on audit basics and using audit
technology.
• LinkedIn Networking Group with over 10K members.
• Monthly Newsletters with Expert Guest Columnists.
• Surveys on timely topics for internal auditors.
Introductions
HOUSEKEEPING
This webinar and its material are the property of AuditNet® and its Webinar partners.
Unauthorized usage or recording of this webinar or any of its material is strictly forbidden.
• If you logged in with another individual’s confirmation email, you will not receive
CPE, as the confirmation login is linked to a specific individual.
• This Webinar is not eligible for viewing in a group setting. You must be logged in with
your unique join link.
• We are recording the webinar and you will be provided access to that recording after
the webinar. Downloading or otherwise duplicating the webinar recording is expressly
prohibited.
• If you meet the criteria for earning CPE, you will receive a link via email to download
your certificate. The official email for CPE will be issued via cpe@email.cpe.io and it is
important to white list this address. It is from this email that your CPE credit will be sent.
There may be a processing fee to have your CPE credit regenerated if you did not
receive the first mailing.
• Submit questions via the chat box on your screen and we will answer them either during
the webinar or at its conclusion.
• You must answer the survey questions after the Webinar or before downloading your
certificate.
IMPORTANT INFORMATION
REGARDING CPE!
• ATTENDEES - If you attend the entire Webinar and meet the criteria for CPE you will receive
an email with the link to download your CPE certificate. The official email for CPE will be
issued via cpe@email.cpe.io and it is important to white list this address. It is from this email
that your CPE credit will be sent. There may be a processing fee to have your CPE credit
regenerated after the initial distribution.
• We cannot manually generate a CPE certificate as these are handled by our 3rd party provider.
We highly recommend that you work with your IT department to identify and correct any email
delivery issues prior to attending the Webinar. Issues would include blocks or spam filters in
your email system or a firewall that will redirect or not allow delivery of this email from
Gensend.io.
• You must opt in to our mailing list. If you indicate that you do not want to receive our emails,
your registration will be cancelled and you will not be able to attend the Webinar.
• We are not responsible for any connection, audio or other computer related issues. You must
have pop-ups enabled on your computer; otherwise you will not be able to answer the polling
questions, which occur approximately every 20 minutes. If you have any pressing issues to
attend to, we suggest that you do so immediately after a polling question.
The views expressed by the presenters do not necessarily represent the views,
positions, or opinions of AuditNet® LLC. These materials, and the oral presentation
accompanying them, are for educational purposes only and do not constitute
accounting or legal advice or create an accountant-client relationship.
While AuditNet® makes every effort to ensure information is accurate and complete,
AuditNet® makes no representations, guarantees, or warranties as to the accuracy or
completeness of the information provided via this presentation. AuditNet® specifically
disclaims all liability for any claims or damages that may result from the information
contained in this presentation, including any websites maintained by third parties and
linked to the AuditNet® website.
Any mention of commercial products is for information only; it does not imply
recommendation or endorsement by AuditNet® LLC.
Michael Kano
Data Analytics Consultant, Arbutus Analytics
Michael has 25 years of experience in data analytics and internal audit with organizations in the USA, Canada,
and the Middle East.
From 2015 to 2019, he was a senior member of the data analytics practice at Focal Point Data Risk, a US-based
professional services firm.
Prior to Focal Point, Michael led eBay, Inc.’s data analytics program in the Internal Audit department. He was
tasked with integrating data analytics into the audit workflow on strategic and tactical levels. This included
developing quality and documentation standards, training users, and providing analytics support on numerous
audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such
as the Business Ethics Office and Enterprise Risk Management teams.
During his years at eBay, Michael supported audits throughout the organization in the IT, compliance,
operations, vendor management, revenue assurance, T&E, and human resources areas. Michael's software
experience includes Arbutus Analyzer, ACL Desktop/Direct Link, Alteryx, Microsoft Access, SQL, and Tableau.
He led ACL Services Ltd.’s global training team for 8 years.
He is a graduate of the UCLA Anderson School of Management.
AGENDA
 What are outliers?
 Why are they important?
 Traditional outlier identification
 Advanced statistical methods
 Q&A
What are outliers?
 Transactions or events that are significantly different from the
rest of the population.
 Usually in terms of materiality, but can also include date/age
data
 Unexpected values
Example: Cable Bill
MONTH   AMOUNT   MOVING AVERAGE
JAN      98.76    98.76
FEB     101.33   100.05
MAR      95.49    98.53
APR     100.08    98.92
MAY      99.71    99.07
JUN     135.89   105.21
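The averages in the cable bill example are consistent with a running average of every month to date (January alone, then January and February, and so on). The sketch below reproduces them in Python. The function name `running_averages` is made up for illustration, and rounding half-up to two decimals is an assumption about how the slide figures were produced.

```python
from decimal import Decimal, ROUND_HALF_UP

# Monthly cable bill amounts taken from the slide.
bills = [("JAN", "98.76"), ("FEB", "101.33"), ("MAR", "95.49"),
         ("APR", "100.08"), ("MAY", "99.71"), ("JUN", "135.89")]

def running_averages(amounts):
    """Return the average of all amounts seen so far, one value per period."""
    averages = []
    total = Decimal("0")
    for count, amount in enumerate(amounts, start=1):
        total += amount
        # Assumption: figures are rounded half-up to two decimals.
        averages.append((total / count).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
    return averages

amounts = [Decimal(text) for _, text in bills]
averages = running_averages(amounts)
for (month, _), amount, average in zip(bills, amounts, averages):
    print(f"{month}  {amount:>7}  {average:>7}")
```

June's bill is about 36% above the prior average of 99.07, yet it moves the running average by only about 6, which is why the outlier is easier to spot against the average than within it.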
Why are they important?
 Can distort population profile
 Materiality of a few large questionable transactions can lead to
misstatements in financials
 May indicate unexpected behavior or errors
Where should I look for outliers?
 GL journal entries
 T&E claims
 Vendor invoices
 Date/time gaps
 Interest/FX rates
 Payroll
Traditional Outlier Selection
 Transactions > X
   All transactions greater than a fixed amount
 Top X transactions
   Top 10 by amount
 Top X% transactions
   Largest transactions accounting for 20% of total
Fixed Amount Threshold
 Based on past experience or current bandwidth
 Not related to current economic conditions
 Can result in too many or too few results
 Filter: Amount >= X
Fixed Threshold in a Growing Business
[Chart: # of Transactions (y-axis, 0 to 25,000) by Period (x-axis, 1 to 5)]
Leveraging Arbutus Commands
 Statistics
 Stratify
 Classify
 Summarize
Statistics
 Provides information on numeric and date fields
 Top X amounts indicated in output
 Quickly ID wide gaps between Amount values
 Values stored in variables
 Filter: Amount >= HIGH1
 Arbitrary cutoff
Identifying Top X%
 Sort by Amount in descending order
 Run Total command and copy from Log
 Create computed field for each record's % of total
 Use script for cumulative %
 Identify cutoff record
 Again, arbitrary—may get too many/few records
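The Top X% steps above can be sketched outside Arbutus as follows. This is a plain Python illustration, not Arbutus script syntax; the function name `top_share_cutoff` and the sample amounts are hypothetical.

```python
def top_share_cutoff(amounts, share=0.20):
    """Return the largest amounts whose cumulative total first reaches `share` of the grand total."""
    ordered = sorted(amounts, reverse=True)   # sort by amount, descending
    grand_total = sum(ordered)                # equivalent of the Total step
    selected = []
    running = 0
    for amount in ordered:
        selected.append(amount)
        running += amount                     # cumulative total so far
        if running >= share * grand_total:    # cutoff record reached
            break
    return selected

# Hypothetical transaction amounts totalling 10,000.
sample = [5000, 1200, 900, 800, 700, 600, 400, 250, 100, 50]
print(top_share_cutoff(sample, share=0.20))
print(top_share_cutoff(sample, share=0.60))
```

In this sample a single transaction already covers 20% of the total, while a 60% share needs only two. That illustrates the point about the cutoff being arbitrary: the number of records returned depends heavily on how concentrated the population is.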
Stratify
 Default divides range into equal bands
 Provides count and total for each band
 Quick way to identify some outliers
 Drill-down capability from results table
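As a rough illustration of equal-band stratification, the sketch below divides the range between the minimum and maximum into equal-width bands and reports a count and total for each. It is a Python approximation of the idea, not the Stratify command itself. The function name `stratify` and the choice of five bands are assumptions, and the amounts are the seven values used in the MAD example later in the deck.

```python
def stratify(amounts, bands=5):
    """Split the min-to-max range into equal-width bands.

    Returns one (low, high, count, total) tuple per band.
    """
    low, high = min(amounts), max(amounts)
    width = (high - low) / bands
    results = [[low + i * width, low + (i + 1) * width, 0, 0] for i in range(bands)]
    for amount in amounts:
        # The maximum value is placed in the last band rather than a band of its own.
        index = min(int((amount - low) / width), bands - 1) if width else 0
        results[index][2] += 1
        results[index][3] += amount
    return [tuple(row) for row in results]

amounts = [10_000, 512, 497, 230, 226, 176, 99]
strata = stratify(amounts)
for band_low, band_high, count, total in strata:
    print(f"{band_low:>10.2f} to {band_high:>10.2f}  count={count}  total={total}")
```

Six of the seven records land in the first band and one lands alone in the last, with three empty bands between them. That kind of gap is what makes equal bands a quick way to spot some outliers.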
Stratify Results
Statistical Methods
 Based on entire current population
 Dynamic
 Possible with advances in computing power
 Key methods:
   Standard deviations from the mean
   Median Absolute Deviation (MAD)
   Logarithm of value + standard deviations
Standard Deviations from the Mean
 Assumes a normal distribution
 Height, weight, etc…
 Analysts usually look for values more than 2 standard deviations from the mean
 Easily distorted by a small number of very large transactions if
non-normal
Normal Distribution
[Chart: normal distribution curve; x-axis: Values, y-axis: # of Instances; Mean/Median marked at the center]
Standard Deviation
 Measure of dispersion around the mean
 Higher standard deviation value indicates greater spread
 SD is based on the square of the distances from the mean
 In a normal distribution, the percentage of values in each SD band is constant
 Two populations may have the same mean but different SD
value
 One very large value can throw off SD
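A small sketch of the last point, using Python's `statistics` module. The amounts are borrowed from the MAD example later in the deck. Using the population standard deviation (`pstdev`) is an assumption made for illustration; the sample version would show the same effect.

```python
import statistics

typical = [512, 497, 230, 226, 176, 99]   # amounts without the largest value
with_large = typical + [10_000]           # same amounts plus one very large value

for label, values in (("without 10,000", typical), ("with 10,000", with_large)):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)        # assumption: population SD
    print(f"{label}: mean={mean:,.1f}  SD={sd:,.1f}")

ratio = statistics.pstdev(with_large) / statistics.pstdev(typical)
print(f"SD grows by a factor of about {ratio:.0f}")
```

Because the distances are squared, the single large value dominates the calculation. Both the mean and the SD are pulled toward it, which raises any "mean + 2 SD" threshold.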
Normal Distribution: Female Height
[Chart: normal curve of female height; x-axis: Height, y-axis: # of Instances]
Mean: 165   SD: 7.1 cm
-2 SD: 151   -1 SD: 158   +1 SD: 172   +2 SD: 179
Normal Distribution of Values
SD Range                   % of Values
Greater than +3 SD          0.1%
Between +2 and +3 SD        2.1%
Between +1 and +2 SD       13.6%
Between Mean and +1 SD     34.1%
Between Mean and -1 SD     34.1%
Between -1 and -2 SD       13.6%
Between -2 and -3 SD        2.1%
Less than -3 SD             0.1%
Same Mean, Different SD
[Chart: two distributions with the same mean and different SD; x-axis: Values, y-axis: # of Instances]
Calculating Mean and Standard Deviation 1
 Run Statistics command including SD option
Calculating Mean and Standard Deviation 2
 Method 1: Create filter for Amount > 2 SD using variables:
Amount > AVERAGE1 + (2 * STDDEV1)
We can use the Count command to see how many records are outliers. In this case, they are 2.8% of the
population.
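The same filter logic can be sketched in Python to show what the expression computes. This is not Arbutus syntax. The function name `upper_outliers` and the sample data are hypothetical, and the population standard deviation is assumed, since the slides do not say which variant the Statistics command reports.

```python
import statistics

def upper_outliers(amounts, k=2.0):
    """Return (threshold, outliers) where threshold = mean + k * SD."""
    mean = statistics.mean(amounts)
    sd = statistics.pstdev(amounts)          # assumption: population SD
    threshold = mean + k * sd
    return threshold, [a for a in amounts if a > threshold]

# Hypothetical population: nineteen routine amounts and one large one.
sample = [100] * 19 + [1000]
threshold, outliers = upper_outliers(sample)
print(f"threshold={threshold:.1f}  outliers={outliers}")
print(f"share of population: {len(outliers) / len(sample):.1%}")
```

Note that this filter looks only at the upper tail. Unusually small or negative amounts would need a second condition on mean minus 2 SD.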
Calculating Mean and Standard Deviation 3
 Method 2: Calculate each record's distance from the mean, in standard deviations, in a script or via the
Command Line:
DEFINE FIELD SD_Value COMPUTED (Amount - %AVERAGE1%) / %STDDEV1%
 Stratify on SD_Value
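A minimal Python sketch of Method 2: compute each record's number of standard deviations from the mean, then count records per SD band, as stratifying on SD_Value would. The helper names `z_scores` and `sd_bucket` are invented for this example, the population SD is an assumption, and the sample amounts are illustrative only.

```python
import statistics
from collections import Counter

def z_scores(amounts):
    """Distance of each amount from the mean, in population standard deviations."""
    mean = statistics.mean(amounts)
    sd = statistics.pstdev(amounts)
    return [(amount - mean) / sd for amount in amounts]

def sd_bucket(z):
    """Label a z-score with the SD band it falls in."""
    if z > 3:
        return "Greater than +3 SD"
    if z < -3:
        return "Less than -3 SD"
    sign = "+" if z >= 0 else "-"
    lower = min(int(abs(z)), 2)               # 0, 1 or 2
    start = "Mean" if lower == 0 else f"{sign}{lower}"
    return f"Between {start} and {sign}{lower + 1} SD"

amounts = [10_000, 512, 497, 230, 226, 176, 99]   # illustrative sample
counts = Counter(sd_bucket(z) for z in z_scores(amounts))
for label, count in counts.items():
    print(f"{label}: {count / len(amounts):.1%}")
```

Comparing these band percentages with the normal-distribution percentages is a quick check of whether the 2 SD rule is likely to behave as expected on the data at hand.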
Current Data Compared to Normal Distribution
SD Bucket                  Current   Normal
Greater than +3 SD           0.4%     0.1%
Between +2 and +3 SD         2.4%     2.1%
Between +1 and +2 SD        11.2%    13.6%
Between Mean and +1 SD      36.0%    34.1%
Between Mean and -1 SD      35.7%    34.1%
Between -1 and -2 SD        11.4%    13.6%
Between -2 and -3 SD         2.6%     2.1%
Less than -3 SD              0.4%     0.1%
Identifying Outliers by Category 1
 Example: Outliers by vendor
 Requires mean and SD by vendor for each vendor's 2 SD
threshold value
 Use Summarize command to calculate the values in a new
table
 Join transactions file to new table to populate with each
vendor's 2 SD threshold
 Filter for transaction values > 2 SD for each vendor
Identifying Outliers by Category 2
 Summarize by Vendor
 Open "Fields to process" dialog
 Select "Amount" twice
 Change Type to AVG and STDDEV
Identifying Outliers by Category 3
 Output file has mean and SD by vendor
 Create computed field for 2 SD threshold
AVG_Amount + (2 * STDDEV_Amount)
Identifying Outliers by Category 4
 Open transaction file
 Join to vendor threshold file and add threshold field
 Filter for Amount > Vendor_Threshold
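The summarize, join and filter sequence for per-vendor outliers can be sketched in Python as below. It mirrors the logic only, not the Arbutus commands. The vendor IDs, amounts and function names are hypothetical, and the population SD is assumed so that a vendor with a single transaction still gets a threshold.

```python
import statistics
from collections import defaultdict

# Hypothetical (vendor, amount) transaction records.
transactions = [("V001", 100)] * 9 + [("V001", 1000)] + [
    ("V002", 5000), ("V002", 5200), ("V002", 4800), ("V002", 5100), ("V002", 4900),
]

def vendor_thresholds(records, k=2.0):
    """Summarize step: mean + k * SD of the amounts, per vendor."""
    by_vendor = defaultdict(list)
    for vendor, amount in records:
        by_vendor[vendor].append(amount)
    return {vendor: statistics.mean(values) + k * statistics.pstdev(values)
            for vendor, values in by_vendor.items()}

def vendor_outliers(records, k=2.0):
    """Join and filter steps: keep records above their own vendor's threshold."""
    thresholds = vendor_thresholds(records, k)
    return [(vendor, amount) for vendor, amount in records if amount > thresholds[vendor]]

thresholds = vendor_thresholds(transactions)
flagged = vendor_outliers(transactions)
print(thresholds)
print(flagged)
```

The 1,000 payment to V001 is flagged even though every V002 payment is larger, which is the reason for computing the threshold by category rather than across the whole population.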
What if my data is not normally distributed?
 Normal distribution methods not appropriate—too many or
too few records captured
[Chart: distribution of a non-normal population; y-axis: 0% to 40%]
Median Absolute Deviation (MAD)
 Statistical method that reduces the distorting effect of very large
transactions on the analysis
 Relies on the median rather than the mean
 Median: the middle value; half of the values lie above it and half below
 Absolute Deviation: Absolute distance from the median
 No squaring of the distance; extreme values have less impact
Median Absolute Deviation Calculation
1. Calculate median of population amount
2. Calculate absolute distance from median of each amount
3. Calculate median of absolute distances
4. Divide AD by new median for number of deviations
Amount      AD      MAD     SD
10,000    9,770    74.6     2.4
   512      282     2.2    (0.3)
   497      267     2.0    (0.3)
   230        0     0      (0.4)
   226        4     0.0    (0.4)
   176       54     0.4    (0.4)
    99      131     1.0    (0.5)
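The four calculation steps can be checked with a short Python sketch using the seven amounts from the slide. The function name `mad_scores` is hypothetical. No scaling constant is applied, which matches the slide's figures; some statistical references multiply the MAD by a constant, and that is outside what the slide shows.

```python
import statistics

amounts = [10_000, 512, 497, 230, 226, 176, 99]

def mad_scores(values):
    """Return each value's absolute deviation from the median, in units of the MAD."""
    median = statistics.median(values)                 # step 1
    abs_dev = [abs(v - median) for v in values]        # step 2
    mad = statistics.median(abs_dev)                   # step 3
    if mad == 0:
        raise ValueError("MAD is zero; more than half the values equal the median")
    return [ad / mad for ad in abs_dev]                # step 4

scores = mad_scores(amounts)
for amount, score in zip(amounts, scores):
    print(f"{amount:>7,}  {score:>5.1f}")
```

The 10,000 transaction sits about 74.6 deviations from the median under this measure, compared with 2.4 standard deviations from the mean, so it stands out far more clearly.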
Alternative Method: Use Logarithm of Amount
 Calculate logarithm of amount by using Log() function
 "Compresses" values and distribution is closer to normal
 Use on positive amounts (non-signed)
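A minimal sketch of the logarithm approach in Python: take the log of each positive amount, then measure standard deviations from the mean of the logged values. The function name `log_z_scores` and the sample amounts are illustrative. Base 10 is an arbitrary choice here, since the resulting number of standard deviations is the same for any base, and the population SD is an assumption.

```python
import math
import statistics

def log_z_scores(amounts):
    """Standard deviations from the mean, computed on log10 of each amount."""
    if any(amount <= 0 for amount in amounts):
        raise ValueError("the logarithm is only defined for positive amounts")
    logs = [math.log10(amount) for amount in amounts]
    mean = statistics.mean(logs)
    sd = statistics.pstdev(logs)             # assumption: population SD
    return [(value - mean) / sd for value in logs]

amounts = [10_000, 512, 497, 230, 226, 176, 99]   # illustrative sample
scores = log_z_scores(amounts)
for amount, score in zip(amounts, scores):
    print(f"{amount:>7,}  {score:>6.2f}")
```

After the transformation the smaller amounts spread out across more SD bands instead of clustering just below the mean, which is the "compression" effect described above.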
Comparison of Distributions
SD                             Amount     Log    Normal
A - Greater than +3 SD           0.0%    0.0%     0.1%
B - Between +2 and +3 SD         0.0%    0.0%     2.1%
C - Between +1 and +2 SD         0.8%    4.0%    13.6%
D - Between Mean and +1 SD      46.4%   55.7%    34.1%
E - Between Mean and -1 SD      52.8%   23.2%    34.1%
F - Between -1 and -2 SD         0.0%   12.1%    13.6%
G - Between -2 and -3 SD         0.0%    4.0%     2.1%
H - Less than -3 SD              0.0%    1.0%     0.1%
Parting Thoughts
 Review outlier risks
 Determine if data is normally distributed
 Use method that takes into account current financial state
 Document reasoning behind choice of method
 Save best practice in a script
How to Normalize Addresses and Detect Hidden Duplicates
Any Questions?
Live Webinar – Q&A
Tracking Down Outliers
Data Quality Management- Practical tests
Michael Kano (ACDA)
Data Analyst Consultant, Arbutus
mkano@arbutussoftware.com | LinkedIn: Michael Kano
www.arbutusanalytics.com | Phone: (408) 887-4843
Click to read our latest article about Arbutus Analyzer -
Technical Insights. Author: Michael Kano
THANK YOU

Tracking down outliers

  • 1. Presented by Michael Kano, ACDA Data Analytics Consultant, Arbutus Analytics Tracking Down Outliers
  • 2. About Jim Kaplan, CIA, CFE  President and Founder of AuditNet®, the global resource for auditors (available on iOS, Android and Windows devices)  Auditor, Web Site Guru,  Internet for Auditors Pioneer  IIA Bradford Cadmus Memorial Award Recipient  Local Government Auditor’s Lifetime Award  Author of “The Auditor’s Guide to Internet Resources” 2nd Edition Page 2
  • 3. ABOUT AUDITNET® LLC • AuditNet®, the global resource for auditors, serves the global audit community as the primary resource for Web-based auditing content. As the first online audit portal, AuditNet® has been at the forefront of websites dedicated to promoting the use of audit technology. • Available on the Web, iPad, iPhone, Windows and Android devices and features: • Over 3,200 Reusable Templates, Audit Programs, Questionnaires, and Control Matrices – unlimited downloads for subscribers. • Webinars focusing on fraud, data analytics, IT audit, and internal audit with free CPE for subscribers and site license users. • Audit guides, manuals, and books on audit basics and using audit technology. • LinkedIn Networking Group with over 10K members. • Monthly Newsletters with Expert Guest Columnists. • Surveys on timely topics for internal auditors. Introductions Page 3
  • 4. HOUSEKEEPING This webinar and its material are the property of AuditNet® and its Webinar partners. Unauthorized usage or recording of this webinar or any of its material is strictly forbidden. • If you logged in with another individual’s confirmation email you will not receive CPE as the confirmation login is linked to a specific individual • This Webinar is not eligible for viewing in a group setting. You must be logged in with your unique join link. • We are recording the webinar and you will be provided access to that recording after the webinar. Downloading or otherwise duplicating the webinar recording is expressly prohibited. • If you meet the criteria for earning CPE, you will receive a link via email to download your certificate. The official email for CPE will be issued via [email protected] and it is important to white list this address. It is from this email that your CPE credit will be sent. There may be a processing fee to have your CPE credit regenerated if you did not receive the first mailing. • Submit questions via the chat box on your screen and we will answer them either during or at the conclusion. • You must answer the survey questions after the Webinar or before downloading your certificate.
  • 5. IMPORTANT INFORMATION REGARDING CPE! • ATTENDEES - If you attend the entire Webinar and meet the criteria for CPE you will receive an email with the link to download your CPE certificate. The official email for CPE will be issued via [email protected] and it is important to white list this address. It is from this email that your CPE credit will be sent. There may be a processing fee to have your CPE credit regenerated after the initial distribution. • We cannot manually generate a CPE certificate as these are handled by our 3rd party provider. We highly recommend that you work with your IT department to identify and correct any email delivery issues prior to attending the Webinar. Issues would include blocks or spam filters in your email system or a firewall that will redirect or not allow delivery of this email from Gensend.io • You must opt-in for our mailing list. If you indicate, you do not want to receive our emails your registration will be cancelled, and you will not be able to attend the Webinar. • We are not responsible for any connection, audio or other computer related issues. You must have pop-ups enabled on you computer otherwise you will not be able to answer the polling questions which occur approximately every 20 minutes. We suggest that if you have any pressing issues to see to that you do so immediately after a polling question.
  • 6. The views expressed by the presenters do not necessarily represent the views, positions, or opinions of AuditNet® LLC. These materials, and the oral presentation accompanying them, are for educational purposes only and do not constitute accounting or legal advice or create an accountant-client relationship. While AuditNet® makes every effort to ensure information is accurate and complete, AuditNet® makes no representations, guarantees, or warranties as to the accuracy or completeness of the information provided via this presentation. AuditNet® specifically disclaims all liability for any claims or damages that may result from the information contained in this presentation, including any websites maintained by third parties and linked to the AuditNet® website. Any mention of commercial products is for information only; it does not imply recommendation or endorsement by AuditNet® LLC
  • 7. Michael Kano Data Analytics Consultant, Arbutus Analytics Michael has 25 years of experience in data analytics and internal audit with organizations in the USA, Canada, and the Middle East. From 2015 to 2019, he was a senior member of the data analytics practice at Focal Point Data Risk, a US-based professional services firm. Prior to Focal Point, Michael led eBay, Inc.’s data analytics program in the Internal Audit department. He was tasked with integrating data analytics into the audit workflow on strategic and tactical levels. This included developing quality and documentation standards, training users, and providing analytics support on numerous audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such as the Business Ethics Office and Enterprise Risk Management teams. During his years at eBay, Michael supported audits throughout the organization in the IT, compliance, operations, vendor management, revenue assurance, T&E, and human resources areas. Michael's software experience includes Arbutus Analyzer, ACL Desktop/Direct Link, Alteryx, Microsoft Access, SQL, andTableau. He led ACL Services Ltd.’s global training team for 8 years. He is a graduate of the UCLA Anderson School of Management.
  • 8. AGENDA  What are outliers?  Why are they important?  Traditional outlier identification  Advanced statistical methods  Q&A
  • 9. What are outliers?  Transactions or events that are significantly different from the rest of the population.  Usually in terms of materiality, but can also include date/age data  Unexpected values
  • 10. Example: Cable Bill MONTH AMOUNT JAN 98.76 FEB 101.33 MAR 95.49 APR 100.08 MAY 99.71 JUN 135.89 MOVING AVERAGE 98.76 100.05 98.53 98.92 99.07 105.21
  • 11. Why are they important?  Can distort population profile  Materiality of a few large questionable transactions can lead to misstatements in financials  May indicate unexpected behavior or errors
  • 12. Where should I look for outliers?  GL journal entries  T&E claims  Vendor invoices  Date/time gaps  Interest/FX rates  Payroll
  • 13. Traditional Outlier Selection  Transactions > X  All transactions greater than a fixed amount  Top X transactions  Top 10 by amount  Top X% transactions  Largest transactions accounting for 20% of total
  • 14. Fixed Amount Threshold  Based on past experience or current bandwidth  Not related to current economic conditions  Can result in too many or too few results  Filter: Amount >= X
  • 15. Fixed Threshold in a Growing Business - 5,000 10,000 15,000 20,000 25,000 1 2 3 4 5 #ofTransactions Period
  • 16. Leveraging Arbutus Commands  Statistics  Stratify  Classify  Summarize
  • 17. Statistics  Provides information on numeric and date fields  Top X amounts indicated in output  Quickly ID wide gaps between Amount values  Values stored in variables  Filter: Amount >= HIGH1  Arbitrary cutoff
  • 18. Identifying Top X%  Sort by Amount in descending order  Run Total command and copy from Log  Create computed field for each record's % of total  Use script for cumulative %  Identify cutoff record  Again, arbitrary—may get too many/few records
  • 19. Stratify  Default divides range into equal bands  Provides count and total for each band  Quick way to identify some outliers  Drill-down capability from results table
  • 21. Statistical Methods  Based on entire current population  Dynamic  Possible with advances in computing power  Key methods:  Standard deviations from the mean  Median Absolute Deviation (MAD)  Logarithm of value + standard deviations
  • 22. Standard Deviations from the Mean  Assumes a normal distribution  Height, weight, etc…  Analysts usually look for values > 2 standard deviations  Easily distorted by a small number of very large transactions if non-normal
  • 24. Standard Deviation  Measure of dispersal around the mean  Higher standard deviation value indicates greater spread  SD is based on the square of the distances from the mean  % distribution of values in normal distributions is constant  Two populations may have the same mean but different SD value  One very large value can throw off SD
  • 25. Normal Distribution: Female Height  [Chart: # of instances by height]  Mean: 165 cm  SD: 7.1 cm  -1 SD to +1 SD: 158 to 172 cm  -2 SD to +2 SD: 151 to 179 cm
  • 26. Normal Distribution of Values (SD range: % of values)  Greater than +3 SD: 0.1%  Between +2 and +3 SD: 2.1%  Between +1 and +2 SD: 13.6%  Between Mean and +1 SD: 34.1%  Between Mean and -1 SD: 34.1%  Between -1 and -2 SD: 13.6%  Between -2 and -3 SD: 2.1%  Less than -3 SD: 0.1%
  • 27. Same Mean, Different SD  [Chart: # of instances by value]
  • 28. Calculating Mean and Standard Deviation 1  Run Statistics command including SD option
  • 29. Calculating Mean and Standard Deviation 2  Method 1: Create filter for Amount > 2 SD using variables: Amount > AVERAGE1 + (2 * STDDEV1) We can use the Count command to see how many records are outliers. In this case, they are 2.8% of the population.
  • 30. Calculating Mean and Standard Deviation 3  Method 2: Calculate every record's SD in a script or via the Command Line: DEFINE FIELD SD_Value COMPUTED (Amount - %AVERAGE1%)/ %STDDEV1%
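Both methods on slides 29 and 30 can be illustrated in a few lines of Python. This is a sketch, not Arbutus syntax. It assumes a population standard deviation (`pstdev`), which may or may not match what the Statistics command reports, and it reuses the seven sample amounts from the MAD slide.

```python
from statistics import mean, pstdev

amounts = [10_000, 512, 497, 230, 226, 176, 99]

average = mean(amounts)   # plays the role of AVERAGE1
std_dev = pstdev(amounts) # plays the role of STDDEV1 (population SD assumed)

# Method 1: filter for Amount > AVERAGE1 + (2 * STDDEV1)
threshold = average + 2 * std_dev
outliers = [a for a in amounts if a > threshold]

# Method 2: each record's distance from the mean, in standard deviations
sd_values = [(a - average) / std_dev for a in amounts]

print(f"threshold: {threshold:.2f}")
print("outliers:", outliers)
print("SD values:", [round(z, 1) for z in sd_values])
```

Method 2 keeps the SD value on every record, so the cutoff can be changed later without recalculating.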
  • 32. Current Data Compared to Normal Distribution (SD bucket: current / normal)  Greater than +3 SD: 0.4% / 0.1%  Between +2 and +3 SD: 2.4% / 2.1%  Between +1 and +2 SD: 11.2% / 13.6%  Between Mean and +1 SD: 36.0% / 34.1%  Between Mean and -1 SD: 35.7% / 34.1%  Between -1 and -2 SD: 11.4% / 13.6%  Between -2 and -3 SD: 2.6% / 2.1%  Less than -3 SD: 0.4% / 0.1%
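A comparison like the one on slide 32 needs each record's SD value assigned to a bucket and the share of records counted per bucket. The Python sketch below shows one way to do that. The boundary handling (which bucket a value of exactly 1.0 falls into) and the names `sd_bucket` and `bucket_shares` are assumptions, and the sample SD values are made up for illustration.

```python
from collections import Counter

LABELS = [
    "Greater than +3 SD", "Between +2 and +3 SD", "Between +1 and +2 SD",
    "Between Mean and +1 SD", "Between Mean and -1 SD",
    "Between -1 and -2 SD", "Between -2 and -3 SD", "Less than -3 SD",
]


def sd_bucket(z):
    """Map a record's SD value to one of the eight buckets."""
    if z > 3:
        return LABELS[0]
    if z > 2:
        return LABELS[1]
    if z > 1:
        return LABELS[2]
    if z >= 0:
        return LABELS[3]
    if z >= -1:
        return LABELS[4]
    if z >= -2:
        return LABELS[5]
    if z >= -3:
        return LABELS[6]
    return LABELS[7]


def bucket_shares(sd_values):
    """Percentage of records in each bucket, in the slide's order."""
    counts = Counter(sd_bucket(z) for z in sd_values)
    return {label: 100 * counts[label] / len(sd_values) for label in LABELS}


sample = [3.5, 2.5, 0.5, 0.2, -0.5, -0.7, -1.5, -3.2]  # hypothetical SD values
shares = bucket_shares(sample)
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```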
  • 33. Identifying Outliers by Category 1  Example: Outliers by vendor  Requires mean and SD by vendor for each vendor's 2 SD threshold value  Use Summarize command to calculate the values in a new table  Join transactions file to new table to populate with each vendor's 2 SD threshold  Filter for transaction values > 2 SD for each vendor
  • 34. Identifying Outliers by Category 2  Summarize by Vendor  Open "Fields to process" dialog  Select "Amount" twice  Change Type to AVG and STDDEV
  • 35. Identifying Outliers by Category 3  Output file has mean and SD by vendor  Create computed field for 2 SD threshold AVG_Amount + (2 * STDDEV_Amount)
  • 36. Identifying Outliers by Category 4  Open transaction file  Join to vendor threshold file and add threshold field  Filter for Amount > Vendor_Threshold
  • 37. What if my data is not normally distributed?  Normal distribution methods not appropriate: too many or too few records captured  [Chart: percentage distribution, 0% to 40%]
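The summarize, join, and filter sequence on slides 33 to 36 can be sketched in Python as follows. This is an illustration of the logic, not of the Arbutus commands. The vendor IDs and amounts are hypothetical, and a population standard deviation is assumed.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical transactions: (vendor, amount)
transactions = (
    [("V001", 100.0)] * 9 + [("V001", 500.0)]
    + [("V002", a) for a in (1000.0, 1100.0, 900.0, 1050.0, 950.0)]
)

# Summarize by vendor: collect each vendor's amounts
by_vendor = defaultdict(list)
for vendor, amount in transactions:
    by_vendor[vendor].append(amount)

# Computed field: AVG_Amount + (2 * STDDEV_Amount) per vendor
thresholds = {
    vendor: mean(vals) + 2 * pstdev(vals)
    for vendor, vals in by_vendor.items()
}

# Join each transaction to its vendor's threshold and filter
flagged = [(v, a) for v, a in transactions if a > thresholds[v]]

print(thresholds)
print(flagged)
```

The 500 payment to V001 is flagged even though it is smaller than every V002 payment, which is the point of setting the threshold by category.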
  • 38. Median Absolute Deviation (MAD)  Statistical method that reduces distortions of very large transactions on analysis  Relies on the median rather than the mean  Median: 50% of values are above the median value  Absolute Deviation: Absolute distance from the median  No squaring of the distance; extreme values have less impact
  • 39. Median Absolute Deviation Calculation  1. Calculate median of population amount  2. Calculate absolute distance (AD) from median of each amount  3. Calculate median of absolute distances  4. Divide AD by new median for number of deviations  Example (Amount / AD / MAD / SD): 10,000 / 9,770 / 74.6 / 2.4; 512 / 282 / 2.2 / (0.3); 497 / 267 / 2.0 / (0.3); 230 / 0 / 0 / (0.4); 226 / 4 / 0.0 / (0.4); 176 / 54 / 0.4 / (0.4); 99 / 131 / 1.0 / (0.5)
  • 40. Alternative Method: Use Logarithm of Amount  Calculate logarithm of amount by using Log() function  "Compresses" values and distribution is closer to normal  Use on positive amounts (non-signed)
  • 41. Comparison of Distributions (SD bucket: Amount / Log / Normal)  A - Greater than +3 SD: 0.0% / 0.0% / 0.1%  B - Between +2 and +3 SD: 0.0% / 0.0% / 2.1%  C - Between +1 and +2 SD: 0.8% / 4.0% / 13.6%  D - Between Mean and +1 SD: 46.4% / 55.7% / 34.1%  E - Between Mean and -1 SD: 52.8% / 23.2% / 34.1%  F - Between -1 and -2 SD: 0.0% / 12.1% / 13.6%  G - Between -2 and -3 SD: 0.0% / 4.0% / 2.1%  H - Less than -3 SD: 0.0% / 1.0% / 0.1%
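The four MAD steps on slide 39 can be checked with a short Python sketch using the slide's own seven amounts. Variable names are illustrative. Some textbook definitions multiply the median of absolute distances by a scaling constant; the slide does not, so none is applied here.

```python
from statistics import median

amounts = [10_000, 512, 497, 230, 226, 176, 99]

# 1. Median of the population amounts
med = median(amounts)

# 2. Absolute distance (AD) of each amount from the median
abs_dev = [abs(a - med) for a in amounts]

# 3. Median of the absolute distances
mad = median(abs_dev)
if mad == 0:
    raise ValueError("median absolute deviation is zero; cannot divide")

# 4. Number of deviations for each amount
mad_scores = [d / mad for d in abs_dev]

for amount, dev, score in zip(amounts, abs_dev, mad_scores):
    print(f"{amount:>6} {dev:>6} {score:>5.1f}")
```

The 10,000 transaction scores about 74.6 deviations under MAD but only 2.4 under the standard deviation method, because the large value inflates the standard deviation it is measured against.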
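The logarithm approach on slide 40 can be illustrated in Python. This sketch takes the base-10 logarithm of each amount and then measures standard deviations on the log values. The base is an assumption; it does not affect the resulting SD values. The helper name `sd_values` and the sample amounts are also illustrative.

```python
import math
from statistics import mean, pstdev


def sd_values(values):
    """Distance of each value from the mean, in population SDs."""
    avg, sd = mean(values), pstdev(values)
    return [(v - avg) / sd for v in values]


amounts = [10_000, 512, 497, 230, 226, 176, 99]

# Logarithms are only defined for positive amounts
if any(a <= 0 for a in amounts):
    raise ValueError("log method requires positive amounts")

log_amounts = [math.log10(a) for a in amounts]

raw_z = sd_values(amounts)
log_z = sd_values(log_amounts)

print("raw:", [round(z, 2) for z in raw_z])
print("log:", [round(z, 2) for z in log_z])
```

On the log scale the largest amount sits closer to the mean and the smaller amounts spread out more, which is the "compression" the slide describes.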
  • 42. Parting Thoughts  Review outlier risks  Determine if data is normally distributed  Use method that takes into account current financial state  Document reasoning behind choice of method  Save best practice in a script
  • 43. Any Questions?  Live Webinar Q&A: Tracking Down Outliers  Michael Kano (ACDA), Data Analyst Consultant, Arbutus  LinkedIn: Michael Kano  www.arbutusanalytics.com  Phone: (408) 887-4843