6/24/2020
Key Presenter:
Michael Kano, ACDA
Data Analytics Consultant, Arbutus Analytics
When is a Duplicate not a Duplicate?
Detecting Errors and Fraud
About Jim Kaplan, CIA, CFE
• President and Founder of AuditNet®, the global resource for auditors (available on iOS, Android and Windows devices)
• Auditor, Web Site Guru
• Internet for Auditors Pioneer
• IIA Bradford Cadmus Memorial Award Recipient
• Local Government Auditor’s Lifetime Award
• Author of “The Auditor’s Guide to Internet Resources” 2nd Edition
About AuditNet® LLC
• AuditNet®, the global resource for auditors, serves the global audit
community as the primary resource for Web-based auditing content. As the first online
audit portal, AuditNet® has been at the forefront of websites dedicated to promoting the
use of audit technology.
• Available on the Web, iPad, iPhone, Windows and Android devices and
features:
• Over 3,100 Reusable Templates, Audit Programs, Questionnaires, and
Control Matrices
• Webinars focusing on fraud, data analytics, IT audit, and internal audit
with free CPE for subscribers and site license users.
• Audit guides, manuals, and books on audit basics and using audit
technology
• LinkedIn Networking Groups
• Monthly Newsletters with Expert Guest Columnists
• Surveys on timely topics for internal auditors
Introductions
HOUSEKEEPING
This webinar and its material are the property of AuditNet® and its Webinar partners. Unauthorized usage or recording of this webinar or any of its material is strictly forbidden.
• If you logged in with another individual’s confirmation email you will not receive CPE, as the confirmation login is linked to a specific individual.
• This Webinar is not eligible for viewing in a group setting. You must be logged in with your unique join link.
• We are recording the webinar and you will be provided access to that recording after the webinar. Downloading or otherwise duplicating the webinar recording is expressly prohibited.
• If you meet the criteria for earning CPE you will receive a confirmation link via email for your CPE. You must first complete the course evaluation and then you will receive the link to download your CPE. The official email for CPE will be sent from cpe@email.cpe.io and it is important to whitelist this address. There may be a processing fee to have your CPE credit regenerated if you did not receive the first mailing.
• CPE will only be sent to those who have opted in for the AuditNet mailing list.
• Submit questions via the chat box on your screen and we will answer them either during the webinar or at its conclusion.
• You must answer the survey questions after the Webinar or before downloading your certificate.
IMPORTANT INFORMATION
REGARDING CPE!
• ATTENDEES - If you opted in to the AuditNet mailing list and attended the entire Webinar, you will receive a confirmation email for the CPE certificate. You must complete the course evaluation in order to receive the download link for CPE. The official email for CPE will be issued via cpe@email.cpe.io and it is important to whitelist this address. There may be a processing fee to have your CPE credit regenerated after the initial distribution.
• We cannot manually generate a CPE certificate as these are handled by our 3rd party provider. We highly recommend that you work with your IT department to identify and correct any email delivery issues prior to attending the Webinar. Issues would include blocks or spam filters in your email system or a firewall that will redirect or not allow delivery of this email from Gensend.io.
• You must opt in for our mailing list. If you indicate that you do not want to receive our e-mails, you can attend the Webinar but will not receive CPE.
• We are not responsible for any connection, audio or other computer related issues. You must have pop-ups enabled on your computer, otherwise you will not be able to answer the polling questions, which occur approximately every 20 minutes. If you have any pressing issues to see to, we suggest that you do so immediately after a polling question.
The views expressed by the presenters do not necessarily represent the views, positions, or
opinions of AuditNet® LLC. These materials, and the oral presentation accompanying them,
are for educational purposes only and do not constitute accounting or legal advice or create an
accountant-client relationship.
While AuditNet® makes every effort to ensure information is accurate and complete,
AuditNet® makes no representations, guarantees, or warranties as to the accuracy or
completeness of the information provided via this presentation. AuditNet® specifically
disclaims all liability for any claims or damages that may result from the information contained
in this presentation, including any websites maintained by third parties and linked to the
AuditNet® website.
Any mention of commercial products is for information only; it does not imply recommendation
or endorsement by AuditNet® LLC
Michael Kano
Data Analytics Consultant, Arbutus Analytics
Michael has 25 years of experience in data analytics and internal audit with organizations in the USA, Canada,
and the Middle East.
From 2015 to 2019, he was a senior member of the data analytics practice at Focal Point Data Risk, a US-based
professional services firm.
Prior to Focal Point, Michael led eBay, Inc.’s data analytics program in the Internal Audit department. He was
tasked with integrating data analytics into the audit workflow on strategic and tactical levels. This included
developing quality and documentation standards, training users, and providing analytics support on numerous
audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such
as the Business Ethics Office and Enterprise Risk Management teams.
During his years at eBay, Michael supported audits throughout the organization in the IT, compliance,
operations, vendor management, revenue assurance, T&E, and human resources areas. Michael's software
experience includes Arbutus Analyzer, ACL Desktop/Direct Link, Alteryx, Microsoft Access, SQL, and Tableau.
He led ACL Services Ltd.’s global training team for 8 years.
He is a graduate of the UCLA Anderson School of Management.
Why fuzzy testing?
• Detect fraud
• Identify errors
• Reduce false positives
Detect Fraud
• Multiple billings
• Vendors with same address
• Counterparties on watch lists (OFAC, GSA)
Identify Errors
• Input
• Optical character recognition (OCR)
• Multiple databases not synchronized
• Fatigue
Reduce False Positives
• Less time spent on dead ends
• Increased efficiency
Automated Analytics
• Weak or absent controls require continuous auditing/monitoring.
• Scripted solutions allow for complex algorithms to be run against data to mitigate these risks.
• Automation of best practices ensures consistency and adds to efficiency.
• Powerful DA tools include functionality that makes sophisticated testing and detection possible.
Analytic Functionality
• Functions
  • Normalize()
  • SortNormalize()
  • Format()
  • Include()/Exclude()
  • Near()
  • Similar()
  • Difference()
• Commands
  • DUPLICATES with Different, Near, & Similar parameters
  • JOIN
Are these the same addresses?
Addr1: 2847 Congress Pkwy West / Addr2: Suite 201   vs.   Addr1: #201, 2847W Congress Parkway / Addr2: (blank)
Addr1: 125 Fifth Str. E   vs.   Addr1: 125 East 5th Street
Addr1: 707 Rooke Road   vs.   Addr1: 707 Rook Rd
Addr1: 3960 Monjah Circle   vs.   Addr1: 3960 Monja Circle
Normalizing Data
• Normalize( Vendor_Address, 'addr2.txt' )
  16023, 40th Way South → 16023 40TH WAY S
  #105, 1470 Boston Street → 105 1470 BOSTON ST
• SortNormalize( Vendor_Address, 'addr2.txt' )
  16023, 40th Way South → WAY S 40TH 16023
  #105, 1470 Boston Street → ST BOSTON 1470 105
Checking for Matching or Close Addresses
Original:        205 E. 10th St   |   205 10th Street East
Normalized:      205 E 10TH ST    |   205 10TH ST E
SortNormalized:  ST E 205 10TH    |   ST E 205 10TH
Matched!
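The slide examples can be reproduced with a small Python sketch. This imitates the output shown on the slides and is not Arbutus's implementation: the substitution table below is a hypothetical stand-in for a file such as 'addr2.txt', and the descending word order is inferred from the SortNormalize examples.

```python
import re

# Hypothetical substitution table standing in for a file such as 'addr2.txt'.
ADDRESS_SUBSTITUTIONS = {
    "STREET": "ST",
    "STR": "ST",
    "ROAD": "RD",
    "PARKWAY": "PKWY",
    "NORTH": "N",
    "SOUTH": "S",
    "EAST": "E",
    "WEST": "W",
}


def normalize(text, substitutions=ADDRESS_SUBSTITUTIONS):
    """Uppercase, replace punctuation with spaces, and apply whole-word substitutions."""
    words = re.sub(r"[^A-Z0-9 ]", " ", text.upper()).split()
    return " ".join(substitutions.get(word, word) for word in words)


def sort_normalize(text, substitutions=ADDRESS_SUBSTITUTIONS):
    """Normalize, then sort the words in descending order (order inferred from the slides)."""
    words = normalize(text, substitutions).split()
    return " ".join(sorted(words, reverse=True))


print(normalize("205 E. 10th St"))             # 205 E 10TH ST
print(sort_normalize("205 E. 10th St"))        # ST E 205 10TH
print(sort_normalize("205 10th Street East"))  # ST E 205 10TH
```

Sorting the words removes the effect of word order, which is why the two spellings of the address end up identical.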
Elizabeth or Rick by any other name?
Elizabeth: BESS, BESSIE, BET, BETH, BETSY, BETTE, BETTY, ELISE, ELSA, LIZZIE, LIB, LIBBY, LIDDY, LILIBET, LISBETH, LISSIE, LIZ, LIZA, LIZBETH, LIZZY
Rick: DICK, DICKIE, BRODERICK, CEDRIC, DERRICK, ERIC, RICH, RICHARD, RICHIE, RICKY
Normalizing Data
Normalize( First, 'female name substitution table.sub,male name substitution table.sub' )
JOHANN → JOHN
JOHNNY → JOHN
JON → JOHN
JONATHAN → JOHN
JENNIE → JEN
JENNY → JEN
JENNIFER → JEN
JENN → JEN
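The same idea for first names, as a minimal Python sketch. The dictionary is a hypothetical excerpt built only from the substitutions on this slide; a real substitution table would be much larger, and the function name is illustrative.

```python
# Hypothetical excerpt of a name substitution table, taken from the slide's examples.
NAME_SUBSTITUTIONS = {
    "JOHANN": "JOHN",
    "JOHNNY": "JOHN",
    "JON": "JOHN",
    "JONATHAN": "JOHN",
    "JENNIE": "JEN",
    "JENNY": "JEN",
    "JENNIFER": "JEN",
    "JENN": "JEN",
}


def normalize_first_name(name, substitutions=NAME_SUBSTITUTIONS):
    """Trim and uppercase the name, then map known variants to one canonical form."""
    key = name.strip().upper()
    return substitutions.get(key, key)


print(normalize_first_name("Jonathan"))  # JOHN
print(normalize_first_name(" jenny "))   # JEN
```

Names that are not in the table pass through unchanged (apart from trimming and uppercasing), so the mapping only ever merges known variants.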
Quick Lesson: A Usable Fuzzy Algorithm
The value is the number of character differences between the two strings:
• ‘Rob’ COMPARED TO ‘Robert’ = 3
• ‘Gary’ COMPARED TO ‘Mary’ = 1
• ‘Gary’ COMPARED TO ‘Gray’ = 1
• ‘123 Main Street’ COMPARED TO ‘123 Main St’ = 4
• In Arbutus, this measure is used by the NEAR, SIMILAR, and DIFFERENCE functions/parameters.
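The slide does not name the algorithm. The four values are consistent with an edit distance that counts insertions, deletions, substitutions, and swaps of two adjacent characters (the ‘Gary’/‘Gray’ = 1 case needs the swap). A minimal Python sketch under that assumption, not Arbutus's implementation:

```python
def edit_distance(a, b):
    """Count the edits (insert, delete, substitute, adjacent swap) needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ar" <-> "ra".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("Rob", "Robert"))                  # 3
print(edit_distance("Gary", "Mary"))                   # 1
print(edit_distance("Gary", "Gray"))                   # 1
print(edit_distance("123 Main Street", "123 Main St")) # 4
```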
When to use Near and when to use Similar
• NEAR
  • Character fields: straight string/character data comparison
  • Numeric fields: looks for numeric proximity
  • Date fields: looks for date/time proximity
• SIMILAR
  • Character fields: pre-modifies data for visually similar characteristics before doing string comparison
  • Numeric fields: converts to character data before processing
  • Date fields: converts to character data before processing
Today’s Tests
1) Duplicate Payments (Identical)
2) Duplicate Payments with Near Dates
3) Duplicate Payments with Near Amounts and Dates
4) Duplicate Payments Similar Invoice Numbers
5) Duplicate Vendor Addresses
6) Duplicate/Similar Vendor-OFAC Addresses
7) Duplicate/Similar Vendor-OFAC Addresses (Word Match %)
8) Similar Vendor Phone Numbers
9) Similar Employee Names: HR vs PCard
Test #1: Duplicate Payments (Identical)
• Same Date
• Same Vendor
• Same Amount
• Same Product Number
• Same Invoice Number
Run the DUPLICATES command, selecting these fields in the “Field(s) to
test for Duplicates”
Select “All Fields” from the “List fields” list
Result: No identical payments
Test #2: Duplicate Payments with Near Dates
• Same Vendor
• Same Amount
• Same Product Number
• Transaction dates within 5 days of each other (no exacts)
Run the DUPLICATES command, selecting these fields in this order in the
“Field(s) to test for Duplicates”
Change the “Last duplicate field is” parameter to “Near” and change the
value to 5.
Select Transaction Number and Invoice Number from the “List fields” list
Result: 3 pairs of possible duplicates
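Outside Arbutus, the logic of this test can be sketched in Python. This is an illustration of the test's criteria, not how the DUPLICATES command works internally; the record layout and field names ("txn", "vendor", "amount", "product", "date") are hypothetical, and treating "within 5 days" as 1 to 5 days inclusive is an assumption.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def near_date_duplicates(payments, max_days=5):
    """Pairs sharing vendor, amount, and product whose dates differ by 1..max_days days."""
    groups = defaultdict(list)
    for payment in payments:
        key = (payment["vendor"], payment["amount"], payment["product"])
        groups[key].append(payment)

    pairs = []
    for group in groups.values():
        for first, second in combinations(group, 2):
            gap = abs((first["date"] - second["date"]).days)
            if 0 < gap <= max_days:  # gap of 0 is an exact match, excluded here
                pairs.append((first["txn"], second["txn"]))
    return pairs


# Hypothetical sample data.
payments = [
    {"txn": "T1", "vendor": "V100", "amount": 250.00, "product": "P7", "date": date(2020, 6, 1)},
    {"txn": "T2", "vendor": "V100", "amount": 250.00, "product": "P7", "date": date(2020, 6, 4)},
    {"txn": "T3", "vendor": "V100", "amount": 250.00, "product": "P7", "date": date(2020, 6, 20)},
    {"txn": "T4", "vendor": "V200", "amount": 250.00, "product": "P7", "date": date(2020, 6, 2)},
]
print(near_date_duplicates(payments))  # [('T1', 'T2')]
```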
Test #3: Duplicate Payments with Near Amounts/Dates
• Same Vendor
• Same Product Number
• Amount within $10
Run the DUPLICATES command, selecting the first three fields in this order
in the “Field(s) to test for Duplicates”
Change the “Last duplicate field is” parameter to “Near” and change the
value to 10.
Select Transaction Number and Transaction Date from the “List fields” list.
In the result, create a filter: ABS(Transaction Date 1 - Transaction Date 2) < 14
Result: 7239 pairs of possible duplicates
Test #4: Duplicate Payments Similar Invoice Numbers
• Same Vendor
• Same Product Number
• Similar Invoice Number
Run the DUPLICATES command, selecting the first three fields in this order
in the “Field(s) to test for Duplicates”
Change the “Last duplicate field is” parameter to “Similar” and change the
value to 1.
Select Transaction Number and Transaction Date from the “List fields” list.
Result: 63 pairs of possible duplicates
Test #5: Duplicate Vendor Addresses
Create a computed field to SortNormalize the Vendor Address:
SORTNORMALIZE(Vendor_Address,"ADDR.TXT")
Run the DUPLICATES command on this computed field. (You may want to
include zip code.)
Select other fields from the “List fields” list.
Result: 39 possible duplicates
Test #6: Duplicate/Similar Vendor-OFAC Addresses
Create computed fields to SortNormalize the Vendor and OFAC Addresses:
Run a Many-to-Many JOIN using the computed fields as the key fields. Add a filter to the JOIN*
Difference(OFAC_Address_SORTNORM,Vendor_Master_Extract.Vendor_Address_SORTNORM) <= 1
Result: 15 pairs of possible duplicates
*Best Practice: If your JOIN includes computed fields, first run EXTRACT FIELDS on each file, keeping only the
minimum necessary fields. This writes the computed fields out as physical fields.
Then execute the JOIN between the two new tables, including the zip codes in the filter for more
precision. Physical fields process much faster than computed fields.
Test #7: Duplicate/Similar Vendor-OFAC Addresses
(Word Match %)
The script will calculate the number of common words between all possible
normalized address pairs.
It will then calculate the percent match for each address, which is the number of
common words divided by the total number of words.
The final match score is the average of the two scores.
The final output includes exact matches and all other matches where the match
score is greater than or equal to 75%.
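The scoring described above, as a minimal Python sketch. The webinar's script itself is not shown, so this is only an interpretation: words are counted as unique words per address (an assumption), and the function names are illustrative.

```python
def word_match_score(address_a, address_b):
    """Average of the two per-address shares of words found in the other address."""
    words_a = set(address_a.split())
    words_b = set(address_b.split())
    if not words_a or not words_b:
        return 0.0
    common = len(words_a & words_b)
    percent_a = common / len(words_a)  # share of A's words also in B
    percent_b = common / len(words_b)  # share of B's words also in A
    return (percent_a + percent_b) / 2


def is_word_match(address_a, address_b, threshold=0.75):
    """Apply the 75% cut-off from the slide to two normalized addresses."""
    return word_match_score(address_a, address_b) >= threshold


# Normalized addresses: 3 common words, so (3/3 + 3/4) / 2 = 0.875.
print(word_match_score("ST MAIN 123", "ST MAIN 200 123"))  # 0.875
print(is_word_match("ST MAIN 123", "RD OAK 123"))          # False
```

Averaging the two percentages keeps a short address from scoring 100% just because all of its words appear inside a much longer one.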
Test #8: Similar Vendor Phone Numbers
Convert phone numbers to numeric values and test for near matches:
1) Remove all non-numeric characters.
2) Convert the result to a numeric value in a computed field.
3) Execute DUPLICATES on the new field with a NEAR parameter equal to 1.
Computed Field: VALUE(INCLUDE(Vendor Phone,"0~9"),0)
Result: 29 matched pairs
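A Python sketch of the same three steps. The vendor IDs and phone numbers are made up, the function names are illustrative, and counting identical numbers as matches is an assumption about how NEAR treats exact duplicates.

```python
import re


def phone_to_number(phone):
    """Steps 1 and 2: keep only the digits, then convert to an integer (0 if no digits)."""
    digits = re.sub(r"\D", "", phone)
    return int(digits) if digits else 0


def near_phone_pairs(vendor_phones, tolerance=1):
    """Step 3: vendor pairs whose numeric phone values differ by at most `tolerance`."""
    numbered = sorted((phone_to_number(phone), vendor) for vendor, phone in vendor_phones.items())
    pairs = []
    for index, (number_a, vendor_a) in enumerate(numbered):
        for number_b, vendor_b in numbered[index + 1:]:
            if number_b - number_a > tolerance:
                break  # the list is sorted, so later numbers are even further away
            pairs.append((vendor_a, vendor_b))
    return pairs


# Hypothetical sample data.
vendor_phones = {
    "V1": "(408) 555-0100",
    "V2": "408.555.0101",
    "V3": "408-555-0199",
}
print(near_phone_pairs(vendor_phones))  # [('V1', 'V2')]
```

Stripping the formatting first means "(408) 555-0100" and "408.555.0100" compare as the same number, and the numeric comparison then catches numbers that are one digit apart at the end.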
Test #9: Similar Employee Names: HR vs PCard
Two databases might not be in sync with regard to employee names:
misspellings, changes in marital status, “Von Bulow” vs “Vonbulow”.
The PCard list doesn’t always include the employee number, just the first and last name
plus the last 4 digits of the card. The last 4 digits of the card are not unique.
Matching the PCard list against HR data can leave cards unmatched
when joining on the combination of last 4 digits + first name + last name.
[Screenshots: Card List and Transactions tables]
Test #9: Similar Employee Names: HR vs PCard
1) Identify which database stores last name as “Von Schmidt”
2) Create computed field that combines the two parts of the last name and
convert to uppercase: UPPER(EXCLUDE(Last Name," "))
3) Rejoin the databases and isolate new unmatched.
4) Using new unmatched, execute a fuzzy join where the last 4 are the same
and the names are within 1 character of each other.
5) Remnants (final unmatched) are likely due to name changes.
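The steps above can be sketched in Python. All field names ("id", "emp_no", "first", "last", "last4") and the sample records are hypothetical, and "within 1 character" is interpreted here as at most one insertion, deletion, or substitution; Arbutus's own fuzzy join may measure it differently.

```python
def name_key(first, last):
    """Step 2: join the name parts, drop spaces, and uppercase."""
    return (first + last).replace(" ", "").upper()


def within_one_edit(a, b):
    """True when a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def match_cards(cards, employees):
    """Steps 3-5: exact join first, then a fuzzy join on the same last 4 digits."""
    exact, fuzzy, unmatched = [], [], []
    for card in cards:
        key = name_key(card["first"], card["last"])
        same_last4 = [e for e in employees if e["last4"] == card["last4"]]
        exact_hits = [e for e in same_last4 if name_key(e["first"], e["last"]) == key]
        fuzzy_hits = [e for e in same_last4 if within_one_edit(name_key(e["first"], e["last"]), key)]
        if exact_hits:
            exact.append((card["id"], exact_hits[0]["emp_no"]))
        elif fuzzy_hits:
            fuzzy.append((card["id"], fuzzy_hits[0]["emp_no"]))
        else:
            unmatched.append(card["id"])  # remnants, e.g. name changes
    return exact, fuzzy, unmatched


# Hypothetical sample data.
employees = [
    {"emp_no": "E1", "first": "Anna", "last": "Von Schmidt", "last4": "1234"},
    {"emp_no": "E2", "first": "Maria", "last": "Smith", "last4": "5678"},
    {"emp_no": "E3", "first": "Dana", "last": "Lee", "last4": "9012"},
]
cards = [
    {"id": "C1", "first": "Anna", "last": "Vonschmidt", "last4": "1234"},
    {"id": "C2", "first": "Maria", "last": "Smyth", "last4": "5678"},
    {"id": "C3", "first": "Dana", "last": "Park", "last4": "9012"},
]
print(match_cards(cards, employees))
# ([('C1', 'E1')], [('C2', 'E2')], ['C3'])
```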
Summary
• Identify data where manual input has occurred or where counterparty has provided input.
• Test for consistency.
• Identify tests needed to reduce risk.
• Examine the functionality for ways to make necessary changes.
• Call Tech Support
How to Normalize Addresses and Detect Hidden Duplicates
Any Questions?
Live Webinar – Q&A
When is a Duplicate not a
Duplicate?
Data Quality Management- Practical tests
Michael Kano (ACDA)
Data Analyst Consultant, Arbutus
mkano@arbutussoftware.com | LinkedIn: Michael Kano
www.arbutusanalytics.com | Phone: (408) 887-4843
Click to read our latest article about Arbutus Analyzer - Technical
Insights. Author: Michael Kano
THANK YOU

More Related Content

What's hot (20)

PDF
mplementing and Auditing GDPR Series (10 of 10)
Jim Kaplan CIA CFE
 
PDF
Focused agile audit planning using analytics
Jim Kaplan CIA CFE
 
PDF
Implementing and Auditing GDPR Series (3 of 10)
Jim Kaplan CIA CFE
 
PDF
Agile auditing for financial services
Jim Kaplan CIA CFE
 
PDF
General Data Protection Regulation for Auditors 5 of 10
Jim Kaplan CIA CFE
 
PPTX
Implementing and Auditing GDPR Series (9 of 10)
Jim Kaplan CIA CFE
 
PDF
How to use ai apps to unleash the power of your audit program
Jim Kaplan CIA CFE
 
PDF
Is Your Audit Department Highly Effective?
Jim Kaplan CIA CFE
 
PDF
GDPR Series Session 4
Jim Kaplan CIA CFE
 
PDF
Ethics and the Internal Auditor
Jim Kaplan CIA CFE
 
PDF
Implementing and Auditing General Data Protection Regulation
Jim Kaplan CIA CFE
 
PDF
Ethics for internal auditors
Jim Kaplan CIA CFE
 
PDF
Driving More Value With Automated Analytics
Jim Kaplan CIA CFE
 
PPTX
Ethics for Internal Auditors
Jim Kaplan CIA CFE
 
PDF
How analytics should be used in controls testing instead of sampling
Jim Kaplan CIA CFE
 
PDF
How to build a data analytics strategy in a digital world
Jim Kaplan CIA CFE
 
PDF
Enhanced fraud detection with data analytics
Jim Kaplan CIA CFE
 
PDF
IT Fraud Series: Data Analytics
Jim Kaplan CIA CFE
 
PDF
Auditing Social Media
Jim Kaplan CIA CFE
 
PDF
How to prepare for your first anti fraud review
Jim Kaplan CIA CFE
 
mplementing and Auditing GDPR Series (10 of 10)
Jim Kaplan CIA CFE
 
Focused agile audit planning using analytics
Jim Kaplan CIA CFE
 
Implementing and Auditing GDPR Series (3 of 10)
Jim Kaplan CIA CFE
 
Agile auditing for financial services
Jim Kaplan CIA CFE
 
General Data Protection Regulation for Auditors 5 of 10
Jim Kaplan CIA CFE
 
Implementing and Auditing GDPR Series (9 of 10)
Jim Kaplan CIA CFE
 
How to use ai apps to unleash the power of your audit program
Jim Kaplan CIA CFE
 
Is Your Audit Department Highly Effective?
Jim Kaplan CIA CFE
 
GDPR Series Session 4
Jim Kaplan CIA CFE
 
Ethics and the Internal Auditor
Jim Kaplan CIA CFE
 
Implementing and Auditing General Data Protection Regulation
Jim Kaplan CIA CFE
 
Ethics for internal auditors
Jim Kaplan CIA CFE
 
Driving More Value With Automated Analytics
Jim Kaplan CIA CFE
 
Ethics for Internal Auditors
Jim Kaplan CIA CFE
 
How analytics should be used in controls testing instead of sampling
Jim Kaplan CIA CFE
 
How to build a data analytics strategy in a digital world
Jim Kaplan CIA CFE
 
Enhanced fraud detection with data analytics
Jim Kaplan CIA CFE
 
IT Fraud Series: Data Analytics
Jim Kaplan CIA CFE
 
Auditing Social Media
Jim Kaplan CIA CFE
 
How to prepare for your first anti fraud review
Jim Kaplan CIA CFE
 

Similar to When is a Duplicate not a Duplicate? Detecting Errors and Fraud (20)

PDF
Retrospective data analytics slides
Jim Kaplan CIA CFE
 
PDF
Visualize audit sampling and fraud detection in excel
Jim Kaplan CIA CFE
 
PDF
Future audit analytics
Jim Kaplan CIA CFE
 
PDF
The Future of Auditing and Fraud Detection
Jim Kaplan CIA CFE
 
PDF
Structuring your organization for success with data analytics
Jim Kaplan CIA CFE
 
PDF
Audit analytics and the agile auditor
Jim Kaplan CIA CFE
 
PDF
The Truth Behind Detecting Fraud Using Data Analytics
Jim Kaplan CIA CFE
 
PDF
How to breakthrough barriers and drive more value from your data analytics pr...
Jim Kaplan CIA CFE
 
PDF
Top 10 excel analytic tests to minimize fraud and process risks
Jim Kaplan CIA CFE
 
PDF
Robotic Process Auditing
Jim Kaplan CIA CFE
 
PDF
Fraud auditing creative techniques
Jim Kaplan CIA CFE
 
PDF
Are You a Smart CAAT or a Copy CAAT
Jim Kaplan CIA CFE
 
PDF
Building and Striving for Data Analytics Excellence
Jim Kaplan CIA CFE
 
PDF
Fieldwork Webinar
Jim Kaplan CIA CFE
 
PDF
How to data mine your print reports
Jim Kaplan CIA CFE
 
PDF
IDEA Basics, Getting Started, and Basics of Importing Data
Jim Kaplan CIA CFE
 
PDF
Uncovering Fraud in Key Financial Accounts using Data Analysis
FraudBusters
 
PDF
IT Fraud and Countermeasures
Jim Kaplan CIA CFE
 
PDF
How analytics should be used in controls testing instead of sampling
Jim Kaplan CIA CFE
 
PDF
20141203 akjk
Jim Kaplan CIA CFE
 
Retrospective data analytics slides
Jim Kaplan CIA CFE
 
Visualize audit sampling and fraud detection in excel
Jim Kaplan CIA CFE
 
Future audit analytics
Jim Kaplan CIA CFE
 
The Future of Auditing and Fraud Detection
Jim Kaplan CIA CFE
 
Structuring your organization for success with data analytics
Jim Kaplan CIA CFE
 
Audit analytics and the agile auditor
Jim Kaplan CIA CFE
 
The Truth Behind Detecting Fraud Using Data Analytics
Jim Kaplan CIA CFE
 
How to breakthrough barriers and drive more value from your data analytics pr...
Jim Kaplan CIA CFE
 
Top 10 excel analytic tests to minimize fraud and process risks
Jim Kaplan CIA CFE
 
Robotic Process Auditing
Jim Kaplan CIA CFE
 
Fraud auditing creative techniques
Jim Kaplan CIA CFE
 
Are You a Smart CAAT or a Copy CAAT
Jim Kaplan CIA CFE
 
Building and Striving for Data Analytics Excellence
Jim Kaplan CIA CFE
 
Fieldwork Webinar
Jim Kaplan CIA CFE
 
How to data mine your print reports
Jim Kaplan CIA CFE
 
IDEA Basics, Getting Started, and Basics of Importing Data
Jim Kaplan CIA CFE
 
Uncovering Fraud in Key Financial Accounts using Data Analysis
FraudBusters
 
IT Fraud and Countermeasures
Jim Kaplan CIA CFE
 
How analytics should be used in controls testing instead of sampling
Jim Kaplan CIA CFE
 
20141203 akjk
Jim Kaplan CIA CFE
 
Ad

Recently uploaded (20)

PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PDF
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PPTX
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PDF
apidays Helsinki & North 2025 - REST in Peace? Hunting the Dominant Design fo...
apidays
 
PDF
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
PPTX
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
PPTX
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
PDF
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
PDF
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
PDF
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
PPTX
Climate Action.pptx action plan for climate
justfortalabat
 
PDF
Copia de Strategic Roadmap Infographics by Slidesgo.pptx (1).pdf
ssuserd4c6911
 
PPTX
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PPT
deep dive data management sharepoint apps.ppt
novaprofk
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
apidays Helsinki & North 2025 - REST in Peace? Hunting the Dominant Design fo...
apidays
 
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
apidays Helsinki & North 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (A...
apidays
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
Context Engineering for AI Agents, approaches, memories.pdf
Tamanna
 
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
Climate Action.pptx action plan for climate
justfortalabat
 
Copia de Strategic Roadmap Infographics by Slidesgo.pptx (1).pdf
ssuserd4c6911
 
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
deep dive data management sharepoint apps.ppt
novaprofk
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
Ad

When is a Duplicate not a Duplicate? Detecting Errors and Fraud

  • 1. 6/24/2020 1 Key Presenter: Michael Kano, ACDA Data Analytics Consultant, ArbutusAnalytics When is a Duplicate not a Duplicate? Detecting Errors and Fraud About Jim Kaplan, CIA, CFE  President and Founder of AuditNet®, the global resource for auditors (available on iOS, Android and Windows devices)  Auditor, Web Site Guru,  Internet for Auditors Pioneer  IIA Bradford Cadmus Memorial Award Recipient  Local Government Auditor’s Lifetime Award  Author of “The Auditor’s Guide to Internet Resources” 2nd Edition 6/24/2020 1 2
  • 2. 6/24/2020 2 About AuditNet® LLC • AuditNet®, the global resource for auditors, serves the global audit community as the primary resource for Web-based auditing content. As the first online audit portal, AuditNet® has been at the forefront of websites dedicated to promoting the use of audit technology. • Available on the Web, iPad, iPhone, Windows and Android devices and features: • Over 3,100 Reusable Templates, Audit Programs, Questionnaires, and Control Matrices • Webinars focusing on fraud, data analytics, IT audit, and internal audit with free CPE for subscribers and site license users. • Audit guides, manuals, and books on audit basics and using audit technology • LinkedIn Networking Groups • Monthly Newsletters with Expert Guest Columnists • Surveys on timely topics for internal auditors Introductions 6/24/2020 HOUSEKEEPING This webinar and its material are the property of AuditNet® and its Webinar partners. Unauthorized usage or recording of this webinar or any of its material is strictly forbidden.  If you logged in with another individual’s confirmation email you will not receive CPE as the confirmation login is linked to a specific individual  This Webinar is not eligible for viewing in a group setting. You must be logged in with your unique join link.  We are recording the webinar and you will be provided access to that recording after the webinar. Downloading or otherwise duplicating the webinar recording is expressly prohibited.  If you meet the criteria for earning CPE you will receive a confirmation link via email to for your CPE. You must first complete the course evaluation and then you will receive the link to download your CPE. The official email for CPE will be sent from [email protected] and it is important to white list this address. There may be a processing fee to have your CPE credit regenerated if you did not receive the first mailing.  CPE will only be sent to those who have opted in for the AuditNet mailing list. 
 Submit questions via the chat box on your screen and we will answer them either during or at the conclusion.  You must answer the survey questions after the Webinar or before downloading your certificate. 3 4
  • 3. 6/24/2020 3 IMPORTANT INFORMATION REGARDING CPE!  ATTENDEES - If you opted in to the AuditNet mailing list, attended the entire Webinar you will receive a confirmation email for the CPE certificate. You must complete the course evaluation in order to receive the download link for CPE. The official email for CPE will be issued via [email protected] and it is important to white list this address. There may be a processing fee to have your CPE credit regenerated after the initial distribution.  We cannot manually generate a CPE certificate as these are handled by our 3rd party provider. We highly recommend that you work with your IT department to identify and correct any email delivery issues prior to attending the Webinar. Issues would include blocks or spam filters in your email system or a firewall that will redirect or not allow delivery of this email from Gensend.io  You must opt in for our mailing list. If you indicate that you do not want to receive our e-mails, you can attend the Webinar but will not receive CPE.  We are not responsible for any connection, audio or other computer related issues. You must have pop-ups enabled on your computer otherwise you will not be able to answer the polling questions which occur approximately every 20 minutes. We suggest that if you have any pressing issues to see to that you do so immediately after a polling question. The views expressed by the presenters do not necessarily represent the views, positions, or opinions of AuditNet® LLC. These materials, and the oral presentation accompanying them, are for educational purposes only and do not constitute accounting or legal advice or create an accountant-client relationship. While AuditNet® makes every effort to ensure information is accurate and complete, AuditNet® makes no representations, guarantees, or warranties as to the accuracy or completeness of the information provided via this presentation. 
AuditNet® specifically disclaims all liability for any claims or damages that may result from the information contained in this presentation, including any websites maintained by third parties and linked to the AuditNet®website. Any mention of commercial products is for information only; it does not imply recommendation or endorsement by AuditNet® LLC 5 6
  • 4. 6/24/2020 4 Michael Kano Data Analytics Consultant, Arbutus Analytics Michael has 25 years of experience in data analytics and internal audit with organizations in the USA, Canada, and the Middle East. From 2015 to 2019, he was a senior member of the data analytics practice at Focal Point Data Risk, a US-based professional services firm. Prior to Focal Point, Michael led eBay, Inc.’s data analytics program in the Internal Audit department. He was tasked with integrating data analytics into the audit workflow on strategic and tactical levels.This included developing quality and documentation standards, training users, and providing analytics support on numerous audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such as the Business EthicsOffice and Enterprise Risk Management teams. During his years at eBay, Michael supported audits throughout the organization in the IT, compliance, operations, vendor management, revenue assurance,T&E, and human resources areas. Michael's software experience includesArbutusAnalyzer,ACL Desktop/Direct Link,Alteryx, MicrosoftAccess, SQL, andTableau. He led ACL Services Ltd.’s global training team for 8 years. He is a graduate of the UCLAAnderson School of Management. Why fuzzy testing?  Detect fraud  Identify errors  Reduce false positives 8 7 8
  • 5. 6/24/2020 5 Detect Fraud  Multiple billings  Vendors with same address  Counterparties on watch lists (OFAC, GSA) 9 Identify Errors  Input  Optical character recognition (OCR)  Multiple databases not synchronized  Fatigue 1 9 10
  • 6. 6/24/2020 6 Reduce False Positives  Less time spent on dead ends  Increased efficiency 1 Automated Analytics  Weak or absent controls require continuous auditing/monitoring.  Scripted solutions allow for complex algorithms to be run against data to mitigate these risks.  Automation of best practices ensures consistency and adds to efficiency.  Powerful DA tools include functionality that makes sophisticated testing and detection possible. 1 11 12
  • 7. 6/24/2020 7 Analytic Functionality  Functions  Normalize()  SortNormalize()  Format()  Include()/Exclude()  Commands  DUPLICATES with Different, Near, & Similar parameters  JOIN 1  Near()  Similar()  Difference() Are these the same addresses? Addr1: 2847 Congress Pkwy West Addr2: Suite 201 Addr1: #201, 2847W Congress Parkway Addr2: Addr1: 125 Fifth Str. E Addr1: 125 East 5th Street Addr1: 707 Rooke Road Addr1: 707 Rook Rd Addr1: 3960 Monjah Circle Addr1: 3960 Monja Circle 13 14
  • 8. 6/24/2020 8 Normalizing Data  Normalize( Vendor_Address,'addr2.txt’ ) 16023, 40th Way South  16023 40TH WAY S #105, 1470 Boston Street  105 1470 BOSTON ST  SortNormalize( Vendor_Address,'addr2.txt’ ) 16023, 40th Way South  WAY S 40TH 16023 #105, 1470 Boston Street  ST BOSTON 1470 105 Checking for Matching or Close Addresses 205 E. 10th St 205 10th Street East Original 205 E 10TH ST 205 10TH ST E Normalized ST E 205 10TH ST E 205 10TH SortNormalized Matched! 15 16
  • 9. 6/24/2020 9 Elizabeth or Rick by any other name? BESS LIB DICK BESSIE LIBBY DICKIE BET LIDDY BRODERICK BETH LILIBET CEDRIC BETSY LISBETH DERRICK BETTE LISSIE ERIC BETTY LIZ RICH ELISE LIZA RICHARD ELSA LIZBETH RICHIE LIZZIE LIZZY RICKY Normalizing Data Normalize( First,'female name substitution table.sub,male name substitution table.sub’ ) JOHANN  JOHN JOHNNY  JOHN JON  JOHN JONATHAN  JOHN JENNIE  JEN JENNY  JEN JENNIFER  JEN JENN  JEN 17 18
  • 10. Quick Lesson: A Usable Fuzzy Algorithm
      - ‘Rob’ COMPARED TO ‘Robert’ = 3
      - ‘Gary’ COMPARED TO ‘Mary’ = 1
      - ‘Gary’ COMPARED TO ‘Gray’ = 1
      - ‘123 Main Street’ COMPARED TO ‘123 Main St’ = 4
      - In Arbutus, used in the NEAR, SIMILAR & DIFFERENCE functions/parameters
    When to use Near and when to use Similar
      - NEAR
          - Character fields: straight string/character data comparison
          - Numeric fields: looks for numeric proximity
          - Date fields: looks for date/time proximity
      - SIMILAR
          - Character fields: pre-modifies data for visually similar characteristics before doing string comparison
          - Numeric fields: converts to character data before processing
          - Date fields: converts to character data before processing
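The slides do not name the fuzzy algorithm. The four values in the Quick Lesson are consistent with an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent characters as one edit each (the "optimal string alignment" variant of Damerau-Levenshtein). The Python sketch below works under that assumption; it reproduces the slide values but is not confirmed to be what Arbutus uses internally.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two strings.

    Each insertion, deletion, substitution, or swap of two adjacent
    characters counts as one edit. Assumed, not confirmed, to match
    the algorithm behind the slide's examples.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


if __name__ == "__main__":
    for left, right in [("Rob", "Robert"), ("Gary", "Mary"),
                        ("Gary", "Gray"), ("123 Main Street", "123 Main St")]:
        print(repr(left), "COMPARED TO", repr(right), "=", edit_distance(left, right))
```

Note that a plain Levenshtein distance without the swap rule would score ‘Gary’ vs ‘Gray’ as 2, which is why the swap rule is assumed here.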
  • 11. Today’s Tests
      1) Duplicate Payments (Identical)
      2) Duplicate Payments with Near Dates
      3) Duplicate Payments with Near Amounts and Dates
      4) Duplicate Payments Similar Invoice Numbers
      5) Duplicate Vendor Addresses
      6) Duplicate/Similar Vendor-OFAC Addresses
      7) Duplicate/Similar Vendor-OFAC Addresses (Word Match %)
      8) Similar Vendor Phone Numbers
      9) Similar Employee Names: HR vs PCard
    Test #1: Duplicate Payments (Identical)
      - Same Date
      - Same Vendor
      - Same Amount
      - Same Product Number
      - Same Invoice Number
      Run the DUPLICATES command, selecting these fields under “Field(s) to test for Duplicates”.
      Select “All Fields” from the “List fields” list.
      Result: No identical payments
  • 12. Test #2: Duplicate Payments with Near Dates
      - Same Vendor
      - Same Amount
      - Same Product Number
      - Transaction dates within 5 days of each other (no exacts)
      Run the DUPLICATES command, selecting these fields in this order under “Field(s) to test for Duplicates”.
      Change the “Last duplicate field is” parameter to “Near” and change the value to 5.
      Select Transaction Number and Invoice Number from the “List fields” list.
      Result: 3 pairs of possible duplicates
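The logic of Test #2 can be sketched in Python as a grouping step followed by a date-gap check. This is an illustration only, not the DUPLICATES command: the record keys (`txn`, `vendor`, `amount`, `product`, `date`) are hypothetical, and treating "within 5 days" as inclusive of exactly 5 days is an assumption.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def near_date_duplicates(payments, max_days=5):
    """Return pairs of transaction numbers that share vendor, amount and
    product, with transaction dates 1..max_days days apart (no exact dates)."""
    groups = defaultdict(list)
    for p in payments:
        groups[(p["vendor"], p["amount"], p["product"])].append(p)

    pairs = []
    for rows in groups.values():
        for a, b in combinations(rows, 2):
            gap = abs((a["date"] - b["date"]).days)
            if 0 < gap <= max_days:  # assumption: 5 days itself still counts
                pairs.append((a["txn"], b["txn"]))
    return pairs


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        {"txn": 1, "vendor": "V1", "amount": 100.0, "product": "P1", "date": date(2020, 1, 1)},
        {"txn": 2, "vendor": "V1", "amount": 100.0, "product": "P1", "date": date(2020, 1, 4)},
        {"txn": 3, "vendor": "V1", "amount": 100.0, "product": "P1", "date": date(2020, 1, 20)},
    ]
    print(near_date_duplicates(sample))
```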
  • 13. Test #3: Duplicate Payments with Near Amounts/Dates
      - Same Vendor
      - Same Product Number
      - Amount within $10
      Run the DUPLICATES command, selecting the first three fields in this order under “Field(s) to test for Duplicates”.
      Change the “Last duplicate field is” parameter to “Near” and change the value to 10.
      Select Transaction Number and Transaction Date from the “List fields” list.
      In the result, create a filter: ABS(Transaction Date 1 – Transaction Date 2) < 14
      Result: 7239 pairs of possible duplicates
    Test #4: Duplicate Payments Similar Invoice Numbers
      - Same Vendor
      - Same Product Number
      - Similar Invoice Number
      Run the DUPLICATES command, selecting the first three fields in this order under “Field(s) to test for Duplicates”.
      Change the “Last duplicate field is” parameter to “Similar” and change the value to 1.
      Select Transaction Number and Transaction Date from the “List fields” list.
      Result: 63 pairs of possible duplicates
  • 14. Test #5: Duplicate Vendor Addresses
      Create a computed field to SortNormalize the Vendor Address:
        SORTNORMALIZE(Vendor_Address,"ADDR.TXT")
      Run the DUPLICATES command on this computed field. (You may want to include zip code.)
      Select other fields from the “List fields” list.
      Result: 39 possible duplicates
  • 15. Test #6: Duplicate/Similar Vendor-OFAC Addresses
      Create computed fields to SortNormalize the Vendor and OFAC Addresses.
      Run a Many-to-Many JOIN using the computed fields as the key fields.
      Add a filter to the JOIN*:
        Difference(OFAC_Address_SORTNORM,Vendor_Master_Extract.Vendor_Address_SORTNORM) <= 1
      Result: 15 pairs of possible duplicates
      *Best Practice: If your JOIN includes computed fields, EXTRACT FIELDS for each file, keeping only the minimum necessary fields. This causes the computed fields to be written out as physical fields. Then execute the JOIN between the two new tables, including the zip codes in the filter for more precision. Physical fields process much faster than computed fields.
  • 16. Test #7: Duplicate/Similar Vendor-OFAC Addresses (Word Match %)
      The script will calculate the number of common words between all possible normalized address pairs.
      It will then calculate the percent match for each address, which is the number of common words divided by the total number of words in that address.
      The final match score is the average of the two scores.
      The final output includes exact matches and all other matches where the match score is greater than or equal to 75%.
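The scoring described for Test #7 can be sketched as follows. The script itself is not shown in the slides, so this Python version is an interpretation: it assumes the addresses are already normalized, splits on whitespace, and counts each distinct word once. The function names are hypothetical.

```python
def word_match_score(addr_a, addr_b):
    """Average of the two per-address match percentages (0.0 to 1.0).

    Each percentage is: common words / total words in that address.
    Assumption: repeated words within one address are counted once.
    """
    words_a, words_b = set(addr_a.split()), set(addr_b.split())
    if not words_a or not words_b:
        return 0.0
    common = len(words_a & words_b)
    return (common / len(words_a) + common / len(words_b)) / 2


def word_matches(vendor_addrs, watchlist_addrs, threshold=0.75):
    """Compare every vendor address with every watch-list address and keep
    pairs scoring at or above the threshold (exact matches score 1.0)."""
    results = []
    for v in vendor_addrs:
        for w in watchlist_addrs:
            score = word_match_score(v, w)
            if score >= threshold:
                results.append((v, w, score))
    return results


if __name__ == "__main__":
    print(word_match_score("ST E 205 10TH", "ST 205 10TH"))
```

For example, "ST E 205 10TH" and "ST 205 10TH" share three words, giving 3/4 for the first address and 3/3 for the second, so the average of the two clears the 75% cut-off.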
  • 17. Test #8: Similar Vendor Phone Numbers
      Convert phone numbers to numeric values in a computed field, then test for near matches:
      1) Remove all non-numeric characters.
      2) Convert to a numeric value.
      3) Execute DUPLICATES on the new field with a NEAR parameter equal to 1.
      Computed Field: VALUE(INCLUDE(Vendor Phone,"0~9"),0)
      Result: 29 matched pairs
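The three steps of Test #8 translate directly into a short Python sketch. It is an illustration, not the Arbutus computed field: the function names are hypothetical, and including identical numbers (difference 0) in the result is an assumption about how the NEAR parameter behaves.

```python
import re
from itertools import combinations


def phone_to_number(phone):
    """Keep only the digits 0-9 and convert to an integer (0 if no digits)."""
    digits = re.sub(r"[^0-9]", "", phone)
    return int(digits) if digits else 0


def near_phone_pairs(phones, tolerance=1):
    """Return pairs of phone strings whose numeric values differ by at most
    `tolerance`. Assumption: a difference of 0 also counts as a match."""
    pairs = []
    for a, b in combinations(phones, 2):
        if abs(phone_to_number(a) - phone_to_number(b)) <= tolerance:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(near_phone_pairs(["(408) 555-0100", "408-555-0101", "408.555.0199"]))
```

Comparing every pair is quadratic in the number of vendors; it keeps the sketch short, while a production version would sort the numeric values first.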
  • 18. Test #9: Similar Employee Names: HR vs PCard
      Two databases might not be in sync with regard to employee names: misspellings, marital status changes, “Von Bulow” vs “Vonbulow”.
      The PCard list doesn’t always include the employee number, only first and last name plus the last 4 digits of the card.
      The last 4 digits of the card are not unique.
      Matching the PCard list against HR data can therefore leave cards unmatched when joining on the combination of last 4 digits + first name + last name.
      [Figure: Card List and Transactions tables]
  • 19. Test #9: Similar Employee Names: HR vs PCard
      1) Identify which database stores last name as “Von Schmidt”.
      2) Create a computed field that combines the two parts of the last name and converts it to uppercase: UPPER(EXCLUDE(Last Name," "))
      3) Rejoin the databases and isolate the new unmatched records.
      4) Using the new unmatched records, execute a fuzzy join where the last 4 digits are the same and the names are within 1 character of each other.
      5) Remnants (final unmatched) are likely due to name changes.
    Summary
      - Identify data where manual input has occurred or where a counterparty has provided input.
      - Test for consistency.
      - Identify tests needed to reduce risk.
      - Examine the functionality for ways to make necessary changes.
      - Call Tech Support
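Steps 2 and 4 of Test #9 can be sketched in Python as a name clean-up followed by a fuzzy join. This is a simplified illustration under several assumptions: the record keys (`emp_no`, `first`, `last`, `last4`) are hypothetical, first and last names are compared as one combined string, and a plain Levenshtein distance is used for brevity rather than whatever Arbutus applies.

```python
def squash(name):
    """Mirror UPPER(EXCLUDE(name, " ")): drop spaces and uppercase."""
    return name.replace(" ", "").upper()


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), two-row version."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_join(hr_rows, card_rows, max_dist=1):
    """Pair cards with HR records that share the last 4 digits and whose
    combined names are within max_dist edits of each other."""
    matches = []
    for card in card_rows:
        card_name = squash(card["first"] + card["last"])
        for emp in hr_rows:
            if emp["last4"] != card["last4"]:
                continue
            if levenshtein(card_name, squash(emp["first"] + emp["last"])) <= max_dist:
                matches.append((card["last4"], emp["emp_no"]))
    return matches


if __name__ == "__main__":
    # Made-up records for illustration only.
    hr = [{"emp_no": 101, "first": "Klaus", "last": "Von Bulow", "last4": "1234"}]
    cards = [{"first": "Klaus", "last": "Vonbulow", "last4": "1234"}]
    print(fuzzy_join(hr, cards))
```

Cards left unmatched after this pass correspond to the "remnants" in step 5.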
  • 20. How to Normalize Addresses and Detect Hidden Duplicates
    Any Questions?
    Live Webinar – Q&A
    When is a Duplicate not a Duplicate? Data Quality Management - Practical tests
    Michael Kano (ACDA), Data Analyst Consultant, Arbutus
    [email protected] | LinkedIn: Michael Kano
    www.arbutusanalytics.com | Phone: (408) 887-4843
    Click to read our latest article about Arbutus Analyzer - Technical Insights. Author: Michael Kano
    THANK YOU