Middleware
1 Session 185
ORACLE SOA SUITE 11G TROUBLESHOOTING METHODOLOGY
Harold A. Dost III, Raastech, Inc.
ABSTRACT
Most troubleshooting guides simply list out solutions to
common errors. This paper introduces a troubleshooting
methodology surrounding performance, composite
instances, deployment, and logging. The goal is to better
equip the reader to solve most problems that pertain to
the SOA infrastructure and its executed transactions, and
to show where to look, what to look for, and what to do
afterward.
TARGET AUDIENCE
This paper is intended for every Oracle SOA Suite 11g
developer and administrator.
EXECUTIVE SUMMARY
There is no guarantee that every error is an easy fix.
However, this paper will provide the reader with a better
understanding of where to look for errors, how to
categorize them, and how to deal with them in an
appropriate manner. With the tools explained later on,
quicker resolutions may be achieved, which will produce
more efficient development and better support.
BACKGROUND
According to Splunk, one of their clients, Macy’s, noted
that tracking down the exact cause of a problem could be
“exceedingly difficult.” It often required a team composed
of members from various IT functional areas to fix these
problems. Even with these teams, resolutions still took
days.
In the past, when an issue presented itself, the blame
always fell on the network administrators. Over time,
technology in that area has improved, and network issues
at most companies are now few and far between by
comparison. Everything is connected higher in the stack
through various integration servers and technologies. The
blame is now often shifted to the integration team, but
much like people blaming a browser for a bad Internet
connection, this blame is often misdirected.
Customers hold a company, and by extension its integration
team, responsible for maintaining near-continuous, reliable
services. This puts a lot of pressure on the integration
team to quickly determine whether an error is something
within their realm or whether it needs a different group’s
attention.
One of the biggest pains with tracking down issues in
middleware is that it is composed of so many layers. For
example, a web application might make a call. The payload
first goes through Oracle Enterprise Gateway (OEG)
because it is going from the Internet to the company
intranet. Then the company uses Oracle Service Bus
(OSB) for all internal service calls to abstract the naming
and versions of services. Finally, the payload makes it to
Oracle SOA Suite and goes on from there to call other
systems. Since the focus here is on Oracle SOA Suite, a
few example issues follow.
A custom ANT script is used to iterate through a list of
composites and deploy them one at a time. After the 66th
composite an OutOfMemory: PermGen error is thrown;
an odd but repeatable error. A much more common error
is: “Unable to access endpoint…” This error can have
many explanations, from a simple timeout to a security
issue such as an invalid certificate. Not knowing how to
diagnose the source of these symptoms will slow down
even the most senior developers and administrators.
TECHNICAL DISCUSSIONS AND EXAMPLES
Before learning how to solve these problems, it is a good
idea to step back and acknowledge that troubleshooting is
an art. Like any other art, it is part skill and part
knowledge. As for skill, each person has a certain level of
natural inclination towards solving problems, some being
better than others. Much of it comes down to having a very
methodical and scientific approach. The other half,
knowledge, refers to a person’s intimacy with the product.
Unless someone is able to deduce the topology of a system
without ever having used it, some time needs to be spent
working with and understanding the various subsystems of
the product. There are many resources for understanding
how SOA Suite works and how to fix errors.
Many people, not having an answer to an issue, will
immediately jump onto the Internet and perform a series
of queries on their favorite search engine. This can lead to
various blogs and even some Oracle-specific resources,
such as the OTN discussion forums. This is often wasted
time, leading to solutions that aren’t related to the problem
at hand. Finding no resolution, many will hop onto the
Oracle support site to search for the existence of a patch.
While none of these options is bad, searches that are not
properly directed can be very time-consuming,
frustrating, and wasteful. The Internet should not be the
only resource used. In fact, one’s own reasoning and
knowledge should also be resources for determining the
issue at hand. Ideally, once the source of the issue is
tracked down, a resolution is obvious or at least
achievable. If that is not the case, then it’s time to resort
to the aforementioned resources. The company may also
have an error-tracking system and knowledge base of its
own. Also, always remember that talking to coworkers is
useful, since issues have often been previously solved and
forgotten.
The first step in tracking down an error should be to
classify the problem. For the purposes of this paper, let’s
start by placing issues into one of three major categories:
deployment, runtime, and performance. Distinguishing
between these categories may not be obvious at first, but
it becomes easier after encountering a few different types
of problems. Runtime errors stem from an issue in the
logic of the integration; this can be in the actual code or
in the configuration of the server.
In certain cases the problem is specific to a particular
composite. Signs that only one composite is affected are
usually obvious, since the only errors showing are
related to that integration. However, there may also be
issues that affect the entire infrastructure. For now, the
focus is on singular composites and deployment.
The quickest, and usually easiest, issues to troubleshoot are
deployment related. Deployment of a composite is broken
into four phases: cleanup, validation, compilation, and
the deployment itself. The cleanup phase should never fail,
as it only searches for existing packaged integrations and
deletes them if they exist. Validation examines the code
and catches many errors related to bad references and bad
XML. The compilation phase will report further errors
should they arise; if successful, it also packages the
source into a JAR file to prepare for deployment. Finally,
deployment occurs. The deployment process can reveal a
number of issues; however, they may not all be visible from
the deployer’s point of view. Normally that is not a
problem, as most of these issues will be revealed at
runtime. They usually involve the server configuration:
data sources, queues, topics, etc. For example, a process
that polls a database or file folder will simply not
start. The best way to identify the root cause here is to
tail the out logs while performing a deployment.
Commonly, the issue is a bad JNDI name or a directory
that doesn’t exist. Most of these require coordination with
an application administrator, depending on the level of
permissions that the developer has in the particular
environment. Issues that can be determined by the
developers themselves will be discussed with runtime
errors.
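The advice to watch the out logs during a deployment can be partly automated. The following is a minimal Python sketch, not part of the original tooling: it filters log lines for a few keywords that tend to accompany the misconfigurations described above. The keyword list and the sample lines are illustrative assumptions, not actual SOA Suite messages, and the log path is left to the caller.

```python
import re
from typing import Iterable, List

# Hypothetical keywords; adjust them to the messages seen in your environment.
SUSPECT_PATTERNS = [
    re.compile(r"jndi", re.IGNORECASE),
    re.compile(r"no such file or directory", re.IGNORECASE),
    re.compile(r"<Error>"),
]


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that match at least one suspect pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS)
    ]


def scan_log(path: str) -> List[str]:
    """Scan a log file on disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_suspect_lines(handle)


if __name__ == "__main__":
    # Made-up sample lines, used only to demonstrate the filter.
    sample = [
        "<Info> <Deployer> <Composite deployed>",
        "<Error> <Adapter> <Unable to resolve JNDI name>",
        "<Warning> <FileAdapter> <No such file or directory: /data/in>",
    ]
    for hit in find_suspect_lines(sample):
        print(hit)
```

A filter like this only narrows the search; the surrounding log lines still need to be read to find the actual root cause.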
During runtime any number of errors can occur, but not
all of them will be caused by individual composites. Some
of them can be overarching issues that affect multiple
integrations. Similar to deployment issues, runtime issues
may be caused by problems in the code or in server
configurations. Most code-related issues will appear in the
flow trace and will be straightforward to solve. Most
issues, even non-code-related ones, will manifest as an
error in the console, but the root cause will be hidden in
the logs.
In the case of Figure 1, the error is a missing organization.
This is a business fault and should be handled by the
integration code or passed back to the calling application.
Other issues can include errors like: “Cannot insert NULL
into…” These issues may or may not need to be handled
by the integration. Unfortunately, not all errors will
appear in the logs all the time, or the error that does show
is not descriptive enough to determine a resolution
immediately. One such error is the “Unable to access the
following endpoints…” error. Logging can be raised to
various levels to obtain further information. However,
there are many different loggers available, so knowing
which logger to modify can be difficult. The best way to
decide which logger should be modified is to look in the
header of a log message. Finding the right level of logging
can also be difficult, because trace logging can at times be
overly verbose, leading to more time spent sifting through
the noise. One of the best ways to find the right logging
level is to raise it a couple of levels at a time until the
true problem is revealed.
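Reading the logger name out of a log header can be sketched in a few lines of Python. This is an illustration only: it assumes the header has the four angle-bracketed fields seen in Figure 1 (timestamp, severity, logger, message ID), and the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LogHeader(NamedTuple):
    timestamp: str
    severity: str
    logger: str
    message_id: str


# Four consecutive <...> fields at the start of the line (assumed layout).
HEADER_RE = re.compile(r"^<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>")


def parse_header(line: str) -> Optional[LogHeader]:
    """Return the header fields, or None if the line has no such header."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return LogHeader(*(field.strip() for field in match.groups()))


if __name__ == "__main__":
    line = ("<Aug 6, 2011 10:10:33 AM EDT> <Error> "
            "<oracle.soa.mediator.serviceEngine> <BEA-000000>")
    header = parse_header(line)
    if header is not None:
        # The third field names the logger whose level could be raised.
        print(header.logger)
```

Collecting the logger field across all error lines gives a short list of candidate loggers, which avoids raising the level on loggers that are not involved.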
There are many signs that there is a problem with the
performance of a system. Some of those signs are:
• The Oracle Enterprise Manager Fusion
Middleware Control is abnormally slow.
• The completion time of composites is increased
consistently across the board.
• The size of the dehydration store is growing
rapidly.
• A large number of errors are appearing in the logs.
<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000>
<Got an exception:
oracle.fabric.common.FabricInvocationException:
javax.xml.ws.soap.SOAPFaultException:
Message: Organization 129024 not found. Stack trace: at
Core.WebServices.Message.MessageWebService.SaveNotification(Organization organization, Notification notification)
in c:Data1.0CoreMessageMessageWebService.svc.cs:line 100,
detail=javax.xml.ws.soap.SOAPFaultException:
Figure 1: Business Fault
If the server is experiencing any of the issues listed
above, there is likely a performance problem. There are a
number of places to look to track down the root cause.
First, check whether there is enough available space on
the hard drives. A lack of space can result in drastic
performance reductions. Second, be sure to check the
processor, memory, and I/O statistics with a tool like
vmstat to help narrow down exactly which process is
hogging resources on the [virtual] machine. Other factors
in performance can be the number of open files and the
number of running processes. A runaway integration can
consume all file descriptors, thereby degrading
performance across the rest of the system. If issues like
this arise, it is often a good idea in development to clear
the logs and restart WebLogic while watching the logs for
any errors that may be a precursor to the “Too many open
files” error. If nothing specific to SOA Suite is found,
check the other applications that are running, and be sure
to check the OS logs (/var/log/messages). While errors
can be a common reason for a slow environment, there
could be other issues playing a role.
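The first two checks above, free disk space and open file descriptors, can be approximated with the Python standard library. This is a rough sketch under stated assumptions: the descriptor count relies on the Linux /proc filesystem, the function names are hypothetical, and the soft limit reported is that of the process running the script, not necessarily that of the server process.

```python
import os
import shutil
from typing import Optional


def free_space_percent(path: str = os.sep) -> float:
    """Percentage of free space on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.free / usage.total


def open_fd_count(pid: Optional[int] = None) -> Optional[int]:
    """Count open file descriptors via /proc (Linux only).

    Returns None when /proc is unavailable or the process is not readable.
    """
    target = "self" if pid is None else str(pid)
    try:
        return len(os.listdir(f"/proc/{target}/fd"))
    except OSError:
        return None


def fd_soft_limit() -> Optional[int]:
    """Soft limit on open files for the current process, if available."""
    try:
        import resource  # not available on every platform
    except ImportError:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    print(f"free space: {free_space_percent():.1f}%")
    print(f"open descriptors: {open_fd_count()}")
    print(f"soft limit: {fd_soft_limit()}")
```

Passing the process ID of the server JVM to open_fd_count and sampling it periodically would show whether descriptors are climbing steadily, which is the pattern a runaway integration tends to produce.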
Only a tuned JVM will deliver the kind of performance
demanded by a production-level environment; this is
especially true when high volumes of transactions are
passing through the environment. If the application server
is not already running on the JRockit JVM, switching to it
is highly recommended. Speed increases can be realized
with little configuration. Once JRockit is running, a
number of tools that come with the JVM, such as the
JRockit Flight Recorder (JFR), can be used to further
tune the instance as necessary. As of this writing, the
HotSpot and JRockit JVMs are planned to ship as one
product with the release of JDK 8, which means the
benefits of JRockit will be realized within that JVM.
Tuning is not the only useful way to work with the JVM’s
configuration settings; the JVM can also provide
additional diagnostic information. Performing a heap dump
when a memory error occurs is one such way. The JVM is
not the only part that should be monitored.
Data sources are another critical component that should
be monitored in the case of performance issues. It is
possible that the connection pool has been saturated with
connections and is causing a bottleneck. If there is
consistently an issue with a particular connection pool,
involve a DBA to help understand why the pool may be
filling up. There may be some SQL tuning that can be
done so that queries and procedures run more efficiently,
shortening the time that connections are held.
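As a rough illustration of what "saturated" can mean in practice, the Python sketch below flags pools whose utilization crosses a threshold or that have requests waiting for a connection. The field names, the sample figures, and the 90% threshold are all assumptions made for this example; real values would come from whatever monitoring the environment provides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PoolSample:
    name: str      # data source name (hypothetical)
    active: int    # connections currently in use
    capacity: int  # maximum connections the pool allows
    waiting: int   # requests waiting for a connection


def saturated_pools(samples: Iterable[PoolSample],
                    threshold: float = 0.9) -> List[str]:
    """Return the names of pools that look saturated in this sample."""
    flagged = []
    for sample in samples:
        # Treat a pool with no capacity as fully utilized.
        if sample.capacity > 0:
            utilization = sample.active / sample.capacity
        else:
            utilization = 1.0
        if utilization >= threshold or sample.waiting > 0:
            flagged.append(sample.name)
    return flagged


if __name__ == "__main__":
    # Made-up numbers for demonstration only.
    samples = [
        PoolSample("ordersDS", active=4, capacity=20, waiting=0),
        PoolSample("billingDS", active=20, capacity=20, waiting=3),
    ]
    print(saturated_pools(samples))
```

A single sample proves little; a pool that is flagged consistently over time is the case worth taking to a DBA.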
In the end, this paper can only gloss over the very
complex art that is troubleshooting. Many variables can
come into play when determining the cause, such as
security considerations, operating system, hardware, etc.
Most issues that arise can be narrowed into runtime or
infrastructure errors, performance issues, and deployment
issues. Targeting the category allows focus on where the
true cause of the issue lies. For deployment issues, it is
good to have an understanding of the overall deployment
process. Also, knowing the purpose of the adf-config.xml
file can provide insight into how the MDS is referenced,
along with other important deployment-related information.
When dealing with errors, determining whether there is a
code-specific issue or a system-wide issue can prevent
many long hours of looking in the wrong place. Modifying
logging levels can assist in this and allow for drilling into
the true cause of the issue.
APPENDICES
JVM Performance Tuning Documentation
http://docs.oracle.com/cd/E23943_01/web.1111/e13814.pdf
http://docs.oracle.com/cd/E15289_01/doc.40/e15060.pdf
Location of out.err (Used for deployment errors)
Unix/Linux:
/tmp/out.err
Microsoft Windows:
C:\Users\<user>\AppData\Local\Temp\out.err
Oracle adf-config.xml Description
http://docs.oracle.com/cd/E15586_01/web.1111/b31974/appendixa.htm#BGBIFEJE
REFERENCES
Splunk. Ensure the availability and performance of your critical
applications using the genius of splunk. Retrieved from
http://www.splunk.com/web_assets/pdfs/secure/Troubleshooting_Critical_Applications.pdf

More Related Content

What's hot (10)

PDF
jBPM5 Community Training Module #5: Domain Specific Processes
Mauricio (Salaboy) Salatino
 
PDF
jBPM5 Community Training Module 4: jBPM5 APIs Overview + Hands On
Mauricio (Salaboy) Salatino
 
PDF
Attribute based access control
Elimity
 
PDF
Process automation report
Marc Gourvenec
 
PPT
Ca Service Desk Demo Scenarios
Emirates Computers
 
KEY
Event Driven Architecture
Chris Patterson
 
PPS
ITIL Service Desk Tools
ahmedshama
 
PDF
jBPM Community Training #2: The BPM Practice
Mauricio (Salaboy) Salatino
 
PDF
Automatic Proactive Troubleshooting with IBM Rational Build Forge
Bill Duncan
 
PPT
Week8 Topic1 Translate Business Needs Into Technical Requirements
hapy
 
jBPM5 Community Training Module #5: Domain Specific Processes
Mauricio (Salaboy) Salatino
 
jBPM5 Community Training Module 4: jBPM5 APIs Overview + Hands On
Mauricio (Salaboy) Salatino
 
Attribute based access control
Elimity
 
Process automation report
Marc Gourvenec
 
Ca Service Desk Demo Scenarios
Emirates Computers
 
Event Driven Architecture
Chris Patterson
 
ITIL Service Desk Tools
ahmedshama
 
jBPM Community Training #2: The BPM Practice
Mauricio (Salaboy) Salatino
 
Automatic Proactive Troubleshooting with IBM Rational Build Forge
Bill Duncan
 
Week8 Topic1 Translate Business Needs Into Technical Requirements
hapy
 

Similar to Oracle SOA Suite 11g Troubleshooting Methodology (whitepaper) (20)

PDF
Top 5 performance problems in .net applications application performance mon...
KennaaTol
 
PPTX
DevOps - Continuous Integration, Continuous Delivery - let's talk
D Z
 
PPTX
Metric Abuse: Frequently Misused Metrics in Oracle
Steve Karam
 
PPT
Top 30 Scalability Mistakes
John Coggeshall
 
PPTX
The "Evils" of Optimization
BlackRabbitCoder
 
PDF
Effective Bug Tracking Systems: Theories and Implementation
IOSR Journals
 
PDF
Software Development Standard Operating Procedure
rupeshchanchal
 
PDF
WoMakersCode 2016 - Shit Happens
Jackson F. de A. Mafra
 
PPT
Defect MgmtBugDay Bangkok 2009: Defect Management
guest476528
 
PPTX
01 fundamentals of testing
Ilham Wahyudi
 
PDF
dist_systems.pdf
CherenetToma
 
PPT
Why test with flex unit
michael.labriola
 
PDF
Easy & Step-By-Step Ways of Finding Bugs in Software.pdf
Steve Wortham
 
PDF
Five Common Angular Mistakes
Backand Cohen
 
PPT
Teamwork Presentation
Pietro Polsinelli
 
PDF
Multithreading 101
Tim Penhey
 
PPT
Top 10 Scalability Mistakes
John Coggeshall
 
PDF
An ideal static analyzer, or why ideals are unachievable
PVS-Studio
 
PDF
0136 ideal static_analyzer
PVS-Studio
 
PDF
The Testing Planet Issue 4
Rosie Sherry
 
Top 5 performance problems in .net applications application performance mon...
KennaaTol
 
DevOps - Continuous Integration, Continuous Delivery - let's talk
D Z
 
Metric Abuse: Frequently Misused Metrics in Oracle
Steve Karam
 
Top 30 Scalability Mistakes
John Coggeshall
 
The "Evils" of Optimization
BlackRabbitCoder
 
Effective Bug Tracking Systems: Theories and Implementation
IOSR Journals
 
Software Development Standard Operating Procedure
rupeshchanchal
 
WoMakersCode 2016 - Shit Happens
Jackson F. de A. Mafra
 
Defect MgmtBugDay Bangkok 2009: Defect Management
guest476528
 
01 fundamentals of testing
Ilham Wahyudi
 
dist_systems.pdf
CherenetToma
 
Why test with flex unit
michael.labriola
 
Easy & Step-By-Step Ways of Finding Bugs in Software.pdf
Steve Wortham
 
Five Common Angular Mistakes
Backand Cohen
 
Teamwork Presentation
Pietro Polsinelli
 
Multithreading 101
Tim Penhey
 
Top 10 Scalability Mistakes
John Coggeshall
 
An ideal static analyzer, or why ideals are unachievable
PVS-Studio
 
0136 ideal static_analyzer
PVS-Studio
 
The Testing Planet Issue 4
Rosie Sherry
 
Ad

More from Revelation Technologies (20)

PDF
Operating System Security in the Cloud
Revelation Technologies
 
PDF
Getting Started with Terraform
Revelation Technologies
 
PDF
Getting Started with API Management
Revelation Technologies
 
PDF
Automating Cloud Operations: Everything You Wanted to Know about cURL and REST
Revelation Technologies
 
PDF
Getting Started with API Management – Why It's Needed On-prem and in the Cloud
Revelation Technologies
 
PDF
Automating Cloud Operations - Everything you wanted to know about cURL and RE...
Revelation Technologies
 
PDF
Introducing the Oracle Cloud Infrastructure (OCI) Best Practices Framework
Revelation Technologies
 
PDF
Everything You Need to Know About the Microsoft Azure and Oracle Cloud Interc...
Revelation Technologies
 
PDF
PTK Issue 72: Delivering a Platform on Demand
Revelation Technologies
 
PDF
PTK Issue 71: The Compute Cloud Performance Showdown
Revelation Technologies
 
PDF
Everything You Need to Know About the Microsoft Azure and Oracle Cloud Interc...
Revelation Technologies
 
PDF
Compute Cloud Performance Showdown: 18 Months Later (OCI, AWS, IBM Cloud, GCP...
Revelation Technologies
 
PDF
Compute Cloud Performance Showdown: 18 Months Later (OCI, AWS, IBM Cloud, GCP...
Revelation Technologies
 
PDF
The Microsoft Azure and Oracle Cloud Interconnect Everything You Need to Know
Revelation Technologies
 
PDF
Cloud Integration Strategy
Revelation Technologies
 
PDF
Compute Cloud Performance Showdown: Amazon Web Services, Oracle Cloud, IBM ...
Revelation Technologies
 
PDF
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Revelation Technologies
 
PDF
Hands-On with Oracle SOA Cloud Service
Revelation Technologies
 
PDF
Oracle BPM Suite Development: Getting Started
Revelation Technologies
 
PDF
Developing Web Services from Scratch - For DBAs and Database Developers
Revelation Technologies
 
Operating System Security in the Cloud
Revelation Technologies
 
Getting Started with Terraform
Revelation Technologies
 
Getting Started with API Management
Revelation Technologies
 
Automating Cloud Operations: Everything You Wanted to Know about cURL and REST
Revelation Technologies
 
Getting Started with API Management – Why It's Needed On-prem and in the Cloud
Revelation Technologies
 
Automating Cloud Operations - Everything you wanted to know about cURL and RE...
Revelation Technologies
 
Introducing the Oracle Cloud Infrastructure (OCI) Best Practices Framework
Revelation Technologies
 
Everything You Need to Know About the Microsoft Azure and Oracle Cloud Interc...
Revelation Technologies
 
PTK Issue 72: Delivering a Platform on Demand
Revelation Technologies
 
PTK Issue 71: The Compute Cloud Performance Showdown
Revelation Technologies
 
Everything You Need to Know About the Microsoft Azure and Oracle Cloud Interc...
Revelation Technologies
 
Compute Cloud Performance Showdown: 18 Months Later (OCI, AWS, IBM Cloud, GCP...
Revelation Technologies
 
Compute Cloud Performance Showdown: 18 Months Later (OCI, AWS, IBM Cloud, GCP...
Revelation Technologies
 
The Microsoft Azure and Oracle Cloud Interconnect Everything You Need to Know
Revelation Technologies
 
Cloud Integration Strategy
Revelation Technologies
 
Compute Cloud Performance Showdown: Amazon Web Services, Oracle Cloud, IBM ...
Revelation Technologies
 
Securing your Oracle Fusion Middleware Environment, On-Prem and in the Cloud
Revelation Technologies
 
Hands-On with Oracle SOA Cloud Service
Revelation Technologies
 
Oracle BPM Suite Development: Getting Started
Revelation Technologies
 
Developing Web Services from Scratch - For DBAs and Database Developers
Revelation Technologies
 
Ad

Recently uploaded (20)

PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
PDF
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 

Oracle SOA Suite 11g Troubleshooting Methodology (whitepaper)

  • 1. Middleware 1 Session 185 ORACLE SOA SUITE 11G TROUBLESHOOTING METHODOLOGY Harold A. Dost III, Raastech, Inc. ABSTRACT Most troubleshooting guides simply list out solutions to common errors. This paper introduces a troubleshooting methodology surrounding performance, composite instances, deployment, and logging. The goal is to better equip the reader with the ability to solve most problems as they pertain to the SOA infrastructure and its executed transactions. As well as, learn where to look, what to look for, and what do afterward. TARGET AUDIENCE This is intended for every Oracle SOA Suite 11g developer and administrator should read. EXECUTIVE SUMMARY There is no guarantee that every error is an easy fix away. However, this paper will provide the reader with a better understanding of where to look for errors, how to categorize them, and deal with them in an appropriate manner. With the tools explained later on, quicker resolutions may be achieved which will produce more efficient development and better support. BACKGROUND According to Splunk one of their clients, Macy’s, noted that tracking down the exact cause of a problem could be “exceedingly difficult.” It often required a team comprised of members from various IT functional areas to fix these problems. Even with these teams, resolutions still took days. In the past when an issue presented itself it was always the network admins to be blamed. Over time technology has improved in that area, therefore network issues at most companies are few and far between by comparison. Everything is connected higher in the stack through various integration servers and technologies. The blame is often shifted to the integration team, but much like people blaming a browser for a bad Internet connection, this can often be misdirected. Customers hold a company responsible to maintain near- continuous reliable services and by transitivity the integration team. 
This puts a lot of pressure onto the integration team to quickly determine if the error is something within their realm or if it needs a different group’s attention. One of the biggest pains with tracking down issues in middleware is that it is composed of so many layers. For example, a web application might make a call. The payload first goes through Oracle Enterprise Gateway (OEG), this is because it is going from the Internet to the company intranet. Then the company uses Oracle Service Bus (OSB) for all internal service calls to abstract naming and versions of services. Finally, the payload makes it to Oracle SOA Suite and it goes on from there to call other systems. Since the focus is on Oracle SOA Suite, below are a few issues. A custom ANT script is used to iterate through a list of composites and deploys them one at a time. After the 66th composite an OutofMemeory:PermGen error is thrown; an odd but repeatable error. A much more common error is: “Unable to access endpoint…” This error can have many explanations from a simple timeout, to a security issue such as an invalid certificate. Without knowing how to diagnose the source of these symptoms will slow down even the most senior developers and administrators. TECHNICAL DISCUSSIONS AND EXAMPLES Before learning how to solve these problems, it is first a good idea to step back and acknowledge that troubleshooting problems is an art. Like any other art it is part skill and part knowledge. For skill there each person has a certain level of natural inclination towards solving problems, some being better than others. Much of it deals with having a very methodical and scientific approach. The other half, knowledge, refers to a person’s intimacy with the product. Unless someone has the ability to deduce the topology of a system without ever using it is at hand, there needs to be some time spent working with and understanding the various subsystems of a product. 
To understand how SOA Suite works and how to fix errors there are many resources. Many people, not having an answer to an issue, will immediately jump onto the Internet and perform a series of queries on their favorite search engine. This can lead to various blogs and even some Oracle specific resources,
  • 2. Middleware 2 Session 185 such as the OTN discussion forums. This is often wasted time leading to solutions that aren’t related to the problem at hand. Finding no resolution, many will hop onto the Oracle support site to search for the existence of a patch. While none of these options are bad, if unable to properly direct searches this can be very time consuming, frustrating, and wasteful. The Internet should not be the only resource used. In fact, one’s brainstorming and knowledge should also be a resource on how to determine the issue at hand. Ideally once the source of the issue is tracked a resolution is obvious or at least achievable. If that is not the case then it’s time to resort to the aforementioned resources. The company may also have an error tracking and knowledge base of its own. Also, always remember talking to coworkers is useful, since often issues have been previously solved and forgotten. The first step in tracking down the error should be to classify the problem. For purposes of this paper, lets start by placing the issues into one of three major categories: deployment, runtime, and performance. Distinguishing between these categories may not be at first obvious, but after encountering a few different types of problems this will provide a better idea. Runtime errors are going to be an issue in the logic of integration; this can be actual code or configuration in the server. In certain cases the problem would be specific to a particular composite. Signs that only a composite is affected are usually obvious since the only errors showing related to that integration. However, there may also be issues that affect the entire infrastructure. For now, the focus is on singular composites and deployment. The quickest, and usually easiest, issues to troubleshoot are deployment related. Deployment of a composite is broken into different phases: cleanup, validation, compilation, and the deployment. 
The cleanup phase should never fail as it searches for existing packaged integrations and deletes them if they exist. Validation examines the code, and many errors related to bad references and XML. The compilation phase will provide further errors should they arise, but if successful this also packages the source into a JAR file to prepare for deployment. Finally, deployment occurs. The deployment process will reveal a number of issues, however they may not all be displayed from the deployer’s point of view. Normally that is not a problem, as most of the issues will be revealed at runtime. These issues are usually with the server configuration: data sources, queues, topics, etc. When dealing with a process that polls a database or file folder the processes will simply not start. The best way to identify the root cause here is to tail the out logs while performing a deployment. Commonly, the issue is a bad JNDI name or a directory that doesn’t exist. Most of these require coordination with an application administrator depending on the level of permissions that the developer has in the particular environment. Issues that can be determined by the developers themselves will be discussed with runtime errors. During runtime any number of errors can occur, but not all of them will be caused by individual composites. Some of them can be overarching issues that affect multiple integrations. Similar to deployment issues, runtime issues may be caused by problems in the code or in server configurations. Most code related issues will appear in the flow trace and will be obvious to solve. Most issues, even non-code related, will manifest as an error in the console but the root cause will be hidden in the logs. In the case of Figure 1, the error is a missing organization. This is a business fault and should be handled by the integration code or passed back to the calling application. 
Other issues can include errors like "Cannot insert NULL into…". These may or may not need to be handled by the integration.

Unfortunately, not all errors will appear in the logs all the time, and sometimes the error that does show is not descriptive enough to determine a resolution immediately. One such error is the "Unable to access the following endpoints…" error. Logging levels can be increased to obtain further information. However, there are many different loggers available, so knowing which logger to modify can be difficult. The best way to decide is to look at the header of a log message, which names the logger that produced it (in Figure 1, oracle.soa.mediator.serviceEngine). Next, finding the right level of logging can be difficult, because trace logging can at times be overly verbose, leading to more time spent sifting through the noise. One of the best ways to find the right level is to increase it a couple of levels at a time until the true problem is revealed.

There are many signs that a system has a performance problem. Some of those signs are:

• The Oracle Enterprise Manager Fusion Middleware Control is abnormally slow.
• The completion time of composites has increased consistently across the board.
• The size of the dehydration store is growing rapidly.
• A large number of errors are appearing in the logs.

<Aug 6, 2011 10:10:33 AM EDT> <Error> <oracle.soa.mediator.serviceEngine> <BEA-000000> <Got an exception: oracle.fabric.common.FabricInvocationException: javax.xml.ws.soap.SOAPFaultException: Message: Organization 129024 not found. Stack trace: at Core.WebServices.Message.MessageWebService.SaveNotification(Organization organization, Notification notification) in c:Data1.0CoreMessageMessageWebService.svc.cs:line 100, detail=javax.xml.ws.soap.SOAPFaultException:
Figure 1: Business Fault
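To illustrate how the header of a log message points to the logger worth adjusting, the following is a minimal Python sketch. It assumes log lines follow the angle-bracketed layout seen in Figure 1 (timestamp, severity, logger, message ID). The names parse_header and loggers_with_errors are hypothetical helpers written for this illustration and are not part of any Oracle tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from Figure 1:
#   <timestamp> <severity> <logger> <message id> <message text...
HEADER = re.compile(r"\s*<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>")


def parse_header(line):
    """Return the four header fields of a log line, or None if it does not match."""
    match = HEADER.match(line)
    if match is None:
        return None
    timestamp, severity, logger, message_id = (f.strip() for f in match.groups())
    return {
        "timestamp": timestamp,
        "severity": severity,
        "logger": logger,
        "message_id": message_id,
    }


def loggers_with_errors(lines, severities=("Error",)):
    """Count how often each logger appears on lines with the given severities."""
    counts = Counter()
    for line in lines:
        header = parse_header(line)
        if header is not None and header["severity"] in severities:
            counts[header["logger"]] += 1
    return counts


if __name__ == "__main__":
    # Header copied from Figure 1; the message text is shortened here.
    sample = (
        "<Aug 6, 2011 10:10:33 AM EDT> <Error> "
        "<oracle.soa.mediator.serviceEngine> <BEA-000000> "
        "<Got an exception: oracle.fabric.common.FabricInvocationException: ..."
    )
    for name, count in loggers_with_errors([sample]).most_common():
        print(name, count)
```

Stack trace continuation lines do not begin with a bracketed header, so the sketch skips them. The logger that appears most often on error lines is a reasonable first candidate for a higher logging level.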
Knowing that the server is experiencing any of the issues listed above means there is likely a performance issue. There are a number of places to look to track down the root cause. First, check whether there is enough available space on the hard drives. A lack of space can result in drastic performance reductions. Second, be sure to check the processor, memory, and I/O statistics with a tool like vmstat to help narrow down which resource is under pressure and, from there, which process is hogging it on the [virtual] machine.

Other factors in performance can be the number of open files and the number of running processes. A runaway integration can consume all file descriptors, thereby degrading performance across the rest of the system. If issues like this arise, it is often a good idea in development to clear the logs and restart WebLogic while watching the logs for any errors that may be a precursor to the "Too many open files" error. If nothing specific to SOA Suite is found, check the other applications running, and be sure to check the OS logs (/var/log/messages).

While errors can be a common reason for a slow environment, other issues could be playing a role. Only a tuned JVM will give the kind of performance demanded by a production-level environment; this is especially true when high volumes of transactions pass through the environment. If the application server is not already running on the JRockit JVM, switching to it is highly recommended. Speed increases can be realized with little configuration. Moreover, once JRockit is running, a number of tools that come with the JVM, such as the JRockit Flight Recorder (JFR), can be used to further tune the instance as necessary. As of this writing, the HotSpot and JRockit JVMs are slated to ship as one product with the release of JDK 8, which means the benefits of JRockit will be realized within that JVM. Tuning is not the only reason to work directly with the JVM's configuration settings.
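The operating-system checks described above (free disk space and open file descriptors) can also be scripted. The following is a minimal Python sketch under stated assumptions: it reads file descriptor counts from the Linux /proc filesystem, and the function names and thresholds (min_free, max_fds) are hypothetical values chosen for illustration, not recommendations.

```python
import os
import shutil


def disk_free_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def open_fd_count(pid=None):
    """Number of open file descriptors for a process, or None if it cannot be read.

    Relies on the Linux /proc filesystem; reading another user's process
    usually requires elevated permissions.
    """
    pid = os.getpid() if pid is None else pid
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def check_host(path="/", pid=None, min_free=0.10, max_fds=1024):
    """Return a list of human-readable warnings; an empty list means no finding.

    min_free and max_fds are illustrative thresholds, not recommended values.
    """
    warnings = []
    free = disk_free_fraction(path)
    if free < min_free:
        warnings.append(f"only {free:.0%} of {path} is free")
    fds = open_fd_count(pid)
    if fds is not None and fds > max_fds:
        warnings.append(f"process has {fds} open file descriptors")
    return warnings


if __name__ == "__main__":
    for warning in check_host():
        print("WARNING:", warning)
```

Passing the process ID of the application server as pid would report on that process rather than on the script itself. A sensible max_fds depends on the descriptor limit configured for that process.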
Additional information can be provided by the JVM as well. Performing a heap dump when a memory error occurs is one way to obtain it.

The JVM is not the only part that should be monitored. Data sources are another critical component to monitor when there are performance issues. It is possible that the connection pool has been saturated and is causing a bottleneck. If there is consistently an issue with a particular connection pool, involve a DBA to help understand why the pool is filling up. There may be some SQL tuning that can be done so that queries and procedures run more efficiently, shortening the time connections are held.

In the end, even this paper can only gloss over the very complex art that is troubleshooting. Many variables can come into play in determining the cause, such as security considerations, operating system, hardware, etc. Most issues that arise can be narrowed into runtime or infrastructure errors, performance issues, and deployment issues. Targeting the category allows focus on where the true cause of the issue lies. For deployment issues, it is good to have an understanding of the overall deployment process. Also, knowing the purpose of adf-config.xml can provide insight into how the MDS is referenced and other important deployment-related information. When dealing with errors, determining whether there is a code-specific issue or a system-wide issue can prevent many long hours of looking in the wrong place. Modifying logging levels can assist in this and allow for drilling into the true cause of the issue.
APPENDICES
JVM Performance Tuning Documentation
http://docs.oracle.com/cd/E23943_01/web.1111/e13814.pdf
http://docs.oracle.com/cd/E15289_01/doc.40/e15060.pdf
Location of out.err (Used for deployment errors)
Unix/Linux: /tmp/out.err
Microsoft Windows: C:\Users\<user>\AppData\Local\Temp\out.err
Oracle adf-config.xml Description
http://docs.oracle.com/cd/E15586_01/web.1111/b31974/appendixa.htm#BGBIFEJE
REFERENCES
Splunk. Ensure the availability and performance of your critical applications using the genius of splunk. Retrieved from http://www.splunk.com/web_assets/pdfs/secure/Troubleshooting_Critical_Applications.pdf