Dishonored 2 rendering engine
architecture overview

Jérémy Virga, Arkane Studios (Lyon / France)
Intro
Why are we doing this? (Other than because programmers want to have fun…)
Where  we  come  from

•  Only one thread generating GPU commands (mono context)
•  Synchronization point to exchange data
•  Often leaves underused CPU cores

[Timeline: MainThread runs Game logic (N) | sync | Game logic (N+1) | sync | Game logic (N+2) | …
 RenderThread runs Rendering (N-1) | Rendering (N) | Rendering (N+1) | …, one frame behind]
Where  we  want  to  go

•  Sequential but spread on all cores
•  Removes an extra frame of latency & data copies
•  Requires both game logic and rendering to be completely multithreaded

[Timeline: MainThread runs … | Game logic (N) | RenderLoop (task creation & submit) (N) | …
 Workers 1-5 each run a stream of tasks in parallel within the frame; Worker5 also runs a background task]
Where  we  want  to  go

Let's focus on the rendering part…

[Same timeline as above: the MainThread's RenderLoop (task creation & submit) and the tasks it spreads across Workers 1-5]
Where  we  want  to  go  (2)  :  Scheduling

[Timeline: MainThread runs … | RenderLoop (N) | …
 Worker1: Shadow 1, Post-FX; Worker2: Opaque, Shadow 2; Worker3: Z-prepass, Alpha
 GPU (N-?): … | Shadow 1 | Shadow 2 | Z Prepass | Opaque | Alpha | Post-FX | …
 Legend: tasks generating GPU commands, "pure-cpu" tasks (no graphic context), GPU execution]

•  CPU execution ordering to solve read/write data dependencies
•  Multiple GPU contexts to generate GPU commands from any thread in parallel
•  Each non-"pure-cpu" task generates a « local » command buffer
•  Fine scheduling control: CPU execution (command recording) should be driven independently from GPU frame
order requirements (command replay)
VoidEngine  &  Dishonored  2  rendering  facts

•  Environments/architecture created from small blocks + more detail than the previous game
Ø Lots of objects to process, thousands of draw calls
•  Instanced batch draws
Ø Batches generated based on visibility results from Umbra
•  Dedicated culling for shadow casters
•  Lots of shadow-casting lights, all dynamic with a cache system, nothing pre-baked
•  Several passes
•  Z prepass for opaque, separate Z prepass for alpha,
tiled forward shading with PBR, extra passes dedicated to effects, etc…
•  In-house indirect lighting system & IBL cubemap network
•  …
VoidEngine  &  Dishonored  2  rendering  facts

•  Lots of « almost-independent » passes, lots of data, lots of work.
•  It became a mess to organize, and multithreading will make it worse.
•  Beyond performance, the ideal architecture requirements are:
•  User-friendly: easy to set up and insert new work.
•  Readable: easy to understand & follow the frame sequence
•  Modular: easy to organize/rearrange/split/remove passes & work

…but you know the world is not ideal. So let's try to make it not too ugly ;)
Split  the  renderLoop
Rendering task setup
Rendering  task  setup:  dependencies

•  Two distinct dependency kinds
•  « CPU dependencies » for execution scheduling / data synchronization
•  -> most critical for performance; they can create « holes » in the execution timeline
•  « GPU dependencies » for submission ordering
•  -> very small performance impact in general; required for a consistent GPU frame but
doesn't affect CPU parallelism

[Diagram: with "CPU" dependencies, Task A -> Task B -> Task C execute in sequence; with "GPU" dependencies, Task A, Task B and Task C run in parallel and only their submissions are ordered]
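As a rough illustration of the two dependency kinds, a task declaration could look something like the C++ sketch below. RenderTask, TaskId and the AddCpuDependency/AddGpuDependency calls are assumed names for illustration, not the actual VoidEngine interface.

// Hypothetical sketch of declaring the two dependency kinds (assumed names).
#include <vector>

struct TaskId { int value; };

struct RenderTask {
    const char*         name;
    std::vector<TaskId> cpuDeps;   // execution ordering: resolves read/write data hazards on the CPU
    std::vector<TaskId> gpuDeps;   // submission ordering only: keeps the GPU frame consistent

    void AddCpuDependency(TaskId t) { cpuDeps.push_back(t); }  // can create "holes" in the CPU timeline
    void AddGpuDependency(TaskId t) { gpuDeps.push_back(t); }  // cheap: does not constrain CPU parallelism
};

// Example: the opaque pass can be *recorded* on any worker at any time,
// but must be *submitted* after the Z prepass.
void SetupOpaquePass(RenderTask& opaque, TaskId lightSorting, TaskId zPrepass)
{
    opaque.AddCpuDependency(lightSorting);  // reads the sorted light list -> real CPU ordering
    opaque.AddGpuDependency(zPrepass);      // only constrains command-buffer replay order
}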
  
Rendering  task  setup:  in/out

•  Explicit input/output declarations
•  object lists
•  render targets read/write
•  buffers
•  random user data
•  etc…

[Diagram: Task A (RT, RT, render list); Task B (RT, buffer); Task C (RT, user data)]
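Continuing the same hypothetical sketch, the explicit input/output declarations could be expressed roughly like this; ResourceKind, TaskIo and DeclareInput/DeclareOutput are assumptions, not the engine's real types.

// Hypothetical explicit in/out declarations (assumed names, for illustration only).
#include <vector>

enum class ResourceKind { RenderTarget, Buffer, ObjectList, UserData };

struct Resource { ResourceKind kind; const char* name; };

struct TaskIo {
    std::vector<Resource> inputs;
    std::vector<Resource> outputs;
    void DeclareInput (const Resource& r) { inputs.push_back(r); }
    void DeclareOutput(const Resource& r) { outputs.push_back(r); }
};

// Task B from the diagram above: reads a render target, writes a buffer.
void SetupTaskB(TaskIo& io)
{
    io.DeclareInput ({ResourceKind::RenderTarget, "sceneDepth"});
    io.DeclareOutput({ResourceKind::Buffer,       "lightGrid"});
}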
  
Rendering  task  setup  (2):  chaining

•  Explicit task chaining: an input can come from another task's output
•  used for automatic dependency checking
•  Helps readability & code maintenance. A task can be removed with limited
code modification.

[Diagram: Task A (RT 0 out) -> Task B (RT 0 from A in, RT 1 out) -> Task C (RT 1 from B in)]
Rendering  task  setup  (2):  chaining

•  Skipped condition
•  a task can be skipped by the runtime depending on the execution context (skipped
effect, etc…)
Ø the scheduler will automatically fix the chaining

[Diagram: with Task B (RT 0 from A in, RT 1 out) skipped, Task C's input is rewired from "RT 1 from B" to "RT 0 from A"]
Rendering  task  setup  (3):  advanced  options

•  « Background » task: low priority, the render loop doesn't wait for it at the
end of the frame
•  Submitted on a following frame if not ready
•  « Forced immediate » task: actually executed inline during submission
•  Uses the main "immediate" graphic context
•  To work around graphics middleware or platform-specific API limitations
•  Keeps frame ordering consistency
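These two options could plausibly be exposed as simple per-task flags; the names below are illustrative only, not the engine's actual interface.

// Hypothetical scheduling flags for the options above (assumed names).
enum TaskFlags : unsigned {
    TASK_DEFAULT          = 0,
    TASK_BACKGROUND       = 1u << 0,  // low priority; end of frame does not wait, submitted on a later frame if not ready
    TASK_FORCED_IMMEDIATE = 1u << 1,  // executed inline on the main "immediate" graphic context during submission
};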
  
Spreading  the  work:  Additional  helpers

•  Supports spawning new tasks from another one
•  -> multiple-producer, multiple-consumer scheduling
•  RunAsync( … );
•  To convert any piece of code into an asynchronous call
•  ParallelFor(…);
•  To split processing across several workers in just one line of code (see the parallelFor sample in the bonus slides)
•  Interface similar to Intel TBB[1], Microsoft PPL[2], …
•  RenderPass
•  Encapsulation of several tasks sharing dependencies and/or inputs.
•  Scheduling still fully flexible at task level
•  E.g. each shadow slice/part is a task, encapsulated into a single shadow pass.
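The deck does not show the helper signatures; as a hedged sketch of how such an interface could be used, here is a small self-contained example where RunAsync is emulated with std::async and RenderPass is a trivial grouping type. Both are assumptions for illustration, not the engine's API.

// Hypothetical sketch of the helper-style API named on this slide.
#include <functional>
#include <future>
#include <vector>

// RunAsync: convert any piece of code into an asynchronous call (std::async used as a stand-in).
std::future<void> RunAsync(std::function<void()> fn)
{
    return std::async(std::launch::async, std::move(fn));
}

// RenderPass: encapsulates several tasks sharing dependencies and/or inputs;
// each task remains individually schedulable.
struct RenderPass {
    const char*                        name;
    std::vector<std::function<void()>> tasks;
    void AddTask(std::function<void()> t) { tasks.push_back(std::move(t)); }
};

int main()
{
    auto lightSort = RunAsync([] { /* sort lights on a worker */ });

    RenderPass shadows{"Shadows", {}};
    for (int slice = 0; slice < 4; ++slice)
        shadows.AddTask([slice] { /* record draw commands for shadow slice 'slice' */ });

    lightSort.wait();
    return 0;
}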
  
Rendering  task  examples

•  Umbra visibility jobs (cpu)
•  Draw-batch gathering/sorting (cpu)
•  Light sorting (cpu)
•  Directional shadow cascade draws (cpu/gpu)
•  Local (point/spot/area light) shadow updates (cpu/gpu)
•  Opaque pass draws (cpu/gpu)
•  Alpha pass draws (cpu/gpu)
•  … etc. ~50-70 tasks currently (~half are cpu/gpu)
Results,  issues  &  guidelines
Results

•  We got ~40-60% of the renderLoop duration saved on the first draft (on 6-8
core hardware)
•  Excellent results on the latest consoles. Still improving with SDK updates
•  We expect the best results on the latest PC APIs (Mantle/DX12/Vulkan)
•  We improved those results significantly by tweaking tasks (see guidelines)
•  we have to do that constantly during game development as things keep moving
•  Multitask overhead VS overall performance
•  Scheduling cost, submission cost
•  Cache misses are easier to trigger (when the cache is shared across cores)
•  You should still get benefits
Issues

•  NOT for every environment:
•  PC D3D11: efforts were made in some recent drivers, but the result depends on the
IHV (independent hardware vendor)
•  From really good improvements to horrible performance loss
•  Can rely on D3D11_FEATURE_DATA_THREADING::DriverCommandLists with recent
drivers
•  We fall back to a hybrid mode when not correctly supported
•  only pure-cpu tasks are parallelized; gpu tasks run on just one worker, with only one graphic
context.
•  Gpu dependencies are converted to cpu ones to keep frame ordering consistency
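The DriverCommandLists capability mentioned above can be queried through the standard D3D11 API; a minimal check could look like this (error handling kept minimal).

// Checking driver support for multithreaded command lists on PC D3D11.
#include <d3d11.h>

bool SupportsDriverCommandLists(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    if (FAILED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                           &threading, sizeof(threading))))
        return false;
    // FALSE means deferred-context command lists are emulated by the runtime:
    // fall back to the hybrid mode described above.
    return threading.DriverCommandLists == TRUE;
}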
  
Issues

•  …it's easy to break rendering with random artifacts that are hard to understand.
•  We developed in-house debug tools & commands
•  Can switch on the fly to single-threaded execution
•  Can display a task's intermediate RT outputs on screen
•  Execution timeline recording
•  Can record the submission ordering of a buggy frame and replay it
•  Dependency graph generation
•  … still evolving
Issue  example:  the  «  renaming  »  case

Update a dynamic GPU resource in task A.
Use it in a command buffer in task B.
•  This doesn't require an extra CPU dependency between them.
From the CPU's point of view, executing update(A) before, during or after binding(B) is completely valid.

[Timeline: MainThread runs … | RenderLoop | …; Worker1 runs A (update); Worker2 runs B (binding)]
Issue  example:  the  «  renaming  »  case

•  On PC D3D11, the driver handles this for you
•  On update, it « renames » the resource = it creates another copy (version)
•  On binding, it adds a « split point » in the command buffer each time the actual
copy version behind a dynamic resource is unknown (= not updated within the same
local command buffer).
•  On submission, it patches all the split points of the command buffer according to
the other preceding submissions
•  -> bad performance overhead!
•  On consoles & new PC APIs: manual management
•  Much more efficient
•  Requires knowing the actual renamed « version » to use in the binding
task (B)
•  -> input/output task chaining gives us that (a sketch follows below)
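As a sketch of what the manual path could look like, the chained output of the update task can simply carry which copy ("version") of the dynamic resource was written, so the binding task never has to guess. DynamicBuffer and BufferView below are illustrative names, not the engine's types.

// Hypothetical sketch: explicit renaming of a dynamic resource, with the chained
// output carrying which "version" (copy) the binding task must use.
#include <array>
#include <cstdint>

struct DynamicBuffer {
    static constexpr int kCopies = 3;              // triple-buffered copies ("renames")
    std::array<std::uint8_t*, kCopies> copies{};
    int currentVersion = 0;
};

struct BufferView { DynamicBuffer* buffer; int version; };  // what task A outputs / task B consumes

// Task A: the update picks the next copy and returns a view identifying it.
BufferView UpdateDynamicBuffer(DynamicBuffer& b /*, source data */)
{
    b.currentVersion = (b.currentVersion + 1) % DynamicBuffer::kCopies;
    // ... fill b.copies[b.currentVersion] with this frame's data ...
    return { &b, b.currentVersion };
}

// Task B: binding uses exactly the version produced by A, known through task chaining,
// so no split points or submission-time patching are needed.
void BindForDraw(const BufferView& view /*, local command buffer */)
{
    // ... bind view.buffer->copies[view.version] in the local command buffer ...
}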
  
	
  
To  be  continued:  guidelines

•  Bench it!
•  Use low-level profiling tools to observe stalls, holes in the timeline,
preemption
•  PC: Microsoft Concurrency Visualizer [3], …
•  Improving the work split / CPU dependencies to prevent holes, improving code
patterns to prevent CPU stalls, etc. will increase the results significantly
•  Be careful not to have too many thread context switches.
•  Tweak the core affinities of your tasks (consoles)
•  Granularity of the split: overhead vs performance gain
To  be  continued:  next  steps

•  Use extra GPU engines (asynchronous compute, DMA, …) to also
improve GPU parallelism – consoles & new PC APIs only
•  Re-use tasks' GPU dependencies to manage GPU queue synchronizations
•  Thinking about a system allowing tasks that generate very small
command buffers to hand them to another task at the end, instead of
registering them for submission directly.
•  -> hard to manage submission ordering correctly
Questions  ?

jvirga@arkane-studios.com
Bonus  slide:  kill  mutexes

•  Mutexes are your nemesis.
•  There is often a more efficient pattern or primitive that avoids using
them.
•  Use a spin lock when you know the lock duration is really small
•  Use lockless queues, etc…
•  Pre-allocate containers and use them with atomic index increments (see the sketch below)
•  Use a read/write mutex when you know there are many more reads than writes
on the data (several concurrent reads allowed, exclusive write)
•  Use thread-local storage in code called concurrently
•  …
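As one concrete example of the "pre-allocated container + atomic index increment" pattern, here is a minimal self-contained sketch; DrawBatch is a placeholder type.

// Pre-allocated container filled concurrently through a single atomic increment, no mutex.
#include <atomic>
#include <cstddef>
#include <vector>

struct DrawBatch { /* ... */ };

struct BatchList {
    std::vector<DrawBatch> items;       // pre-allocated once, before the parallel section
    std::atomic<std::size_t> count{0};  // next free slot

    explicit BatchList(std::size_t capacity) : items(capacity) {}

    // Callable concurrently from any worker.
    DrawBatch* Allocate()
    {
        std::size_t index = count.fetch_add(1, std::memory_order_relaxed);
        return (index < items.size()) ? &items[index] : nullptr;  // nullptr if capacity exceeded
    }
};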
  
Bonus  slide:  scheduler  implementation  details

•  RenderLoop
•  (A) For each task, PushTask()
•  readyToStart (CPU dependencies met + task not skipped by the runtime) && there's an available
worker?
•  Send a signal to the worker
•  else
•  Place in queue
•  (B) Wait for a pending task submission.
•  (C) For each pending submission
•  readyToSubmit (GPU dependencies met)?
•  Send the command buffer to the GPU queue
•  Release/recycle it (platform dependent)
•  Repeat (B) and (C) until all frame tasks are completed & submitted – except for
"background" tasks

(A simplified, combined sketch of this loop and the worker loop follows the next slide.)
Bonus  slide:  scheduler  implementation  details

•  Worker
•  (A) Wait for a task's readyToStart signal
•  (B) If the task requires a command buffer, assign it a graphic context
•  (C) Execute the task
•  (D) Close the command buffer
•  (E) Place the task into the pending submission queue if the command buffer was actually filled
•  (F) Check for any other available task to run in the scheduler queue
•  if found, return to (B); else return to (A)
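For concreteness, here is a heavily simplified, self-contained C++ sketch combining this worker loop with the render loop from the previous slide. It keeps only the core idea (tasks are recorded on any worker, command buffers are submitted in frame order) and omits CPU dependencies, skipped/background/forced-immediate tasks, real graphic contexts and work stealing; every name is illustrative, not the VoidEngine implementation.

// Simplified scheduler sketch: record in any order, submit in frame order.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct CommandBuffer { std::string debugName; };          // stand-in for a local command buffer
using TaskFn = std::function<void(CommandBuffer&)>;

struct Scheduler {
    struct Pending { int frameOrder; CommandBuffer cb; };

    std::mutex m;
    std::condition_variable taskReady, submissionReady;
    std::queue<std::pair<int, TaskFn>> tasks;             // (frame order, work to record)
    std::vector<Pending> pending;                         // recorded, waiting for in-order submission
    bool quit = false;

    void PushTask(int frameOrder, TaskFn fn) {            // RenderLoop step (A)
        { std::lock_guard<std::mutex> lk(m); tasks.push({frameOrder, std::move(fn)}); }
        taskReady.notify_one();
    }

    void WorkerLoop() {                                    // Worker steps (A)-(F)
        for (;;) {
            std::pair<int, TaskFn> job;
            {
                std::unique_lock<std::mutex> lk(m);
                taskReady.wait(lk, [&] { return quit || !tasks.empty(); });
                if (tasks.empty()) return;                 // quit requested and nothing left
                job = std::move(tasks.front()); tasks.pop();
            }
            CommandBuffer cb{"cb_" + std::to_string(job.first)};  // (B) "assign a graphic context"
            job.second(cb);                                        // (C) execute the task, recording commands
            {                                                      // (D)+(E) close it & queue for submission
                std::lock_guard<std::mutex> lk(m);
                pending.push_back({job.first, std::move(cb)});
            }
            submissionReady.notify_one();
        }
    }

    void SubmitFrame(int taskCount) {                      // RenderLoop steps (B)+(C)
        for (int next = 0; next < taskCount; ) {
            std::unique_lock<std::mutex> lk(m);
            submissionReady.wait(lk, [&] {                 // (B) wait until the next buffer in frame order is ready
                for (const Pending& p : pending) if (p.frameOrder == next) return true;
                return false;
            });
            for (auto it = pending.begin(); it != pending.end(); ++it)
                if (it->frameOrder == next) {              // (C) "GPU dependency" met: submit & recycle
                    std::printf("submit %s\n", it->cb.debugName.c_str());
                    pending.erase(it);
                    ++next;
                    break;
                }
        }
    }
};

int main() {
    Scheduler s;
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i) workers.emplace_back([&s] { s.WorkerLoop(); });

    for (int i = 0; i < 6; ++i)                            // e.g. shadows, z-prepass, opaque, alpha, post-fx...
        s.PushTask(i, [](CommandBuffer&) { /* record GPU commands here */ });
    s.SubmitFrame(6);                                      // buffers are replayed in frame order

    { std::lock_guard<std::mutex> lk(s.m); s.quit = true; }
    s.taskReady.notify_all();
    for (std::thread& w : workers) w.join();
    return 0;
}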
  
Bonus  slide:  Concurrency  Visualizer

•  Low-level: catches CPU core stalls, memory management, preemption, sleep, IO
•  Blocking & unblocking call stacks
•  Timings
•  Markers API to make it readable (see the sketch below)
•  Observe context switches

[Screenshot: catching context switches in the timeline]
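Assuming the Concurrency Visualizer SDK is installed (cvmarkersobj.h), instrumenting a task with the markers API could look roughly like this.

// Sketch of tagging a rendering task in the Concurrency Visualizer timeline.
#include <cvmarkersobj.h>
#include <tchar.h>

Concurrency::diagnostic::marker_series g_renderSeries(_T("RenderTasks"));  // one named track in the timeline

void ExecuteShadowTask()
{
    Concurrency::diagnostic::span taskSpan(g_renderSeries, _T("Shadow 1"));  // visible for the task's whole duration
    // ... record shadow draw commands ...
    g_renderSeries.write_flag(_T("Shadow 1 recorded"));
}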
  
	
  
Bonus  slide:  parallelFor  sample
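The code sample shown on this slide did not survive the export; as a stand-in, here is a minimal self-contained ParallelFor sketch in the spirit of the interface described earlier (the engine's own helper is not shown in the deck).

// Minimal ParallelFor: split an index range across hardware workers.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void ParallelFor(int begin, int end, const std::function<void(int)>& body)
{
    const int count = end - begin;
    if (count <= 0) return;
    const int workers = (int)std::max(1u, std::thread::hardware_concurrency());
    const int chunk = (count + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        const int from = begin + w * chunk;
        const int to = std::min(end, from + chunk);
        if (from >= to) break;
        pool.emplace_back([=, &body] { for (int i = from; i < to; ++i) body(i); });
    }
    for (std::thread& t : pool) t.join();
}

// Usage: split light preparation across all workers in one line.
// ParallelFor(0, (int)lights.size(), [&](int i) { PrepareLight(lights[i]); });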
Bonus  slide:  chaining  sample
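The original sample image is likewise missing; below is a hedged sketch of what input/output chaining between three tasks could look like, with all names assumed for illustration.

// Hypothetical chaining sketch: a task's input is declared as another task's output,
// which both wires the data and lets the scheduler derive dependencies (and fix the
// chain when a task is skipped).
#include <vector>

struct RenderTarget { int id; };

struct TaskNode {
    const char*            name;
    std::vector<TaskNode*> producers;   // tasks whose outputs we consume
    RenderTarget           output{0};

    RenderTarget& DeclareOutput(int id)         { output.id = id; return output; }
    RenderTarget& InputFrom(TaskNode& producer) { producers.push_back(&producer); return producer.output; }
};

void SetupChain(TaskNode& a, TaskNode& b, TaskNode& c)
{
    a.DeclareOutput(0);   // Task A: RT 0 (out)
    b.InputFrom(a);       // Task B: RT 0 from A (in)
    b.DeclareOutput(1);   //         RT 1 (out)
    c.InputFrom(b);       // Task C: RT 1 from B (in)
    // If B were skipped at runtime, the scheduler would rewire C's input to A's output.
}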
References

•  [1] Intel Threading Building Blocks (library)
https://www.threadingbuildingblocks.org/
•  [2] Microsoft Parallel Patterns Library
https://msdn.microsoft.com/en-us/library/vstudio/dd492418.aspx
•  [3] Microsoft Concurrency Visualizer (PC profiling tool)
•  Bundled with Visual Studio 2012
•  Optional extension since 2013
https://visualstudiogallery.msdn.microsoft.com/24b56e51-fcc2-423f-b811-f16f3fa3af7a
