War stories
from .NET team
.NET Core Summer event 2019 – Prague, CZ
Karel Zikmund – @ziki_cz
Agenda
• Stories
• Investigations on .NET team
• Not just from me
• Lessons learned on the way
Not on agenda
You won’t see any:
• Source code
• Debugger
Not needed: Deep .NET knowledge
My First Serious Investigation
• Build lab for Windows component
• Build break 1x per week
• AccessViolation dialog hangs machine
• Toolset updated to 2.0 RTM
• Repro:
• Once in ~50 runs
• Overnight run: 247 crashes out of 77,006 runs (0.3%)
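The two repro figures above can be put side by side with a little arithmetic. This is a minimal sketch: only the two counts come from the slide, and the variable names are illustrative.

```python
# Counts quoted on the slide for the overnight run.
crashes = 247
runs = 77_006

rate = crashes / runs             # fraction of runs that crashed
runs_per_crash = runs / crashes   # average number of runs per crash

print(f"crash rate: {rate:.2%}")                        # crash rate: 0.32%
print(f"about 1 crash per {runs_per_crash:.0f} runs")   # about 1 crash per 312 runs
```

Measured this way, the overnight rate is noticeably lower than the earlier "once in ~50 runs" estimate (2%). The slides do not say why the two figures differ.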
My First Serious Investigation - quotes
• “The actual crash is occurring on some boilerplate stack checking code …”
• “Karel is relatively new to the code base so he indicated it might take some time to understand what’s going on”
My First Serious Investigation
mscorwks!UTSemReadWrite::UnlockRead+0xe [f:\rtm\ndp\clr\src\utilcode\utsem.cpp @ 357]
mscorwks!CMDSemReadWrite::~CMDSemReadWrite+0x14 [f:\rtm\...\md\enc\rwutil.cpp @ 1299]
mscorwks!RegMeta::DefineParam+0x196 [f:\rtm\ndp\clr\src\md\compiler\emit.cpp @ 2719]
cscomp!EMITTER::EmitParamProp
cscomp!ParamAttrBind::Init
cscomp!ParamAttrBind::CompileParamList
cscomp!CLSDREC::compileMethod
cscomp!CLSDREC::CompileMember
cscomp!CLSDREC::EnumMembersInEmitOrder
cscomp!CLSDREC::compileAggregate
cscomp!CLSDREC::compileNamespace
cscomp!COMPILER::CompileAll
cscomp!COMPILER::Compile
cscomp!CController::RunCompiler
cscomp!CController::Compile
csc!main
My First Serious Investigation
• Who corrupts stack?
• GC?
• NO!
• Changed value between caller and callee
• Single bit changed
• Who corrupts it?
• GC card table updates?
• Of course NOT!
• What about HW?
• Naw!
• Or maybe?
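The "single bit changed" observation is the kind of thing that can be checked mechanically once the value before and after is known. A minimal sketch, assuming two made-up 32-bit values; they are hypothetical and not taken from the real crash dumps.

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return positions (0 = least significant) of the bits that differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

def is_single_bit_flip(expected: int, observed: int) -> bool:
    """True when the two values differ in exactly one bit."""
    diff = expected ^ observed
    return diff != 0 and diff & (diff - 1) == 0

# Hypothetical values: what the caller passed vs. what the callee saw.
caller_value = 0x0012F7A4
callee_value = 0x0012F7E4

print(is_single_bit_flip(caller_value, callee_value))  # True
print(flipped_bits(caller_value, callee_value))        # [6]
```

A difference of exactly one bit is hard to explain with a stray software write, which usually clobbers a whole byte or word, so it is a hint to start asking about hardware.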
My First Serious Investigation
• Does it, by any chance, reproduce on only one machine?
• Answer: How did you know?
• But why always the same callstack?
• Good question, no good answer … magic
• Lesson learned: Debugging HW errors is costly and hard
• Always ask: Does it repro on more than 1 machine?
Another MetaData story
MetaData format background:
• Basically a database – rows and columns
• Example – TypeDef table:
  Flags      TypeName  TypeNamespace    Extends  MethodList
  (Public)   “Foo”     “Awesome.Story”  …        Method #10
  (Private)  “Bar”     “Awesome.Story”  …        Method #11
• Indexes into tables/heaps are either 2B or 4B
• What happens if the last TypeDef has no methods?
• MethodList = Number of methods + 1 = max + 1
• What happens if there are 0xffff methods?
Another MetaData story
• II.24.2.6 “#~ stream”
• If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than 2^16 rows, otherwise it is stored using 4 bytes.
• II.22.37 TypeDef : 0x02
• 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid means 1 <= row <= rowcount+1 [ERROR]
• How do you fix it?
• “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17 years”
• https://github.com/dotnet/corefx/issues/29554
• Lesson learned: Not all bugs have to be fixed
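The clash between the two spec rules quoted above can be shown with a few lines of arithmetic. This is a sketch of one reading of the slide, not code from any metadata reader or writer; the helper names are invented for illustration.

```python
def index_width(row_count: int) -> int:
    """Byte width of a simple table index, per the II.24.2.6 rule quoted above."""
    return 2 if row_count < 2**16 else 4

def end_marker(row_count: int) -> int:
    """MethodList value for a last TypeDef with no methods: rowcount + 1."""
    return row_count + 1

def fits(value: int, width: int) -> bool:
    """Can `value` be stored as an unsigned integer of `width` bytes?"""
    return value < 2 ** (8 * width)

for rows in (10, 0xFFFE, 0xFFFF, 0x10000):
    width = index_width(rows)
    marker = end_marker(rows)
    print(f"rows={rows:#x} width={width}B marker={marker:#x} fits={fits(marker, width)}")
```

Only the 0xFFFF row count fails: the table still qualifies for 2-byte indexes, yet the rowcount+1 value 0x10000 needs more than 2 bytes.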
Breaking changes – Intro
• Everyone wants fix for their bug
• But nobody wants to be broken
• Observation: 10% of fixes have unintended side-effects
• Extreme case: Perf improvement can break app
• How many customers?
• Lesson learned: Everything has risk of breaking someone
Breaking changes – Last build
• Finance app crashing – “last” build of Windows 8 on ARM (Surface RT)
• Latent bug (introduced months ago)
• Bug triggered by:
1. Method in NGen image has to be across 8KB pages
2. GC has to be triggered at least twice when it’s on stack
• Unrelated change caused “unlucky” method order for:
• System.Net.Configuration.DefaultProxySectionInternal..ctor
• Lesson learned: Anything, really ANYTHING, has risk of breaking
Breaking changes – Huge impact
• Patch to .NET Framework broke certain tax SW
• Printing tax forms
• Update pushed a few days before the tax deadline in the US
• Note: Printing was tested on both sides (Microsoft & tax SW company)
• But only into file, not to printer
• Lesson learned: Be extra cautious around sensitive dates
Networking – Security issue
• January: Researcher running ML models on Cosmos
• Suspicion about buffers – more logging
• March: Repro gone
• May: Similar report
• +2 weeks: It blows up (more teams & impact)
• All hands on deck
• Small repro (20 min, then 1 min) … yay!
• TTD trace (iDNA / TTT) … bonus & life saver
Networking – Security issue
• Root-cause: HTTP pipelining under stress
• 13-year-old bug (.NET 2.0)
(Diagram, two panels, each with a Server. First panel: Request 1, Response 1. Second panel: Request 1 and Request 2, then Response 1 and Response 2.)
Networking – Security issue
(Diagram, repeated on two consecutive slides: Request 1, Request 2 and Request 3 sent to the Server; Response 1 and Response 2 returned.)
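With pipelining, responses carry no request ID; the client pairs them with requests purely by order on the connection. The sketch below shows why that pairing is fragile. It is a hypothetical illustration of the failure class, not a reconstruction of the actual .NET bug, which the slides do not detail; `deliver` and the faulty cancellation behaviour are assumptions.

```python
from collections import deque

def deliver(pending_requests, responses):
    """Pair each response with the oldest request still waiting on the connection."""
    queue = deque(pending_requests)
    delivered = {}
    for response in responses:
        if not queue:
            break  # nobody left waiting; extra responses are ignored in this sketch
        delivered[queue.popleft()] = response
    return delivered

sent = ["Request 1", "Request 2", "Request 3"]
server_replies = ["Response 1", "Response 2", "Response 3"]  # server answers in order

healthy = deliver(sent, server_replies)

# Hypothetical faulty client: Request 2 is cancelled and dropped from the
# queue, yet the client keeps reading responses from the same connection.
still_waiting = [r for r in sent if r != "Request 2"]
mixed_up = deliver(still_waiting, server_replies)

print(healthy)
print(mixed_up)  # Request 3 is handed Response 2
```

In a multi-tenant service, the caller behind Request 3 would receive data that belongs to someone else, which is what makes this class of bug a security issue rather than only a reliability one.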
Networking – Security issue
• We have workaround (disable pipelining) – perf impact
• Worked fix …
• Verifying fix …
• Repro fails after 4h ☹
• Same symptoms
• Repro sensitive to cloud network load (8-17)
• TTD (iDNA / TTT) does not work ☹
• Suspicion about buffers again
Networking – Security issue
• Bad buffer lifetime management – on sending side!
• 5-year-old bug (.NET 4.5.2)
• Trigger found:
• Thanks to Skype team – 24h deployment of experiments
• Change in .NET 4.7.1
• Fix around the problematic area
• Making the opportunity window SMALLER!
• … counter-intuitive
• Code review – similar bug on receiving side (5 years old)
• Same symptoms as HTTP pipelining
Networking – Security issue
• Why so many customers/services hit it at once?
• Maybe the rollout of Spectre & Meltdown fixes?
• or just … magic
• Lesson learned: Weird coincidences can happen …
Developer’s pride in multi-threading
• School project (2000-2003)
• Game simulation server – heavily multi-threaded
• https://github.com/karelz/WarPlusPlus (nostalgia)
• Classic deadlock – 2 threads locking A and B in different order
• Deadlock avoidance started to make sense
• WinRT binder (2010)
• Binder is tricky – GC interaction (NO_GC range)
• Type routed to WinMD file, assembly meaningless
• Negotiated on namespace only in 1 assembly
• Multiple reviews, discussions with architects
• Bugs start to come in after shipping (NullReferenceException)
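The classic deadlock mentioned above, two threads taking locks A and B in opposite order, goes away if every thread acquires locks in one agreed order. The sketch below shows that common technique; the slides do not say which avoidance scheme the school project used, and all names here are illustrative.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def in_global_order(*locks):
    """Sort locks by a stable key so every thread acquires them in the same order."""
    return sorted(locks, key=id)

def worker(first, second, results, name):
    # The caller's preferred order is ignored; the global order is used instead.
    outer, inner = in_global_order(first, second)
    with outer:
        with inner:
            results.append(name)

results = []
t1 = threading.Thread(target=worker, args=(lock_a, lock_b, results, "t1"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, results, "t2"))  # opposite order
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)

print(sorted(results))  # ['t1', 't2']
```

Without the sorting step, t1 could hold A while waiting for B and t2 could hold B while waiting for A, and neither would ever finish.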
Optimizations
• Once upon a time, … there was a service in Microsoft
• List vs. array data structure perf
• Perspectives:
1. The data structure will have in practice 3-5 items
2. There are 3 hops between servers for each request!!!
• Lesson learned: Avoid premature optimizations … at all costs
Breaking changes – Below you
• RavenDB – blue screen after KB4487017 on .NET Core!
• dotnet/coreclr#22597
• PrefetchVirtualMemory
• Kernel memory management bug
Lessons learned
• Always ask: Does it repro on more than 1 machine?
• Debugging HW bugs is costly
• Some bugs happen once in 17 years
• Spec bugs are hard to fix
• MetaData format bug
• Anything, really ANYTHING, has risk of breaking someone
• Innocent changes can trigger latent bugs elsewhere
• Impact may be huge – e.g. during tax season
• Always try to create small repro
• Make your and everyone’s life easier
• TTD (iDNA / TTT) is life saver
• … sometimes there is just … magic
Thank you
• Feedback welcome
• Twitter DM, email, in-person, etc.
• Survey
• What you liked vs. not?
• Too rushed?
• Hard to understand?
• Boring?
• Didn’t meet your expectations?
@ziki_cz

Editor's Notes

  • #2: Quickly about me: .NET team for almost 14 years. Started as junior / out of college on Runtime – C++, pieces like Metadata, TypeSystem, Assembly Loader. Later on moved to manager role. Then moved to BCL (Base Class Libraries) – Networking area mainly (HttpClient) … working in open-source (.NET Core). Community manager of dotnet/corefx repo.
  • #3: Lessons learned – maybe useful to you. Maybe just helps you understand what is happening on the other side / below you. I already had a few people confirm they hit some/all situations. Were able to identify with problems and recommendations.
  • #4: 2006 January – 3 months in MS. Large code base, dozens of machines, productivity impact on larger team. Crash – “hang dialog” with AV. msbuild -> C# compiler. Recently upgraded toolset to 2.0 RTM (.NET Framework, not Core ☺). Repro – great. Getting heap dumps. We get to see callstack … but before that, some quotes.
  • #5: … in the metadata writer code
  • #6: Simplified callstack for readability. AV in MetaData emitting – defining a parameter. Basically stack corruption (dangerous). Proper RW lock. Who corrupts memory? …
  • #7: GC? … not Roslyn – this is native, no GC. Why something else? C# compiler is deterministic. Go into assembly (x86) – what are arguments vs. locals. Great exercise to learn/refresh all this in here.
  • #8: Costly and hard … and requires quite some expertise. Variants: Different machine setup? … driver bugs. Extreme from Maoni: Real HW?
  • #9: 1 year old story – 2018 May. First background on MetaData. Compressed indexes = just schema which says 2B, 4B … variable between files, but static/stable and given per file. MethodList = Start of list of methods, INCLUSIVE.
  • #10: How do you fix that? … You don’t … spec bug / format bug. Changing rules means rewriting & recompiling all tools (CCI and command line tools like ildasm, or UI Reflector, ILSpy, Visual Studio, debuggers, profilers, …). Compensate? Rearranging fields/methods/params in a way the last one does not need the +1: nasty. Emitting fake type/method with field/method/param to push row count to 2^16: also nasty. Using 0 as valid value? Readers will be surprised, maybe other bugs?
  • #11: Read slides
  • #12: OEM getting builds 2 days Paranoia
  • #13: Sensitive dates like tax date, shopping season? (December) … online stores usually have stop on any changes
  • #14: Last July (2018). Story starts 8 months earlier in December 2017. Is it server or client problem? … Wireshark traces. Around Feb, we know it is client - .NET or Windows. March – repro is gone (they upgraded cluster). (Fast forward 2 months.) May – another email thread – similar symptoms. Back and forth. Heated. Realize it is 2 different products on the thread. And then a couple more start coming in a span of 2 weeks. Impact on one customer is huge. Potential: data loss; information disclosure – mixing data in multi-tenant scenarios. 3-4 weeks of all-hands on deck + 24/7. We had iDNA trace (TTD / TTT).
  • #16: What happens when requests are cancelled? If 1st – close connection. If last – remove it and mark for closing. If in middle – remove it and mark for closing.
  • #17: Bad things can happen – imagine you asked: “Does the data exist?” … data loss. Multi-tenant scenarios: “Give me data about customer X” … data about Y.
  • #18: Added logging (ETW) – reused buffers. Old code – track down bad buffer management.
  • #21: Story begins at University (2000-2003) - Large project (2.5 years, 1M+ lines of code, 5 people)
  • #22: Something like SharePoint Online, Bing, O365 OneNote – where response to customer matters. Caring about perf early (with measurements!) != premature optimization.
  • #23: February case. Blue screen – not something CoreCLR should be able to cause. Used PrefetchVirtualMemory to optimize perf – rarely used. When you’re on the cutting edge, pushing the limits, you should be prepared for anything …
  • #25: Help me do a better job next time.