War stories from .NET team
.NET Core Summer event 2019 – Vienna, AT
Karel Zikmund – @ziki_cz
Agenda
• Stories
• Investigations on .NET team
• Not just from me
• Lessons learned on the way
You won’t see any:
• Source code
• Debugger
Not needed: Deep .NET knowledge
Not on agenda
My First Serious Investigation
• Build lab for Windows component
• Build break 1x per week
• AccessViolation dialog hangs machine
• Toolset updated to 2.0 RTM
• Repro:
• Once in ~50 runs
• Overnight run: 247 crashes out of 77,006 runs (0.3%)
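The overnight numbers can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function name `crash_rate` is invented for illustration and is not part of any tool mentioned in the talk.

```python
def crash_rate(crashes: int, runs: int) -> float:
    """Return the observed failure rate as a fraction of all runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return crashes / runs


# Numbers from the overnight run on the slide.
rate = crash_rate(247, 77_006)
runs_per_crash = 1 / rate

print(f"crash rate: {rate:.2%}")
print(f"about one crash per {runs_per_crash:.0f} runs")
```

The measured rate works out to roughly one crash in 312 runs, so the overnight run shows a rarer failure than the first "once in ~50 runs" estimate.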
My First Serious Investigation - quotes
• “The actual crash is occurring on some boilerplate stack checking code …”
• “Karel is relatively new to the code base so he indicated it might take some time to understand what’s going on”
My First Serious Investigation
mscorwks!UTSemReadWrite::UnlockRead+0xe [f:\rtm\ndp\clr\src\utilcode\utsem.cpp @ 357]
mscorwks!CMDSemReadWrite::~CMDSemReadWrite+0x14 [f:\rtm\...\md\enc\rwutil.cpp @ 1299]
mscorwks!RegMeta::DefineParam+0x196 [f:\rtm\ndp\clr\src\md\compiler\emit.cpp @ 2719]
cscomp!EMITTER::EmitParamProp
cscomp!ParamAttrBind::Init
cscomp!ParamAttrBind::CompileParamList
cscomp!CLSDREC::compileMethod
cscomp!CLSDREC::CompileMember
cscomp!CLSDREC::EnumMembersInEmitOrder
cscomp!CLSDREC::compileAggregate
cscomp!CLSDREC::compileNamespace
cscomp!COMPILER::CompileAll
cscomp!COMPILER::Compile
cscomp!CController::RunCompiler
cscomp!CController::Compile
csc!main
My First Serious Investigation
• Who corrupts stack?
• GC?
• NO!
• Changed value between caller and callee
• Single bit changed
• Who corrupts it?
• GC card table updates?
• Of course NOT!
• What about HW?
• Naw!
• Or maybe?
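A value that differs between caller and callee by exactly one bit is a classic hint at a hardware fault. A minimal sketch of how such a difference can be recognized is shown here; the two values are hypothetical, not taken from the actual dump.

```python
def flipped_bits(before: int, after: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = before ^ after
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


def is_single_bit_flip(before: int, after: int) -> bool:
    """True when the two values differ in exactly one bit."""
    diff = before ^ after
    # A power of two has exactly one bit set, so diff & (diff - 1) is zero.
    return diff != 0 and (diff & (diff - 1)) == 0


# Hypothetical values: what the caller passed vs. what the callee saw.
caller_value = 0x0012F9C4
callee_value = caller_value ^ (1 << 13)

print(flipped_bits(caller_value, callee_value))
print(is_single_bit_flip(caller_value, callee_value))
```

Software bugs usually overwrite whole bytes or words, so a lone flipped bit makes memory or other hardware a more plausible suspect.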
My First Serious Investigation
• Does it by any chance reproduce on only one machine?
• Answer: How did you know?
• But why always the same callstack?
• Good question, no good answer … magic
• Lesson learned: Debugging HW errors is costly and hard
• Always ask: Does it repro on more than 1 machine?
Another MetaData story
MetaData format background:
• Basically database – rows and columns
• Example – TypeDef table:
  Flags      TypeName  TypeNamespace    Extends  MethodList
  (Public)   “Foo”     “Awesome.Story”  …        Method #10
  (Private)  “Bar”     “Awesome.Story”  …        Method #11
• Indexes into tables/heaps are either 2B or 4B
• What happens if last TypeDef has no methods?
• MethodList = Number of methods + 1 = max + 1
• What happens if there are 0xffff methods?
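The MethodList column of the TypeDef table marks where each type's run of methods starts; the run ends where the next type's run begins, or at the end of the MethodDef table for the last type. The sketch below illustrates that rule. It assumes a MethodDef table with 11 rows and adds an invented third type, “Baz”, with no methods; neither assumption comes from the slides.

```python
def method_ranges(method_lists: list[int], method_row_count: int) -> list[range]:
    """For each TypeDef row, return the 1-based MethodDef rows it owns."""
    ranges = []
    for i, start in enumerate(method_lists):
        if i + 1 < len(method_lists):
            # The run ends where the next type's run begins.
            end = method_lists[i + 1]
        else:
            # The last type owns everything up to the end of the table.
            end = method_row_count + 1
        ranges.append(range(start, end))
    return ranges


# Hypothetical data: 11 MethodDef rows; "Baz" is an invented type without methods.
type_names = ["Foo", "Bar", "Baz"]
method_lists = [10, 11, 12]

owned = method_ranges(method_lists, method_row_count=11)
for name, rows in zip(type_names, owned):
    print(name, list(rows))
```

“Baz” owns no methods, yet its MethodList is 12, one past the last row. That is the "max + 1" case from the slide.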
Another MetaData story
• II.24.2.6 “#~ stream”
• If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than
2^16 rows, otherwise it is stored using 4 bytes.
• II.22.37 TypeDef : 0x02
• 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid
means 1 <= row <= rowcount+1 [ERROR]
• How do you fix it?
• “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17
years”
• https://github.com/dotnet/corefx/issues/29554
• Lesson learned: Not all bugs have to be fixed
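The two spec rules quoted above collide at one row count. A minimal sketch of the arithmetic, with helper names invented for illustration:

```python
def index_width(row_count: int) -> int:
    """Width in bytes of a simple index into a table (rule from II.24.2.6)."""
    return 2 if row_count < 2**16 else 4


def max_method_list(method_row_count: int) -> int:
    """Largest MethodList value the TypeDef rule allows: rowcount + 1."""
    return method_row_count + 1


def fits(value: int, width: int) -> bool:
    """True when value can be stored in an unsigned field of the given width."""
    return value < 2 ** (8 * width)


for rows in (0xFFFE, 0xFFFF, 0x10000):
    width = index_width(rows)
    value = max_method_list(rows)
    print(f"rows={rows:#x} width={width}B max MethodList={value:#x} fits={fits(value, width)}")
```

Only a MethodDef table with exactly 0xFFFF rows hits the problem: the index is still 2 bytes wide, but a last TypeDef without methods needs the value 0x10000.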
Breaking changes – Intro
• Everyone wants fix for their bug
• But nobody wants to be broken
• Observation: 10% of fixes have unintended side-effects
• Extreme case: Perf improvement can break app
• How many customers?
• Lesson learned: Everything has risk of breaking someone
Breaking changes – Last build
• Finance app crashing – “last” build of Windows 8 on ARM (Surface RT)
• Latent bug (introduced months ago)
• Bug triggered by:
1. Method in NGen image has to be across 8KB pages
2. GC has to be triggered at least twice when it’s on stack
• Unrelated change caused “unlucky” method order for:
• System.Net.Configuration.DefaultProxySectionInternal..ctor
• Lesson learned: Anything, really ANYTHING, has risk of breaking
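The first trigger condition depends only on where the method lands in the image, which is why an unrelated change could flip it. The sketch below reads "across 8KB pages" as "the method's code straddles a boundary between two 8 KB pages"; that reading, and all offsets used, are assumptions for illustration.

```python
PAGE_SIZE = 8 * 1024  # 8 KB, as on the slide


def crosses_page_boundary(start: int, size: int, page_size: int = PAGE_SIZE) -> bool:
    """True when the byte range [start, start + size) touches more than one page."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_page = start // page_size
    last_page = (start + size - 1) // page_size
    return first_page != last_page


# Hypothetical layouts of the same 0x100-byte method before and after
# an unrelated change reordered the methods in the image.
print(crosses_page_boundary(0x1000, 0x100))  # well inside one page
print(crosses_page_boundary(0x1F80, 0x100))  # straddles the boundary at 0x2000
```

Nothing in the method itself changes between the two calls, only its offset, which matches the "unlucky method order" description.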
Breaking changes – Huge impact
• Patch to .NET Framework broke certain tax SW
• Printing tax forms
• Update pushed a few days before the tax deadline in the US
• Note: Printing was tested on both sides (Microsoft & tax SW company)
• But only into file, not to printer
• Lesson learned: Be extra cautious around sensitive dates
Networking – Security issue
• January: Researcher running ML models on Cosmos
• Suspicion about buffers – more logging
• March: Repro gone
• May: Similar report
• +2 weeks: It blows up (more teams & impact)
• All hands on deck
• Small repro (20 min, then 1 min) … yay!
• TTD trace (iDNA / TTT) … bonus & life saver
Networking – Security issue
• Root-cause: HTTP pipelining under stress
• 13-year-old bug (.NET 2.0)
[Diagram: client and Server exchanging Request 1 / Response 1 and Request 2 / Response 2]
Networking – Security issue
[Diagram: Request 1, Request 2 and Request 3 sent to Server; Response 1 and Response 2 returned]
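With HTTP pipelining, responses come back on the same connection in the order the requests were sent, so the client matches them purely by position. The toy model below is not the .NET implementation; the class and its deliberately unsafe cancel method are invented to show the class of failure, where a request is forgotten on the client but the connection keeps being used.

```python
from collections import deque


class PipelinedConnection:
    """Toy model: responses carry no request id and are matched by order only."""

    def __init__(self) -> None:
        self.pending = deque()  # requests the client still waits on, oldest first
        self.on_wire = deque()  # requests the server will answer, in send order

    def send(self, request: str) -> None:
        self.pending.append(request)
        self.on_wire.append(request)

    def cancel_unsafely(self, request: str) -> None:
        # Illustrative bug: the client forgets the request, but the server
        # will still answer it and the connection stays in use.
        self.pending.remove(request)

    def receive_next(self) -> tuple[str, str]:
        answered = self.on_wire.popleft()  # server answers in send order
        waiting = self.pending.popleft()   # client hands it to the oldest waiter
        return waiting, f"data for {answered}"


conn = PipelinedConnection()
for customer in ("customer W", "customer Y", "customer X"):
    conn.send(customer)

conn.cancel_unsafely("customer Y")

first = conn.receive_next()
second = conn.receive_next()
print(first)
print(second)
```

The second result pairs the request for customer X with the data for customer Y. Closing the connection on cancellation, as a real client has to, avoids the mismatch.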
Networking – Security issue
• We have workaround (disable pipelining) – perf impact
• Worked fix …
• Verifying fix …
• Repro fails after 4h
• Same symptoms
• Repro sensitive to cloud network load (8-17)
• TTD (iDNA / TTT) does not work
• Suspicion about buffers again
Networking – Security issue
• Bad buffer lifetime management – on sending side!
• 5-year-old bug (.NET 4.5.2)
• Trigger found:
• Thanks to Skype team – 24h deployment of experiments
• Change in .NET 4.7.1
• Fix around the problematic area
• Making the opportunity window SMALLER!
• … counter-intuitive
• Code review – similar bug on receiving side (5 years old)
• Same symptoms as HTTP pipelining
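Bad buffer lifetime management on a sending path can be sketched with a tiny buffer pool. This shows the general class of bug under stated assumptions, not the actual .NET code; `BufferPool` and the tenant payloads are invented for the example.

```python
class BufferPool:
    """Minimal pool that hands out reusable fixed-size buffers."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._free: list[bytearray] = []

    def rent(self) -> bytearray:
        return self._free.pop() if self._free else bytearray(self._size)

    def give_back(self, buf: bytearray) -> None:
        self._free.append(buf)


pool = BufferPool(8)

# Operation A fills a buffer and queues a send that has not run yet.
buf_a = pool.rent()
buf_a[:] = b"tenant-A"
pending_send = buf_a  # the in-flight send still references the buffer

# Illustrative bug: the buffer goes back to the pool before the send completes.
pool.give_back(buf_a)

# Operation B rents the very same buffer and overwrites it.
buf_b = pool.rent()
buf_b[:] = b"tenant-B"

# When A's send finally runs, it transmits B's data.
sent = bytes(pending_send)
print(sent)
```

The window between returning the buffer and completing the send is what timing changes elsewhere can widen or narrow, and the result looks the same as the pipelining mix-up: one caller's data delivered in place of another's.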
Networking – Security issue
• Why so many customers/services hit it at once?
• Maybe Spectre & Meltdown fixes rollout?
• or just … magic
• Lesson learned: Weird coincidences can happen …
Lessons learned
• Always ask: Does it repro on more than 1 machine?
• Debugging HW bugs is costly
• Some bugs happen once in 17 years
• Spec bugs are hard to fix
• MetaData format bug
• Anything, really ANYTHING, has risk of breaking someone
• Innocent changes can trigger latent bugs elsewhere
• Impact may be huge – e.g. during tax season
• Always try to create small repro
• Make your and everyone’s life easier
• TTD (iDNA / TTT) is life saver
• … sometimes there is just … magic
Thank you
• Feedback welcome
• Twitter DM, email, in-person, etc.
• Survey
• What you liked vs. not?
• Too rushed?
• Hard to understand?
• Boring?
• Didn’t meet your expectations?
@ziki_cz


Editor's Notes

  • #2: Quickly about me:
    • .NET team for almost 14 years
    • Started as junior / out of college on Runtime – C++, pieces like Metadata, TypeSystem, Assembly Loader
    • Later on moved to manager role
    • Then moved to BCL (Base Class Libraries) – Networking area mainly (HttpClient) … working in open-source (.NET Core)
    • Community manager of dotnet/corefx repo
  • #3: Lessons learned – maybe useful to you
    • Maybe just helps you understand what is happening on the other side / below you
    • I already had a few people confirm they hit some/all situations
    • Were able to identify with problems and recommendations
  • #4: 2006 January – 3 months in MS
    • Large code base, dozens of machines, productivity impact on larger team
    • Crash – “hang dialog” with AV
    • msbuild -> C# compiler
    • Recently upgraded toolset to 2.0 RTM (.NET Framework, not Core)
    • Repro – great
    • Getting heap dumps
    • We get to see callstack … but before that, some quotes
  • #5: … in the metadata writer code
  • #6: Simplified callstack for readability
    • AV in MetaData emitting – defining a parameter
    • Basically stack corruption (dangerous)
    • Proper RW lock
    • Who corrupts memory? …
  • #7: GC? … not Roslyn – this is native, no GC
    • Why something else? C# compiler is deterministic
    • Go into assembly (x86) – what are arguments vs. locals
    • Great exercise to learn/refresh all this in here
  • #8: Costly and hard … and requires quite some expertise
    • Variants: Different machine setup? … driver bugs
    • Extreme from Maoni: Real HW?
  • #9: 1 year old story – 2018 May
    • First background on MetaData
    • Compressed indexes = just schema which says 2B, 4B … variable between files, but static/stable and given per file
    • MethodList = Start of list of methods, INCLUSIVE
  • #10: How do you fix that? … You don’t … spec bug / format bug
    • Changing rules means rewriting & recompiling all tools (CCI and command line tools like ildasm, or UI Reflector, ILSpy, Visual Studio, debuggers, profilers, …)
    • Compensate?
      • Rearranging fields/methods/params in a way the last one does not need the +1. Nasty
      • Emitting fake type/method with field/method/param to push row count to 2^16. Also nasty
      • Using 0 as valid value? Readers will be surprised, maybe other bugs?
  • #11: Read slides
  • #12: OEM getting builds 2 days Paranoia
  • #13: Sensitive dates like tax date, shopping season? (December) … online stores usually have stop on any changes
  • #14: Last July (2018)
    • Story starts 8 months earlier in December 2017
    • Is it server or client problem? … wireshark traces
    • Around Feb, we know it is client - .NET or Windows
    • March – repro is gone (they upgraded cluster)
    • (fast forward 2 months) May another email thread – similar symptoms
    • Back and forth
    • Heated
    • Realize it is 2 different products on the thread
    • And then couple of more start coming in span of 2 weeks
    • Impact on one customer is huge
    • Potential:
      • Data loss
      • Information disclosure – mixing data in multi-tenant scenarios
    • 3-4 weeks of all-hands on deck + 24/7
    • We had iDNA trace (TTD / TTT)
  • #16: What happens when requests are cancelled?
    • If 1st – close connection
    • If last – remove it and mark for closing
    • If in middle – remove it and mark for closing
  • #17: Bad things can happen – imagine you asked: “Does the data exist?” … data loss
    • Multi-tenant scenarios: “Give me data about customer X” … data about Y
  • #18: Added logging (ETW) – reused buffers
    • Old code – track down bad buffer management
  • #22: Help me do a better job next time