This is a text transcription of the slides from the "Windows: a software engineering odyssey" talk given on Microsoft culture by Mark Lucovsky in 2000. This is hosted here because I wanted to link to the slides, but the only formats available online were PowerPoint and slide-per-page HTML where each page is basically a screenshot of a PowerPoint slide. If you're looking for something on current Microsoft culture, try these links.
Agenda
- History of NT
- Design Goals/Culture
- NT 3.1 vs. Win2k
- The next 10 years
NT timeline: first 10 years
- 2/89: Coding begins
- 7/93: NT 3.1 ships
- 9/94: NT 3.5 ships
- 5/95: NT 3.51 ships
- 7/96: NT 4.0 ships
- 12/99: NT 5.0 a.k.a. Windows 2000 ships
Unix timeline: first 20 years
- 69: coding begins
- 71: first edition -- PDP 11/20
- 73: fourth edition -- rewritten in C
- 75: fifth edition -- leaves Bell Labs, basis for BSD 1.x
- 79: seventh edition -- one of the best
- 82: System III
- 84: 4.2 BSD
- 89: SVR4 unification of Xenix, BSD, System V
History of NT
- Team forms 11/89
- Six guys from DEC
- One guy from MS
- Built from the ground up
- Advanced PC OS
- Designed for desktop & server
- Secure, scalable, SMP design
- All new code
- Schedule: 18 months (only missed our date by 3 years)
History of NT, cont.
- Initial effort targeted at Intel i860 code-named N10, hence the name NT which doubled as N-Ten and New Technology
- Most dev done on i860 simulator running OS/2 1.2
- Microsoft built a single board i860 computer code-named Dazzle, including the supporting chipset; ran full kernel, memory management, etc. on the machine
- Compiler came from Metaware with weekly UUCP updates sent to my Sun-4/200
- MS wrote a PE/COFF linker and a graphical cross debugger
Design longevity
- OS code has a long lifetime
- You have to base your OS on solid design principles
- You have to set goals; not everything can be at the top of the list
- You have to design for evolution in hardware, usage patterns, etc.
- Only way to succeed is to base your design on a solid architectural foundation
- Development environments never get enough attention
Goal setting
- First job was to establish high level goals
- Portability: ability to target more than one processor, avoid assembler, abstract away machine dependencies. Purposely started the i386 port very late to avoid falling into a typical Microsoft x86 centric design
- Reliability: nothing should be able to crash the OS. Anything that crashes the OS is a bug. Very radical thinking inside MS considering Win16 was co-operative multi-tasking in a single address space, and OS/2 had similar attributes with respect to memory isolation
- Extensibility: ability to extend OS over time
- Compatibility: with DOS, OS/2, POSIX, or other popular runtimes; this is the foundation work that allowed us to invent Windows two years into NT OS/2 development
- Performance: all of the above are more important than raw speed!
NT OS/2 design workbook
- Design of executive captured in functional specs
- Written by engineers, for engineers
- Every functional interface was defined and reviewed
- Small teams can do this efficiently
- Making this process scale is an almost impossible challenge
- Senior developers are inundated with spec reviews and the value of their feedback becomes meaningless
- You have to spread review duties broadly and everyone must share the culture
Developing a culture
- To scale a dev team, you need to establish a culture
- Common way of evaluating designs, making tradeoffs, etc.
- Common way of developing code and reacting to problems (build breaks, critical bugs, etc.)
- Common way of establishing ownership of problems
- Goal setting can be the foundation for the culture
- Keeping culture alive as a team grows is a huge challenge
The NT culture
- Portability, reliability, security, and extensibility ingrained as the team's top priorities
- Every decision was made in the context of these design goals
- Everyone owns all the code, so whenever something is busted anyone has a right and a duty to fix it
- Works in small groups (< 150 people) where people cover for each other
- Fails miserably in large groups
- Sloppiness is not tolerated
- Great idea, but very difficult to nurture as group grows
- Abuse and intimidation get way out of control; can't keep calling people stupid and expect them to listen
- A successful culture has to accept that mistakes will happen
NT 3.1 vs. Windows 2000
- Dev teams
- Source control
- Process management
- Serialized development
- Defects
Development team
- NT 3.1
- Starts small (6), slowly grows to 200 people
- NT culture was commonly understood by all
- Windows 2000
- Mass assimilation of other teams into the NT team
- NT 4.0 had 800 developers, Windows 2000 had 1400
- Original NT culture practiced by the old timers in the group, but keeping the culture alive was difficult due to growth, physical separation, etc.
- Diluted culture leads to conflict
- Accountability: I don't "own" the code that is busted, see Mark!
- Reliability vs. new features
- 64-bit portability vs. new features
Source control system (NT 3.1)
- Internally developed, maintained by a non-NT tools team
- No branch capability, but not needed for small team
- 10-12 well isolated source "projects", 6M LOC
- Informal project separation worked well
- Minimal obscure source level dependencies
- Small hard drive could easily hold entire source tree
- Developer could easily stay in sync with changes made to the system
Source control system (Windows 2000)
- Windows team takes ownership of source control system, which is on life support
- Branch capability sorely needed, tree copies used as substitutes, so merging is a nightmare
- 180 source "projects", 29M LOC
- No project separation, reaching "up and over" was very common as developers tried to minimize what they had to carry on their machines to get their jobs done
- Full source base required about 50 GB of disk space
- To keep a machine in sync was a huge chore (1 week to set up, 2 hours per day to sync)
Process management (NT 3.1)
- Safe sync period in effect for 4 hours each day; all other times, the rule is check-in when ready
- Build lab syncs during morning safe sync period, which starts a complete build
- Build breaks are corrected manually during the build process (1-2 breaks were normal)
- Complete build time is 5 hours on 486/50
- Build is boot tested with some very minimal testing before release to stress testing
- Defects corrected with incremental build fixes
- 4pm, stress testing on ~100 machines begins
Process management (Windows 2000)
- Developers not allowed to change source tree without explicit, email/written permission
- Build lab manually approves each check-in using a combination of email, web, and a bug tracking database
- Build lab approves about 100 changes each day and manually issues the appropriate sync and build commands
- Build breaks are corrected manually; when they occur, all further build processing is halted
- A developer that mistypes a build instruction can stop the build lab, which stops over 5000 people
- Complete build time is 8 hours on 4-way PIII Xeon 550 with 50 GB disk and 512 KB cache
- Build is boot tested and assuming we get a boot, extensive baseline testing begins
- Testing is a mostly manual, semi-automated process
- Defects occurring in the boot or test phase must be corrected before the build is "released" for stress testing
- 4pm, stress testing on ~1000 machines begins
Team size
Product | Devs | Testers |
NT 3.1 | 200 | 140 |
NT 3.5 | 300 | 230 |
NT 3.51 | 450 | 325 |
NT 4.0 | 800 | 700 |
Win2k | 1400 | 1700 |
Serialized Development
- The model from NT 3.1 to 2000
- All developers on team check in to a single main line branch
- Master build lab syncs to main branch and builds releases from that branch
- Checked in defect affects everyone waiting for results
Defect rates and serialization
- Compile time or run time bugs that occur in a dev's office only affect that dev
- Once a defect is checked in, the number of people affected by the defect increases
- Best devs are going to check in a runtime or compile time mistake at least twice a year
- Best devs will be able to cope with a checked in compile time or run time break very quickly (20 minutes end-to-end)
- As the code base gets larger, and as the team gets larger, these numbers typically double
Defect rates data
- With serialized development
- Good, small, teams operate efficiently
- Even the absolute best large teams are always broken and always serialized
Product | Team # | Defects/dev-yr | Fix time / defect | Defects / day | Total fix time |
NT 3.1 | 200 | 2 | 20m | 1 | 20m |
NT 3.5 | 300 | 2 | 25m | 1.6 | 41m |
NT 3.51 | 450 | 2 | 30m | 2.5 | 1.2h |
NT 4.0 | 800 | 3 | 35m | 6.6 | 3.8h |
Win2k | 1400 | 4 | 40m | 15.3 | 10.2h |
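The last two columns of the table appear to follow from the first three: defects per day is team size times defects per dev-year divided by 365, and total fix time is defects per day times fix time per defect. The slides do not state this formula, so the following is a minimal sketch under that assumption; the function and constant names are illustrative.

```python
# Rows from the table: (product, team size, defects per dev-year, fix minutes per defect)
ROWS = [
    ("NT 3.1", 200, 2, 20),
    ("NT 3.5", 300, 2, 25),
    ("NT 3.51", 450, 2, 30),
    ("NT 4.0", 800, 3, 35),
    ("Win2k", 1400, 4, 40),
]

DAYS_PER_YEAR = 365  # assumption: defects are spread evenly over calendar days


def defects_per_day(team_size, defects_per_dev_year):
    """Checked-in defects that reach the main line per day, team-wide."""
    return team_size * defects_per_dev_year / DAYS_PER_YEAR


def total_fix_minutes(team_size, defects_per_dev_year, fix_minutes):
    """Minutes per day spent fixing checked-in breaks on the serialized main line."""
    return defects_per_day(team_size, defects_per_dev_year) * fix_minutes


if __name__ == "__main__":
    for name, team, rate, fix in ROWS:
        per_day = defects_per_day(team, rate)
        minutes = total_fix_minutes(team, rate, fix)
        print(f"{name:8} {per_day:5.1f} defects/day {minutes / 60:5.1f} h fix time/day")
```

Under this assumption the NT 3.5 through Win2k rows reproduce the table to the precision shown. NT 3.1 comes out at roughly 1.1 defects per day and 22 minutes, which the slide appears to round to 1 and 20m.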
Dev environment summary
- NT 3.1
- Fast and loose; lots of fun & energy
- Few barriers to getting work done
- Defects serialized parts of the process, but didn't stop the whole machine; minimal downtime
- Windows 2000
- Source control system bursting at the seams
- Excessive process management serialized the entire dev process; 1 defect stops 1400 devs, 5000 team members
- Resources required to build a complete instance of NT were excessive, giving few developers a way to be successful
Focused fixes
- Source control
- Source code restructuring
- Make the large team work like a set of small teams
- Windows is already organized into reasonable sized dev teams
- Goal is to allow these teams to work as a team when contributing source code changes rather than as a group of individuals that happen to work for the same VP
- Parallel development, team level independence
- Automated builds
Source control system
- New system identified 3/99 (SourceDepot)
- Native branch support
- Scalable high speed client-server architecture
- New machine setup 3 hours vs. 1 week
- Normal sync 5 minutes vs. 2 hours
- Transition to SourceDepot done on live Win2k code base
- Hand built SLM -> SourceDepot migration system allowed us to keep in sync with the old system while transitioning to SourceDepot without changing the code layout.
Source code restructuring
- 16 depots, each covering a major area of source code
- Organization is focused on:
- Minimizing cross project dependencies to reduce defect rate
- Sizing projects to compile in a reasonable amount of time
- To build a project, all you need is the code for that project and the public/root project
- Cross project sharing is explicit
New tree layout
- The new tree layout features
- Root project houses public
- 15 additional projects hang off the root
- No nested projects
- All projects build independently
- Cross project dependencies resolved via public, public/internal using checked in interfaces
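The layout rule above (projects build independently and share only through the public root) is the kind of constraint that can be checked mechanically. A minimal sketch, assuming a hypothetical mapping from each project to the projects it references; the project names, the `ROOT` constant, and the `find_violations` helper are illustrative and not part of the actual Windows tooling.

```python
ROOT = "public"  # hypothetical name for the root project


def find_violations(deps):
    """Return (project, dependency) pairs that bypass the root project.

    deps maps a project name to the set of projects it references.
    Under the layout rule, a project may reference only the root.
    """
    violations = []
    for project, referenced in sorted(deps.items()):
        for dep in sorted(referenced):
            if dep != ROOT and dep != project:
                violations.append((project, dep))
    return violations


if __name__ == "__main__":
    # Illustrative project names only.
    example = {
        "base": {"public"},
        "net": {"public", "base"},  # reaches "up and over" into base
        "shell": {"public"},
    }
    for project, dep in find_violations(example):
        print(f"{project} depends directly on {dep}; share it through {ROOT} instead")
```

A check like this could run before integration so that a direct cross-project reference is caught in the team's own branch rather than in "main".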
Team level independence
- Each team determines its own check-in policy, enabling rapid, frequent check-ins
- Teams are isolated from mistakes by other teams
- When errors occur, only the team causing the error is affected
- A build, boot, or test break only affects a small subset of the product group
- Each team has their own view of the source tree, their own mini build lab, and builds an entire installable build
- Any developer with adequate resources can easily duplicate a mini build lab
- Build and release a completely installable Windows system
- Teams integrate their changes into the "main" trunk one at a time, so there is a high degree of accountability when something goes wrong in "main"
- Build breaks will happen, but they are easily localized to the branch level, not the main product codeline
- Teams are isolated from mistakes made by other teams
- When errors occur, they affect smaller teams
- A build, boot, or test break only affects a small subset of the Windows development team
- Each team has their own view of the source tree and their own mini build lab
- Each team's lab is enlisted in all projects and builds all projects
- Each team needs resources able to build an NT system
- Each team's build lab builds, tests, and mini-bvt's a complete standalone system
Automated builds
- Build lab runs 100% hands off
- 10am and 10pm full sync and full build
- Build failures are auto detected and mailed to the team
- Successful builds are automatically released with automatic notification to the team
- Each VBL can build:
- 4 platforms (x86 fre/chk, ia64 fre/chk) = 8 builds/day, 56/week
- No manual steps at all
- 7 VBLs in Win2k group
- Majority of builds work, but failures when they occur are isolated to a single team
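The build counts above follow from the schedule: two full builds per day (10am and 10pm) for each of the four platform flavors. A small sketch of that arithmetic; the group-wide totals across all 7 VBLs are an extrapolation from the slide's numbers, not figures stated in the talk, and the names are illustrative.

```python
from itertools import product

ARCHITECTURES = ["x86", "ia64"]
BUILD_TYPES = ["fre", "chk"]    # free and checked builds
BUILD_TIMES = ["10am", "10pm"]  # full sync and full build twice a day
VBL_COUNT = 7                   # VBLs in the Win2k group
DAYS_PER_WEEK = 7


def builds_per_day():
    """Every (time, architecture, build type) combination one VBL produces daily."""
    return list(product(BUILD_TIMES, ARCHITECTURES, BUILD_TYPES))


if __name__ == "__main__":
    daily = len(builds_per_day())
    weekly = daily * DAYS_PER_WEEK
    print(f"per VBL: {daily} builds/day, {weekly} builds/week")
    # Extrapolation, assuming every VBL runs the same schedule:
    print(f"all {VBL_COUNT} VBLs: {daily * VBL_COUNT} builds/day, {weekly * VBL_COUNT} builds/week")
```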
Productivity gains
- Developers can easily switch from working on release N to release N+1
- Developers in one team will not be impacted by mistakes/changes made by other teams
- Developers have long, frequent checkin windows (Base team has 24x7 checkin window with manual approval used during Win2k)
- Source control system is fast and reliable
- Testing is done on complete builds instead of assorted collections of private binaries
- What is in the source control system is what is tested