Reading large codebases
Casey asked me:
Any tips on how to read large codebases - especially for more novice programmers?
As it happens, I think that this is a really great question. Part of what makes someone a good developer is the ability to go through a codebase and figure out what is going on. At some point in your career you are going to come into an existing project and be expected to pick up what is going on there. Or, even more nefarious, you may have a project dumped in your lap and be expected to figure it out all on your own.
The worst scenario for that is when you are brought in to replace “those incontinent* bastards” that failed the project, and you are expected to somehow get things working. Other common scenarios include being asked to maintain a codebase written by a person who left the company. And finally, of course, if you are using any Open Source projects, there is a strong likelihood that you’ll be asked “can you extend this to also do that?”, or maybe you are just curious.
Especially for novice programmers, I would strongly recommend that you do just that. See the rest of the post for actual details on how I do it, but do go ahead and read code.
I usually approach new codebases with a blind eye toward documentation / external details. I want to start without preconceptions about how things are working. I try to figure out the structure of the project from the on-disk structure. That alone can tell you a lot. I usually try to figure out the architecture from that. Is there a core part of the system? How is it split? And so on.
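As a rough illustration of getting a first impression from the on-disk structure, here is a minimal Python sketch that counts files per top-level directory and per file extension. The function name `summarize_tree` and the default list of skipped directories are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter
from pathlib import Path


def summarize_tree(root, skip=(".git", "node_modules", "bin", "obj")):
    """Count files per top-level directory and per extension under root.

    `skip` is an assumed list of directory names that usually hold
    generated or third-party files; adjust it for the project at hand.
    """
    root = Path(root)
    per_dir = Counter()
    per_ext = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        # Ignore anything that lives under a skipped directory.
        if any(part in skip for part in parts):
            continue
        # Files directly in the root are grouped under ".".
        top = parts[0] if len(parts) > 1 else "."
        per_dir[top] += 1
        per_ext[path.suffix or "(none)"] += 1
    return per_dir, per_ext
```

Calling `summarize_tree(".")` and printing `per_dir.most_common()` gives a quick sense of where the bulk of the code lives, which is one way to guess at the core of the system before opening any file.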
Then I find the lowest level part of the code and just start reading it. Usually in blind alphabetical order. Find a file, read it all the way, next file, etc. I try to keep notes (you can see some examples of those in the blog) about how things are hooked together, but mostly, I am trying to get a feel for the code. A lot of the code is usually part of the project style: things like precondition checks, logging, error handling, etc. You can learn to recognize those early, and then you can usually just skip them to read the interesting bits.
I usually don’t try to read too deeply at this point, I am trying to get a feeling for the scope of things. This file is responsible for X and does so by calling Y & Z, but it isn’t really important to me to know every little detail at that point. Oh, and I keep notes, a lot of notes. Usually they aren’t really notes but more a list of questions, which I fill in / answer as I understand more. After going through the lowest level I can find, I usually try to do a vertical slice. Again, this is mostly so I can figure out how things are laid out and working. That means that the next time I go through this, I’ll have a better idea about the structure / architecture of it.
Next, I’ll usually head to the interesting bits. The parts of the system that make it interesting to me rather than something that is off the shelf.
That is pretty much it, there isn’t much to it. I am pretty much just going over the code and trying to first find the shape & structure, then I dive into the unique parts and figure out how they are made.
In the meantime, especially if this is hard, I’ll try to go over whatever documentation exists, if any. At this point, I should have a much better idea about how things are set up, so I’ll be able to go through the docs a lot more quickly.
* I started writing incompetent, but this is funnier.
Comments
Commit history is also an excellent source, it lets you see which parts of the code are touched to implement functionality and helps tie the pieces together.
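One way to act on this is to count how often each file shows up in the commit history. The sketch below is only an illustration: the helper names `count_touched_files` and `most_changed` are made up for this example, and it assumes `git` is installed and on the PATH.

```python
import subprocess
from collections import Counter


def count_touched_files(log_text):
    """Count how often each path appears in the output of
    `git log --name-only --pretty=format:` (one path per line,
    commits separated by blank lines)."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line:
            counts[line] += 1
    return counts


def most_changed(repo=".", top=10):
    """Return the `top` most frequently changed files in `repo`."""
    log_text = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_touched_files(log_text).most_common(top)
```

Files that change often tend to be where the active development happens, so they can be a reasonable place to start reading. To see which files change together for a single feature, `git show --name-only` on a specific commit lists them.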
great post, and really practical advice.
Ummm..... "incontinent" is defined as having no or insufficient voluntary control over urination or defecation.
I think you meant incompetent, not having or showing the necessary skills to do something successfully.
I wouldn't go near a project with incontinent developers because the code would be smelly... get it :P
Khalid, did you see the little star near that?
I noticed it after I posted my comment. Sorry, I was laughing so much at the idea of developers being incontinent bastards that I had to go google the definition to be sure that I hadn't been using it incorrectly.
Didn't read yet, but can be useful -- http://www.cs.umd.edu/~basili/publications/technical/T114.pdf
I am curious why you prefer going from the bottom to the top but not vice versa
Idsa, Because it usually gives you a lot more insight into what is really happening. If you go from the top, you see nice interfaces, usually, but I want to look at the gory details.
Great post, btw do you use pen and paper for keeping notes or do you use something like evernote or trello ...
Ayende, do you have a list of codebases that would be suitable for novice programmers to begin reading? Some good codebases where you can see, for example, how DI & IoC are implemented, patterns of context-aware code, error handling strategy...
Visar, I use a combination of notepad for notes and pen & paper if I need to visualize something.
And I don't know if newbies should read DI / IoC code. Those are pretty advanced, and they tend to do a lot of cheats. However, they should be reading application's code, or dedicated samples. You don't need more than this to understand IoC: http://ayende.com/blog/2886/building-an-ioc-container-in-15-lines-of-code
And once you understand how it works conceptually, you can learn how to apply it, then you learn how it actually works.
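To make the concept concrete, here is a toy container in Python. It is a sketch written for this discussion, not the code from the linked post, and the `Container`, `Logger`, and `Greeter` classes are hypothetical names used only for illustration.

```python
class Container:
    """Toy IoC container: maps a key to a factory that builds the instance."""

    def __init__(self):
        self._factories = {}

    def register(self, key, factory):
        # The factory receives the container so it can resolve
        # the dependencies of the object it is building.
        self._factories[key] = factory

    def resolve(self, key):
        return self._factories[key](self)


class Logger:
    def __init__(self):
        self.lines = []

    def log(self, message):
        self.lines.append(message)


class Greeter:
    # Greeter never creates its own Logger; it is handed one.
    def __init__(self, logger):
        self.logger = logger

    def greet(self, name):
        self.logger.log(f"greeting {name}")
        return f"Hello, {name}"


container = Container()
container.register(Logger, lambda c: Logger())
container.register(Greeter, lambda c: Greeter(c.resolve(Logger)))
```

Calling `container.resolve(Greeter)` builds a `Greeter` with its `Logger` already wired in. Real containers add lifetime management, auto-wiring, and many other features on top of this idea.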
Great read. One thing that I usually do as well is run through the code with the debugger. Pick a feature, set a breakpoint at the beginning and step through it. I find that very helpful when I'm trying to get a deep understanding of a section of code.
For me it works best when you first get the big picture: what this software does (or what it is supposed to do), described in words, pictures, diagrams, etc. Once you can imagine how it works, you start looking into the code. Start a debugging session with one feature and go from the very beginning to the end (of the feature), then select another. If you have thousands of code lines and files, there is not much sense in blindly reading files, because you can get the wrong impression of how it works (if you happen to pick files with some special behaviour that are there for a specific reason, or from the old, old days).
I'm agreeing with Dainius. If I look at the codebase we're currently maintaining, vast parts of it are obsolete (and often outdated), and you'd waste valuable time investigating those.
Then there are the "legacy" projects: things that are still up and running but have not substantially been touched in ages. These are usually full of bad practices, and the only time you'd need to open those is if there was an issue involving them or if something needed to be deactivated.
I cannot imagine anyone entering our team and not getting any guidance on where to start, what to ignore, being explained how we work, what the coding guidelines are,... All I've heard from colleagues are similar stories, and the last time I've ever gotten a large codebase dumped in my lap without any explanation and was told "there is this problem, fix it" was over a decade ago.
Great advice. For larger code bases, I would also recommend using a tool to help you find your way around. When I first started poking around the Microsoft Shared Source CLI (Rotor) code base, I used a tool Joel Pobar mentioned in a blog post called Source Insight. It's not free (it's a couple hundred USD as I recall), but man did it make spelunking through all that native code a little less of a Sisyphean task (Rotor has more than 3 million lines of code, though my interest was just in the execution engine). The most important thing is to read code...but there are definitely tools out there to make it easier.
Bert, The counter point for that is this post: http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172-222-a-second-for-45-minutes
Obsolete / outdated code either should be removed, or should be treated as value code.
Ayende, I agree that we should have taken better care of the codebase, but the reality of life is that there simply was no time for this. We're a team of contractors in a major corporation and the applications we maintain are supporting the business: reporting data, aiding with planning,...
Unfortunately, upgrading old code to current standards is not regarded as a priority since "the business" will not see any noticeable impact: everything will work as it previously did. (If we did the upgrade correctly, that is, because otherwise we might actually introduce bugs into a system that has been running smoothly for years.)
Any major development we're planning on doing requires a ROI justification, and unfortunately some of these upgrades will take a significant amount of time and manpower. "Ease of maintenance" does not really come into play, since we're rarely touching those projects anyway.
The reality is that there's a long list of improvements and developments the business would actually like/need, and those take precedence over spending time and effort on upgrading code that works "perfectly fine".
My best hope is that these outdated projects eventually end up becoming obsolete, and that we can remove them from the repository (which we've recently started doing). Sometimes we don't even know that certain applications have become obsolete ages ago, and that we have been sending or receiving useless data blobs to/from other systems. That's the reality of maintaining a codebase that has several million lines of code and numerous applications.
Do you have any code bases that you consider "mandatory reading"?
Andreas, The stuff that interests me is mostly databases, so I read a lot of their code.
How would you tackle the NServiceBus codebase? It uses custom build scripts and ILMerge to combine stuff into single assemblies. The directory structure is not relevant at all. I'm curious, because I find that particular codebase to be VERY hard to read, but others not so hard. Is there a structure you recommend to allow easier reading for future developers?
Kim, I have gone through the NServiceBus codebase: http://ayende.com/blog/3207/nservicebus-review
Nice post with a lot of good thoughts. I usually complement source code reading with a quick stack walkthrough for some simple samples. Nothing like seeing a working program from a whitebox point of view.