This book aims to capture everything I have found incredibly useful in delivering successful embedded projects, but that is often overlooked (or unknown) by embedded teams "in the field".
This thread started as a topic list, built in collaboration with a number of other engineers (todo: list everyone from the original embedded collective!) as a table of contents.
I am treating the twitter thread as the rough draft of this book, and capturing the tweets here, largely unedited, for now. In the future, I may decide to edit this and prepare it to be an actual book. For now, enjoy my passing twitter thoughts.
This book is being written as an open source project! If you want to add any topics, feel free to open an issue or PR on the GitHub repo. Please open an issue first to discuss before writing a giant chapter!
We'll start in the "Build" category, everything that is related to actually building your firmware project. We called this the "100 level course".
You'd be surprised how many embedded teams out there STILL aren't using version control, either git, svn, hg, or whatever else. The tool you use doesn't matter, but version control is important for a couple reasons: You need a way to refer to specific versions of your software on a change-by-change basis.
What version did you send to be tested? What is the version that is on the unit that is acting funny?
Is it the real 0.4.0, or did someone just forget to bump the version number?
You also want to see how your project changed over time, and at some point, figure out where/why some feature or bug got introduced. Version control is like having "save points" for your code, and lets you easily "roll back time" to test things out, or go back to a known-good state.
Version control also is a necessary building block for a bunch of the other things we are going to talk about later. Without having some kind of version control, this all gets much more complicated.
It's okay if you never learned these tools! A lot of universities didn't start teaching them until relatively recently (they didn't when I was in school 2007-2011).
But seriously, getting moderately good at version control is like learning a software development superpower.
Ideally, your whole team should learn a little version control, even the hardware and non-technical members of your team. This helps them work more seamlessly with your firmware team to test and verify things.
You should also talk to your team about how you use version control (or any tool, really), and come up with an agreed-upon approach.
When do you do merges? When do you rebase? Should the code on main always build? Always pass tests? Is main only for production releases?
The actual strategy doesn't matter, but consistency does! So pick something, write it down, and change it if you need to! (this will be a recurring theme for the whole book).
@PhilKoopman calls this having "just enough paper", and I love this term and use it with my clients a lot.
Also, side note, you should go buy Koopman's "Better Embedded System Software" book. It is seriously one of the best references for introducing teams to important topics inside and outside of the safety critical domain.
It does an amazing job of explaining a lot of systems concepts you should know.
Speaking about "just enough paper", that brings us to the next topic!
This one is a stepping stone for later, but is actually really important to do as early as possible in any non-trivial project.
The gist here is:
You should have a list of every tool you use. List the tools, the versions you use, and how you use them.
Not just GCC, WHICH VERSION of GCC? The 2019-q4 release? Or some locally built custom fork? If possible, keep the installer (or a SHA of it) somewhere too.
You probably use more tools than you think!
- debugger/tools like openocd/j-link
- binary utilities like objcopy/nm
- serial port tools
- and more
You also should have a "reference" environment, e.g. "Ubuntu 20.04" or "Windows 10 64-bit". Document it!
Now, you probably don't need to document the editor you use (unless it's an IDE, like IAR, that is also a compiler), but you probably do want to document "side" things that are relevant, like your version of git, etc. In avionics, we would have periodic checks that EVERYONE had the right versions installed, especially when running official tests, or building official firmware.
This isn't usually necessary, but you might want to have one "reference" PC or VM set up EXACTLY like this. Why? Because you should be doing all your "official builds" on this reference machine. Sending a unit for test? Shipping a firmware?
USE YOUR REFERENCE ENVIRONMENT!
This makes your firmware MUCH more reproducible!
If a firmware is acting funny, check if you are using the reference environment! Did you upgrade GCC? Or start using a different linker?
That might be the source of your problem! At least it's something to check.
Later on, we'll replace this environment with an official "build server", but having one clear reference PC or VM (or container) sitting around is good enough for now.
Keep this PC/VM somewhere safe after the project ends, too! You want to be able to re-create it if needed.
If you need to update the firmware "with just one simple fix" in 2-3 years, or even 3-6 months, can you tell me what tools you used?
Do you still have the 2018-q2 installer for arm-none-eabi-gcc? Will there still be a reliable download link somewhere?
This is where VM snapshots or exported containers shine: You can put them on a redundant backup or cloud server somewhere, and take them off the shelf later. (Keep a copy of your VM tool too!)
Oh, and the libraries you use are part of this environment too! What version of LwIP are you using? The vendor's BSP/HAL? That one library you randomly downloaded from GitHub?
That should be either:
- In version control (in a repo that you own, not someone else's), OR
- In your docs
Plus, you have version control! Why not just put this in a text file in your source code repo? It's a good excuse to start a docs/ folder. Call the file software-development-plan.txt, and impress your managers! And then you can't lose it, because it's with the code!
Going back to @PhilKoopman's "Just enough paper", you don't need a fancy document or tool to capture this, use whatever you have already! Text file in a repo? Great! Page on your company wiki? Sure, as long as people actually look at that.
You gotta keep it up to date, though, and this is important! It's okay to change versions as you go, but make sure you always keep the doc up to date, and your "reference environment" up to date with what's in the doc!
You should double-check this on every "official" build.
What does it mean to have a "one touch build"? Anyone on your team should be able to make a working firmware with one click, or one shell command or script.
This seems like a really simple thing, but if you have to:
- open the IDE
- change a config value
- copy the hex
- run a specific objcopy command
- edit the hex file in one specific place
- merge in some font data or images or something
Someone is going to make a mistake!
Open source tools like cargo, make, cmake, or ninja make this way easier to automate! Wrap it in a shell script or whatever you need.
IDEs like IAR are a little harder, but many still have a "headless" or CLI mode you can use, if you read the docs.
Have different configurations? RAM builds? Debug builds? Semi-debugging "whoops" builds?
Have a single command or button for each! You want to make sure your team is all building the firmware in the EXACT same way.
You'll notice consistency as a running theme here. You want to eliminate potential sources of human error.
make release is a lot more direct than a 10 step list from an email 6 months ago.
Plus, you can put this in version control, and document it in your build environment!
Now when someone new starts on your team, or you bring in a contractor, you don't have to sit with them for a day (or week) to get their environment set up! Just send them the docs, and when "make release" works, they are good to go!
This step is also really important for later when we start talking about testing, or continuous integration. That's still a bit far off, but we're starting to take the first steps in that direction!
Again, what tool you use doesn't matter. Just do something, and document it!
Oh, and this script/makefile/cargo.toml or whatever is also important documentation!
It documents all the command line arguments you use in all of your tools, so you can't forget what flags you pass to your linker, or how you merge your image assets into the final binary.
This is documentation you don't even have to think about, but it's just as important as the tool versioning document you wrote.
This one tends to get a bit contentious, but it doesn't have to be! A style guide covers a couple things, mostly how you format your code (tabs, spaces, etc).
Now, I'm going to start off by saying: the actual style you choose doesn't really matter. Tabs, spaces, same line brackets, whatever. Do whatever your team can agree on.
But be strict about it! The rules can change over time, but the key here is (again) consistency!
Basically, if you ever have a debate about how to format something, just immediately talk to the team, make the decision, and stick with it.
You should definitely write all your decisions down somewhere everyone on your project can see.
docs/style-guide.txt is a good start!
There are also tools to automate this, so your team doesn't have to do it manually.
For Rust there is cargo fmt; for C/C++ there are tools like clang-format. Find one that automates the things you want, or has a default you (mostly) like!
And make sure you have "one touch formatting" as well! In Rust, this is usually just cargo fmt, but for C/C++ you might want another script that runs the formatter for you, so everyone knows how to do it.
Set the expectation that every change in version control is formatted.
Having a tool to do this is a "great equalizer", it's not person A vs person B's opinions, you just do what the tool does. It takes the thought off the problem.
It also makes it easy to document your style: it's captured in the config file for your formatting tool! Having consistent code formatting helps in a couple ways:
- It makes odd things harder to hide. You'd be surprised at how good people are at noticing something "looks odd", even if they don't know why! Capitalize on this.
- It reduces unproductive discussions during code review. Don't waste time making 20 formatting nitpicks, just one "run the formatter" comment, then don't talk about format in code review!
- It makes changing the style the same as changing your code: If you want to change the rules? Okay! Open a pull request to change the formatter config file, and discuss the pros/cons there. Now you have a record of why decisions were made too!
As a note: you should probably have team policy on what to do whenever you change the formatting rules.
Do you fix the whole repo at once? Or do you reformat things as you go?
I have a preference for the former, but your team may disagree. Write down your decision!
Code formatters aren't perfect, and you may want to exclude some files or blocks of code from automatic formatting. This is pretty easy to do with most formatters, but make sure you have a good reason before doing it!
Code formatters also have the side effect of "touching" lots of your code at once.
I suggest keeping 'code formatting' as separate commits or separate pull requests to any functional change to the code. This makes the changes easier to review.
I tend to make my functional changes, open a review, get feedback, then do a formatting commit before merging the PR. This works for me, but find something that works with your team!
While you're writing down style, you might also want to think about documenting how you do other things in your project consistently:
- What do your commit messages look like?
- What do your issues/tasks look like?
- How are files and folders named?
These are things that are harder to automate, but worth writing down! Again, consistency is the most important policy, so decide something for now, and write it down!
Any time you have a "should we do X or Y" discussion, WRITE IT DOWN and never have it again.
Don't worry about capturing everything at the start, just keep it up to date! Again, this helps bring people on to your team smoothly, and will make your code reviews (we'll get to those later) a bit more stress free.
Compiler warnings are a very important thing to manage, especially for languages like C/C++ (vs Rust) where safety is a bit more optional.
For greenfield/new projects, you should start your project with as many warnings turned on as possible! This is valuable feedback: it's the compiler trying to save your butt while you're developing!
I highly recommend this post from @MemfaultHQ as a good primer if you're not sure what to set:
It talks about which set of warnings and errors have the best relevance, as well as "signal to noise ratio".
Add these to your "one touch build" scripts!
Furthermore, I'd like to reinforce that by default, you should treat all warnings as errors (-Werror in C/C++). AT LEAST in your "official builds", but it is a good habit to keep. You can always disable warnings on a line-by-line or file-by-file basis.
In Rust, I'd generally suggest #![deny(warnings)] for APPLICATIONS, but not for libraries. Once you have automated testing, this is something that can also be configured on the command line.
This thread is a good discussion about this:
For legacy code, or code "out of your control", like LwIP, or vendor provided HALs, warnings are a bit more complicated.
I'd suggest you run the code at least once with the "strict" warning settings, and spend a reasonable amount of time reviewing these.
This is part of the due diligence required when using someone else's code! Your customers don't care if the bug was in someone else's code, it's in your hardware!
You should fix any bugs you find (and report or PR the issue/fixes!), and repeat this any time you update the lib.
Once you've done this, you might want to change/reduce the warning levels for these library files, but retain the warnings on the rest of your code.
This isn't okay for avionics, but I'd say it's "above average" for consumer projects.
Ideally if you use C/C++, you should try having your code built by at least two compilers. This is useful for a couple reasons:
- It keeps you honest about writing portable code.
- Different compilers have different warnings. You want all the info you can get!
- When using GCC/Clang, this is a great way to get a "second opinion" for free (in terms of cost).
That being said, sometimes it can be some work to keep code totally portable, and building on both platforms. You can cheat a bit with your second compiler...
For example, if my main compiler is GCC, I don't really care if Clang produces a working binary at the end (that's a 'nice to have'). This can help with some of the painful portability details - just stub out the hard parts for now, and clean up later if you can.
If you are using a proprietary compiler, having a (mostly) working build using open source tools is also a good backup plan, if you decide to switch off that platform in the future.
OSS compilers also tend to update at a faster rate, so you can try to get new warnings.
Going back to disabling "noisy" warnings: wherever you do this (in a Makefile, in the code), you should get in the habit of documenting WHY you are disabling it, and if possible, what would need to change to re-enable it in the future.
"// Disable noisy warning" isn't useful to anyone.
"// Disabling warning W0234: we read 'uninitialized' values from RAM, which are placed there by the bootloader. This can be removed if we change to the XYZ configuration scheme" is useful to EVERYONE, even if you never remove it.
My opinion is: You can always break or bend the rules, but you should always expect to justify why. It should take more effort to break the rules than to follow them, so you only break them when it is absolutely necessary.
I'm going to break Code Review into two main kinds:
- Continuous Reviews
- Inspection Reviews
The first of these, you are more likely to have run into. The goal of Continuous Reviews is to focus on the (ideally small) changes that are made to the software, one at a time.
If you do reviews during a Pull or Merge Request on GitHub/GitLab, these are a good example!
By focusing primarily on the changes, you get a better chance to see "micro" level changes. Did someone change what a function does? Did someone add a bunch of new code? Does what they changed make sense?
These reviews are great for giving directed feedback (could this be done more simply? Is this adequately tested?), and for catching bugs that are hard to see at the "large scale", like off-by-one errors, forgetting to validate some input, or forgetting to check whether a pointer is null (in C/C++).
These reviews typically aren't always great for seeing "the big picture" though. Has this file gotten too large? Is some struct or class doing "too much" now? Does it still make sense with the original plan? That can be hard to see when it happens one change at a time.
The good thing is that these reviews are quick! You only have to review 10s or 100s of lines of changes at a time usually, which means you can get feedback from others quickly. That's a valuable part of the process! You should generally make sure every change gets a review first.
The other kind of reviews, Inspection Reviews, are less common outside of the safety critical industry, but I think they are a valuable counterpart to Continuous Reviews.
Inspection Reviews focus on the "macro scale", rather than the "micro scale".
This is a great place to look at your system architecture, where resources are being used, whether your software is consistent, and whether there are large-scale issues that could cause problems for your project in the future.
Inspection reviews typically involve:
Having multiple people sit down with the code (on a shared screen or monitor, or even printed out on paper), as well as any documentation, architecture diagrams, or any other planning docs you have done.
You want to have someone "drive" and explain each part of the code, and allow the team to ask "why" questions, as well as considering "does this make sense?"
This is not the time for nitpicks, this is the time to put on your "systems engineering hat".
Think about how the different parts of your system interact with each other, either through message passing, function calls, shared memory, semaphores/mutexes, or any other way.
Is the system matching the original design? Or is it growing in an odd organic way?
Inspection reviews can take quite a bit longer to complete (hours, days), and will involve multiple people.
Because of this, it's hard to run them too often. I'd suggest AT LEAST doing this while getting ready for a big release (e.g. testing build, alpha release), or right after.
If it's a while before you release, maybe consider having one every 4-6 weeks or so, depending on how much is going on with your project. Sometimes it keeps you from spending weeks working in the wrong direction!
For both kinds of review, there is one important concept to keep in mind: Respect.
You want Code Review to be valuable, and that means you want people to give honest feedback, and you want people to be receptive to that feedback.
You're working together for the project to succeed.
This is NEVER the place to make fun of someone's code, to insult what they have done, or any other unprofessional behavior.
Reviews are an AMAZING chance to share knowledge. If the reviewer can't understand the code/change, that's a sign you need better comments/docs!
Sidenote: I draw a distinction between "business formal" and "professional" in reviews here. Emojis are welcome! Shitposting is not.
You want people to look forward to reviews, because it is a chance for them to share what they've done, and get valuable feedback!
Also, new or "junior" developers can sometimes be the best reviewers! They are still learning, and that is a chance for established or more experienced devs to share knowledge, and to spend time looking at the changes they've made from a teaching lens.
I know I've caught bugs of my own while explaining why my code was right (spoiler: it wasn't) to an intern. We both learned a lot that day.
Sure, senior devs can help you see things too, but any second set of eyes will bring a positive influence to your code.
You should also automate any steps of the review you possibly can.
- Code formatting? Automate it.
- Code LINTing (we'll talk about that later)? Automate it.
- Do the tests pass? Automate it.
Review time is EXPENSIVE, use it on the important stuff that you can't automate.
This also helps preserve as much of your "frustration budget" as possible: people are more engaged when the feedback they get is actually interesting.
Getting 37 comments about "this should be 4 spaces" is a good way to sap all the energy people have on reviews, and bring you no value.
I find that there are a few common obstacles to effective reviews:
- Not enough time
- Not enough (experienced) people
- Not sure what to do during a review
These are all valid in some cases, but reviews are IMPORTANT. You need to make room for them.
Regarding "Not Enough Time": This is largely a management issue. The time it takes to do a review, and iterate on the feedback, should be included in time estimates. You should include them, and your manager or client should expect and respect them.
Code reviews help you catch issues EARLY, before they get buried deep into hard-to-reproduce situations. They help you spend less time testing, less time debugging, and less time recalling devices from the field.
(Good) managers think a lot about risk, in the context of a project. Reviews are one of the best ways to significantly de-risk a project, with respect to defects, and schedule slips.
It's easier to budget 10% extra time for code reviews, than an "oops" 60% schedule overrun.
Regarding "Not enough (experienced) people", this is one I unfortunately see on a lot of embedded projects. It happens the most when you only have one "main" developer on a team, and maybe 1-2 people who sometimes help out.
This is a dangerous place to be (speaking to managers).
Ideally, no project would ever have only one developer on it. That leaves the project exposed if that developer gets sick, or gets pulled off to some other urgent project.
Still, if you find yourself in this situation, there are options:
If your company has similarly skilled embedded people, just on another project, consider finding a "review buddy" on another team. You should learn about their project, and they should learn about yours.
In a pinch/crunch, you can also join the other project on short notice.
If you are the ONLY embedded person in your team or group, this is also a liability. Try to find someone, anyone, who is interested in embedded or the kind of project you are working on. Can't find anyone? Insist your manager does reviews with you until they hire someone else.
These reviews will be like having an intern review your code: You should explain EVERYTHING to them, the goal isn't necessarily for them to make good comments, but to instead ask questions that make you think about your code in a new way, maybe uncovering bugs.
Regarding "Not sure what to do during a review": This one is totally valid! If you've never worked on a project with good reviews, it can be hard to feel like you're "doing it right".
When in doubt, just have a conversation about your changes!
Ask about things you don't understand! Anything you talk about would probably be good as comments for the long term as well.
See something that smells funny? Ask about it! Chances are you'll catch something odd, or something that could be more clear. Or you'll learn something!
Also, take note of the kinds of questions that were valuable in a review.
Eventually, it will make sense to collect these into a review checklist you can use to help jog your memory and ask good questions.
Do all pointers get checked for NULL? Is input always validated?
This checklist is also a great training tool for new interns. Eventually I'll write a good baseline checklist to start with, but often, the details of your project make it hard to have a "one size fits all" checklist. What works for avionics may not work for consumer devices.
Really, at the end of the day, the usefulness of reviews boils down to one thing:
It's good to have different perspectives of the code that you write.
You get one for each of:
- You write it
- You test it
- You explain it to someone else
- Someone else gives their perspective
Reviews give you a chance for (at least) two new perspectives on your code (per reviewer!). Don't waste these! Going from one perspective to three is a huge strategic advantage, and increases the chances of getting things right the first time.
So! That's reviews. Think about this the next time you are tempted to just type "Looks Good To Me" and hit merge without thinking about it, or when you think about committing directly to main without opening a PR first.
This is all of the ways that you observe your running code to see if it is working correctly - or try to figure out why it isn't.
This one is the easiest form of debugging, because you don't need any special equipment. Usually you just need one or more LEDs, or the usual output of your system (motors, speaker, etc.)
Although your eyes aren't a great way to debug super fast things, like whether you are sending the correct data or not, there is a lot you can tell with just your eyes.
For example, you can blink an LED to show that the device is still alive, or performing some task. If the LED stops blinking? You know it got stuck somewhere.
Have a couple LEDs? Make one blink for each task. This way you can see if just one part of your system is getting stuck.
Try and use something dynamic (like a blink) so you can still see "signs of life".
Another clever trick to see "how far in a sequence" a device gets is to use a couple LEDs to output a binary value that represents a state in a state machine.
If you have two LEDs, you can distinguish four states.
This doesn't work for states that change rapidly (faster than your eyes can see), but if your system is getting "stuck" somewhere, walking through the states in your state machine can show you exactly where it stopped!
It also really helps to take notes about what you change, and what the effects are. As soon as you take notes, it's a science experiment!
Try to predict what effect your code changes will have on the resulting output. This way you can verify whether your guess was correct.
Later, we can reuse these same techniques along with tools like an oscilloscope or logic analyzer to see even faster changing patterns.
You can also use things like RGB LEDs to show colors, which can help simplify things. "green" is easier to remember than OFF, ON, OFF.
You can also get secondary colors in there too, without even worrying about fancier techniques, like PWM:
Note: some people are colorblind, and so RGB is less useful. It's always good to know if anyone on your team has troubles with things like this, so you might choose to use 3 discrete LEDs instead of an RGB if this impacts your team.
Also good to keep in mind for released products.
On a recent project, I had a wireless device getting "stuck". I didn't have a debugger attached, so I wanted to see the problem at a glance.
I used Red with a long blink for "background task", Green for "Received Packet", and Blue for "Sent Packet". Short Red blink was for errors.
I could watch it blink through colors at different rates, and realized there wasn't a problem with my wireless device at all! It stopped receiving messages at some point. The problem was actually on the OTHER wireless device connected to my PC!
This problem was really intermittent, sometimes taking 1-6 hours to show up. But because it was blinking LEDs, I could just look from my couch whenever it was acting funny, and see exactly what was going on.
I wrote a library in Rust to help out with this. It also lets you do stuff like have patterns with different lengths (short, medium, long blinks), or even stuff like Morse code!
You can get a surprising amount of info from simple LEDs.
But in summary, it's helpful to think about the tools you have for observing at hand already, and to think about your expectations for how your system works vs. how it is actually behaving.
Sometimes you need fancy tools, sometimes a simple LED is even better.
This is actually one of the most common debugging techniques used in practice, and while it tends to get a little bit of shade sometimes, there's a surprising amount of depth here.
When developing desktop programs, it's super common to just "print to the console" when you are developing. Whether that is messages like "Here", "Here2", "why are we here", or more structured logging with timestamps or line numbers, this is printf/log debugging!
Now, most embedded systems don't have a "console" like a desktop program, but the status quo here is typically using a serial port or UART to print text, and to use a USB-to-Serial adapter to see this logging information.
Most development boards have this built-in these days.
For the most part, this is pretty equivalent! You "print" text in your program, and you see the output on a command prompt (or console window) on your PC.
Many RTOSes or other environments also automatically wire this up to printf in C, so you might not even have to set it up.
BUT, since we don't have an operating system like Linux managing our output console, there are a couple "gotchas" to use print debugging on embedded systems.
For one, the serial/uart link is generally relatively slow. Often this is 115200 baud.
Now, 115200 is more than fast enough for humans! I guarantee you can't read that fast. BUT, if you dump a lot of text to the console in a loop, this means your embedded system will do one of two things:
- It will block (or wait) until it is done sending the message
- It will buffer up the message, and slowly drain it out in the background (if you have something to manage this for you, like DMA/interrupts or an RTOS task)
The problem is that at 64 MHz, sending a byte over the UART takes something like 22,000 times longer than just copying it in memory.
Even if you have buffering, eventually your buffer will fill up, and then you have to wait, or lose messages!
If your system blocks to send, then you can introduce unexpected delays in your program that don't exist when you aren't printing. This can cause timing heisenbugs:
"When I print, it works! But when I don't print, it doesn't work!"
This is often because the delay from printing is enough time for something to complete, or for a message to be actually received, etc.
This is really confusing when it happens to beginners OR experienced folks!
Also, it's important to realize that formatting (in C, C++, AND Rust) is relatively "expensive" to do!
There's a huge difference between a formatted call like printf("Hello %0.02f", 42.0) and pushing out a fixed, pre-built string, in terms of memory usage and runtime speed.
This adds to the timing instability or performance loss, and can significantly increase your code size.
There are often "formatting alternatives" or "lightweight printf" options, but they often reduce functionality (for better or worse).
Okay, but the story isn't all bad. To its credit, there are a couple of things print/log debugging does MUCH better than the alternatives, like debugging with GDB.
Firstly, the knowledge required for this is much lower, and more accessible for beginners.
The amount of clarity those simple "here", "here2", "lol", "what" messages can give when you are trying to figure out what a program is doing is amazingly powerful!
Plus you can take that log output, and analyze it "after the fact" to follow what your code was doing, and why!
Like with LED debugging, this works best using the scientific method:
- Guess at what your system is doing
- Add print statements
- Run, and analyze the results
- Repeat until everything is clear
I do this all the time, on embedded and on desktop!
Second: Debuggers like GDB typically require "stopping the world" to step through code at "human speed". This means that a lot of odd things can happen:
- Interrupts may not fire as normal
- You might miss messages or events
- Hardware can "time out"
(as a note, it is often possible to write gdb scripts to alleviate these issues, but I'd consider this fairly "arcane knowledge" still, unfortunately).
This isn't to say GDB debugging is bad, but it's just a different tool with different qualities to print debugging!
With print debugging, even with a slow UART, it is way less "timing intrusive" to log-as-you-go, and see the ordering of what your system does at "full speed".
For example, if you are doing USB or Bluetooth, these protocols are very timing sensitive!
If you hit a breakpoint and pause for three seconds, you've totally lost your USB/BT connection! Now you have to reconnect to debug further.
With printing, you can probably log and move on, without disrupting the more timing sensitive parts (as long as you don't log too much).
Especially when combined with an accurate timestamp (provided by the device), this "sequential log of information" can make-or-break tracking down hard bugs.
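A device-side timestamp can be as simple as a logging macro that prefixes every line with a monotonic millisecond counter. This is a hypothetical sketch: `now_ms()` stands in for whatever tick source your MCU actually has (e.g. a SysTick-driven counter), simulated here with a plain variable:

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a hardware millisecond counter (e.g. SysTick-driven). */
static uint32_t fake_ticks;
static uint32_t now_ms(void) { return fake_ticks; }

/* Prefix every log line with the device-side timestamp, so the host
 * can reconstruct ordering and timing after the fact. */
#define LOG(buf, fmt, ...) \
    sprintf((buf), "[%lu] " fmt, (unsigned long)now_ms(), ##__VA_ARGS__)
```

On the host side, a sequential log of `[12345] usb reset` style lines lets you measure the gaps between events, not just their order.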
In terms of mitigating the cost of print debugging, there are a few "power user" techniques that are great to add to your toolbox. At least for Cortex-M devices, these include:
- Semihosting
- ITM/SWO
- RTT
- Memoized logging
- Deferred formatting
The first is Semihosting, a technique of "logging over the debugger", instead of over a serial port.
On the plus side: it's very easy to implement, and if you have a JTAG/SWD debugger, you don't need a separate serial port! All these messages just go through the debugger.
The downside is it is SLOOOOOOW, especially with cheap debuggers. Sometimes taking 100s of ms just to send a single message!
But it can be useful for debugging before you even get a serial port working, and works on any Cortex-M chip!
The second is ITM/SWO, which is like a "debugger accelerated serial port". The upside is that it is MUCH faster than Semihosting, and only a bit more complicated to use.
The downside is it requires debugger HW support (the cheap ones usually don't), or a separate fast USB-UART.
But you can often drive these incredibly fast, definitely above the MHz speed, sometimes up to the 10s of MHz, which means the logging delay (without formatting) is much lower.
The third is RTT, which is somewhere in the middle.
Like Semihosting, it operates over your existing SWD/JTAG connection, so no additional HW needed. However it uses a more efficient technique (a ringbuffer on your MCU memory) to have much better performance.
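The core idea behind RTT can be sketched in a few lines of C. The names and sizes below are illustrative, not SEGGER's actual control-block layout: the MCU writes log bytes into a ring buffer in its own RAM, and the debug probe reads the buffer (and advances the read index) over SWD without ever halting the core:

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 64u

typedef struct {
    uint8_t buf[RING_SIZE];
    volatile uint32_t wr;   /* written by the target */
    volatile uint32_t rd;   /* written by the host/probe */
} ring_t;

/* Target side: copy as many bytes as currently fit, drop the rest,
 * so logging never blocks the firmware. */
static size_t ring_write(ring_t *r, const uint8_t *src, size_t len) {
    size_t n = 0;
    while (n < len) {
        uint32_t next = (r->wr + 1) % RING_SIZE;
        if (next == r->rd) break;   /* full: drop instead of waiting */
        r->buf[r->wr] = src[n++];
        r->wr = next;
    }
    return n;
}

/* Host/probe side: drain whatever the target has produced so far. */
static size_t ring_read(ring_t *r, uint8_t *dst, size_t max) {
    size_t n = 0;
    while (n < max && r->rd != r->wr) {
        dst[n++] = r->buf[r->rd];
        r->rd = (r->rd + 1) % RING_SIZE;
    }
    return n;
}
```

The real implementation also has to handle multiple channels and what to do on overflow (block, drop, or overwrite), but this is the shape of it.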
Memoized logging is a technique that requires a little setup, but can help if you really need to log a lot of info.
Basically the idea is instead of printing something "human readable", you print a short "message ID" instead.
So instead of "Hello, world!", you just send the number 1.
An easy way to do this is to use an enum for "status codes", so you never log text, just numbers! This means you don't need
printf, and send way fewer bytes over the wire.
The downside to this is that (typically) this requires a little more planning than just
printf, because your device and host both need to speak a common protocol!
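A minimal sketch of that common protocol, with made-up message names: the device sends a one-byte status code, and the host maps it back to a human-readable string. Both sides share the enum:

```c
#include <stdint.h>

/* The shared "protocol": both device and host agree on these IDs. */
typedef enum {
    LOG_BOOT       = 1,
    LOG_USB_READY  = 2,
    LOG_SENSOR_ERR = 3,
} log_id_t;

/* Device side: this single byte replaces a whole printf call. */
static uint8_t encode_log(log_id_t id) { return (uint8_t)id; }

/* Host side: expand the byte back into a readable message. */
static const char *decode_log(uint8_t byte) {
    switch ((log_id_t)byte) {
        case LOG_BOOT:       return "Device booted";
        case LOG_USB_READY:  return "USB enumerated";
        case LOG_SENSOR_ERR: return "Sensor read failed";
        default:             return "Unknown message ID";
    }
}
```

One byte on the wire instead of twenty, and no format machinery on the device at all.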
The last item here is "deferred formatting". The gist here is: instead of formatting
printf("Hello: %0.02f", 42.0) on the device (which takes time and resources), you just send "Hello: %0.02f" and 42.0 over the wire, and let a more powerful system do the formatting for you.
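A toy version of this idea in C (the table and wire format here are made up for illustration): the device ships a format-string ID plus the raw argument bytes, and the host, which holds a copy of the format-string table, does the expensive formatting:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Shared table: the device only ever sends the index. */
static const char *fmt_table[] = {
    /* id 0 */ "Hello: %0.02f",
    /* id 1 */ "Battery: %0.02f V",
};

/* Device side: 1 ID byte + 8 raw bytes of the double. No printf,
 * no float formatting, just a memcpy. Returns bytes written. */
static size_t defer_encode(uint8_t *wire, uint8_t fmt_id, double arg) {
    wire[0] = fmt_id;
    memcpy(&wire[1], &arg, sizeof arg);
    return 1 + sizeof arg;
}

/* Host side: look up the format string and render it. */
static void defer_decode(const uint8_t *wire, char *out, size_t outlen) {
    double arg;
    memcpy(&arg, &wire[1], sizeof arg);
    snprintf(out, outlen, fmt_table[wire[0]], arg);
}
```

Nine bytes on the wire, and all the formatting cost moves to the host. Real tools also handle mixed argument types and get the table out of the firmware image entirely (e.g. by stashing the strings in a linker section that is never loaded).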
If you're developing embedded systems in Rust, you're in for a treat! Ferrous has a tool called
defmt which does memoization and deferred logging for you automatically!
We also have a tool called
probe-run which handles setting up RTT for you as well!
You can check out the tools here, to get an idea of how they work.
Observing your embedded system is all about looking at what your devices do after they are up and running, often when they leave the development lab or bench for the first time.
Embedded systems, especially early in the development process, tend to reboot pretty often. They can reboot for a lot of reasons:
- A firmware update
- A hardware bug
- A software bug
- Low battery
- Coming in and out of sleep mode
And a lot more reasons!
The problem is, you can't always tell why. Your system was sitting there all fine, and then it blinked the "reboot" sequence, or you saw the boot screen flash for a second.
What was that? An off-by-one error in the code? Did the power supply hiccup? Have your device tell you!
Often, your CPU will have a status register that reports some reset reasons. These often include:
- Power brown-out
- Debugger commanded reboot
- Clean Power-On
- Watchdog timeout
- Reboot due to exceptions
These are one cause for resets, but there are often others.
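Decoding that register usually looks something like the sketch below. The bit layout here is invented; check your MCU's reference manual (for example, RESETREAS on Nordic parts or RCC_CSR on STM32) for the real flag positions:

```c
#include <stdint.h>

/* Hypothetical reset-reason register bits; real layouts vary by MCU. */
#define RST_POR      (1u << 0)   /* clean power-on */
#define RST_BROWNOUT (1u << 1)
#define RST_WATCHDOG (1u << 2)
#define RST_SOFT     (1u << 3)   /* software-requested reset */
#define RST_DEBUGGER (1u << 4)

/* Report the most interesting set flag (several can be set at once). */
static const char *reset_reason_str(uint32_t reg) {
    if (reg & RST_WATCHDOG) return "watchdog timeout";
    if (reg & RST_BROWNOUT) return "brown-out";
    if (reg & RST_SOFT)     return "software reset";
    if (reg & RST_DEBUGGER) return "debugger reset";
    if (reg & RST_POR)      return "power-on";
    return "unknown";
}
```

Note that on many parts these flags are sticky across resets, so you typically also need to clear the register after reading it.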
If you hit an
unwrap!(), or a failed
assert in your code, you may do a software reset that looks the same as rebooting into the bootloader.
For these reasons, you may also want to store some trace of this BEFORE you reboot.
Often, the easiest way to do this is to put it in a chunk of RAM that doesn't get initialized on boot. The idea is that:
- When you panic, write a flag to uninit memory
- Cause a soft reset
- On boot, check for that flag
- If you see it, report it then clear the flag
You also may want to write a "magic word" to memory, to ensure that you can tell the difference between a random set-bit in memory, and an intentionally set flag. Another option is to have a longer section of data, and include a CRC.
You should wipe the CRC or magic word on read.
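Putting those pieces together, a persisted panic record might look like the sketch below. It's simulated in a normal struct here; on a real MCU this would live in a linker section marked noinit/NOLOAD so it survives a soft reset. The checksum here is a toy additive sum standing in for a real CRC:

```c
#include <stdint.h>
#include <string.h>

#define PANIC_MAGIC 0xDEADC0DEu

typedef struct {
    uint32_t magic;       /* distinguishes a real record from RAM noise */
    uint32_t len;
    char     msg[64];
    uint32_t check;       /* toy additive checksum; use a real CRC */
} panic_record_t;

static uint32_t record_sum(const panic_record_t *r) {
    uint32_t s = r->magic + r->len;
    for (uint32_t i = 0; i < r->len && i < sizeof r->msg; i++)
        s += (uint8_t)r->msg[i];
    return s;
}

/* Called from the panic path: do as little as possible, then reset. */
static void panic_store(panic_record_t *r, const char *msg) {
    r->len = (uint32_t)strlen(msg);
    if (r->len > sizeof r->msg) r->len = sizeof r->msg;
    memcpy(r->msg, msg, r->len);
    r->magic = PANIC_MAGIC;
    r->check = record_sum(r);
}

/* Called early on boot: returns 1 if a valid record was found, and
 * wipes the magic/checksum so it is only ever reported once. */
static int panic_retrieve(panic_record_t *r, char *out, size_t outlen) {
    if (r->magic != PANIC_MAGIC || r->check != record_sum(r))
        return 0;
    size_t n = r->len < outlen - 1 ? r->len : outlen - 1;
    memcpy(out, r->msg, n);
    out[n] = '\0';
    r->magic = 0;          /* clear on read to avoid false positives */
    r->check = 0;
    return 1;
}
```

On a cold boot, uninitialized RAM is effectively random, so both the magic word and the checksum have to match before we trust the message.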
By the way, I wrote a panic handler in Rust to manage this for cortex-m devices, it's called panic-persist. You can see it here: https://docs.rs/panic-persist
It stores the panic message to a reserved RAM section, so you can retrieve it after boot and send over a serial port or such.
But why are we doing this all AFTER the reboot? If we could set a flag, why can't we just log the error when it occurs?
Well, for some causes, like a power outage, we may not get a chance to respond! And if we overflow the stack, we might not be able to actually do anything.
So instead, we try to do as little as possible in the "bad" times, and instead push that to the next reboot, which is likely the next time our system is in a "reasonable" condition.
At this point, we can decide whether to print to UART, send over ETH, or write to flash.
Especially in the case of writing to flash, this is something that can take a comparatively long time! If we're relying on "reboot quickly to get to a safe state", then we don't want to be waiting for 10s or 100s of milliseconds to reboot! We want to go now!
I also talked a lot about clearing the message when we're done. This is also important to avoid any false positives, such as a "watchdog reset flag" that stays set even though we rebooted intentionally for some other reason.
Faults are another area where we may not be able to always see all information after a reboot unless we store it.
Certain faults, like an invalid memory access, division by zero, or secure zone violation can all be a sign of a software bug.
These are signs of a potentially serious software defect! We can register handlers for faults like a HardFault, and use that handler to write the fault status registers to RAM, so we can retrieve them after our next boot.
We can also grab even more information, like the contents of our stack pointer, link register, or even look at unwinding the stack to figure out how we got to the unfortunate place that we are right now.
This can be invaluable info when hunting down a heisenbug!
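On Cortex-M, the hardware pushes a known exception frame (r0-r3, r12, lr, the return address, and xPSR, in that order) onto the active stack on fault entry. A real HardFault handler needs a few lines of assembly to pick the right stack pointer (MSP or PSP) and pass it along; the decoding sketch below just pulls out the two most useful values to stash in noinit RAM:

```c
#include <stdint.h>

/* Word offsets into the frame Cortex-M hardware stacks on exception
 * entry. See the ARMv7-M architecture manual for the exact layout. */
enum { FR_R0, FR_R1, FR_R2, FR_R3, FR_R12, FR_LR, FR_PC, FR_XPSR };

typedef struct {
    uint32_t fault_pc;   /* instruction that faulted */
    uint32_t fault_lr;   /* hint at how we got there */
} fault_snapshot_t;

/* `stacked` is the stack pointer value at fault entry, which a real
 * handler would recover in assembly before calling this. */
static void snapshot_fault(const uint32_t *stacked, fault_snapshot_t *out) {
    out->fault_pc = stacked[FR_PC];
    out->fault_lr = stacked[FR_LR];
}
```

After the next boot, you can feed the saved `fault_pc` to `addr2line` against your ELF file to find the exact line of code that faulted.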
In the future, we'll talk way more about "post mortem" debugging, as well as having persistent logs to notice long term patterns or trends, like the system always rebooting unexpectedly exactly every 49.71 days.
We could choose to start our logging or diagnostics anywhere, but let's be honest: everybody boots. It's a good chance to gather more information in a seemingly benign situation, and to set us up with the ability to grab important info we may need for critical debugging later.