Launch HN: Memfault (YC W19) – Crashlytics for Firmware
134 points by fra on Aug 26, 2019 | 54 comments
Hi everyone!

We're Chris, François, and Tyler, founders of Memfault (https://memfault.com). Memfault helps firmware teams find and fix issues before customers start calling (or worse, tweeting!) by providing a small <3kB SDK to include in the firmware and a web dashboard to manage releases, monitor devices, and view crashes. In the software world, Crashlytics, Sentry, and other error monitoring systems have been offering similar solutions for years. Memfault is the first such solution for firmware.

Embedded devices today are very different from ones built 10 years ago. Then, a device would run a small piece of firmware in a while() loop, capture input, compute some logic, write to a small 7-segment display, and that was about it.

Today, new products have a wireless connection to the internet, a bright 320x320 full color LCD, a high quality microphone and speaker for Alexa integration, and sometimes even run machine learning or computer vision algorithms on device! Building hardware products in 2019 is a significant software project, and it requires software tools.

The three of us met at Pebble in 2013, where we shipped 4 watches together. Chris and Tyler went on to work at Fitbit, while François went to Oculus. Each time, we found ourselves building all of our tools from scratch, which slowed us down tremendously. Imagine having to build a log collection solution every time you want to build a new web app!

As a result of the effort required to build them, the tools available to firmware engineers are not up to the task. For example, the state of the art in debugging requires connecting a physical debugger to your board. To investigate an error report from the field, customers must be contacted, devices shipped back, and enclosures disassembled. By the time this is all done, flash logs have rolled over, variables have reset, and developers are left scraping together raw data from flash to debug the issue. It can take weeks to get to the bottom of an issue that would be root caused in minutes with reasonable tools.

We've long wanted to show people what Memfault can do without the hurdle of integrating our SDK into their code. Today, we are launching a zero code, try-it-at-your-desk version of our tool available at https://memfault.com (click on the "Try Memfault" button). In about 5 minutes, you should be able to connect an ARM Cortex-M based development board and upload an error report using a GDB script. If you do not have a board, you'll be able to interact with an example error report.

We could go on at length about the implementation (ask us questions in the comments!). One thing we're especially proud of is the "Globals & Statics" tab, which lets you query the state of any static or global variable in your system. To get this to work, we cross compiled libdwarf to wasm via emscripten and used it to implement parts of an in-browser debugger which can look up the value of a known symbol given an ELF file and a Memfault core file.
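
To make the "Globals & Statics" lookup concrete, here is a minimal sketch of the idea in Python. It is not our libdwarf/wasm implementation: the `Symbol` and `MemoryRegion` types, the variable names, and the addresses are all hypothetical, and a `struct` format string stands in for the type information a real debugger would read from DWARF.

```python
import struct
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    address: int
    fmt: str  # struct format string, standing in for DWARF type info


@dataclass
class MemoryRegion:
    start: int   # address of the first captured byte
    data: bytes  # raw bytes captured from the device


def read_symbol(symbol, regions):
    """Decode a symbol's value from captured memory, or None if not captured."""
    size = struct.calcsize(symbol.fmt)
    for region in regions:
        offset = symbol.address - region.start
        if offset >= 0 and offset + size <= len(region.data):
            return struct.unpack_from(symbol.fmt, region.data, offset)[0]
    return None


# Hypothetical symbol table (would come from the ELF file).
symbols = {
    "g_boot_count": Symbol("g_boot_count", 0x20000010, "<I"),
    "g_battery_mv": Symbol("g_battery_mv", 0x20000014, "<H"),
}

# Hypothetical RAM capture (would come from the core file).
ram = MemoryRegion(start=0x20000000, data=bytes(16) + struct.pack("<IH", 7, 3912))

boot_count = read_symbol(symbols["g_boot_count"], [ram])
battery_mv = read_symbol(symbols["g_battery_mv"], [ram])
```

The lookup itself is just address arithmetic against the captured regions; the hard part in practice is decoding arbitrary C types, which is what the debug information is for.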

We'd love to hear what you think, and find out what other tools you've found helpful in this space. Looking forward to the discussion!




1. Any concerns about privacy? Even if Memfault is one-way (as you mentioned in a different comment), that doesn't mean that important user information is not exposed. Battery SOC and last-seen stats aren't completely harmless.

2. Maybe this will be clearer when you release docs on the SDK - do you provide interfaces for normal logging in addition to just crash logging? Ideally, firmware applications should never crash, but unexpected logic states or invalid user input happen all the time.

3. How are you expecting licensing to work? Per device? Monthly subscription fee? Flat fee software purchase?

4. Are your libraries ASIL or FDA certified to allow use in the automotive or medical industries? What are the reliability/safety implications of wrapping your main binary in Memfault's monitoring interface?


> Any concerns about privacy? Even if Memfault is one-way (as you mentioned in a different comment), that doesn't mean that important user information is not exposed. Battery SOC and last-seen stats aren't completely harmless.

Yes - some of the data is sensitive. We encrypt the data, use an aggressive expiry policy (2 weeks by default), and work with our customers to limit PII. Memfault does not know who the end user of the device is.

> Maybe this will be clearer when you release docs on the SDK - do you provide interfaces for normal logging in addition to just crash logging? Ideally, firmware applications should never crash, but unexpected logic states or invalid user input happen all the time.

Currently, we provide APIs for data logging ('telemetry') and error logging. Note that errors do not have to be crashes. You can send Memfault a trace for user defined issues (e.g. "bluetooth failed to connect") or even no issue at all.

> How are you expecting licensing to work? Per device? Monthly subscription fee? Flat fee software purchase?

It's a monthly subscription fee (not per device).

> Are your libraries ASIL or FDA certified to allow use in the automotive or medical industries? What are the reliability/safety implications of wrapping your main binary in Memfault's monitoring interface?

We are not currently certified, but this is something we know we'll have to do. Our error reporting only runs when an error is encountered, not during normal operation. Our telemetry collection can run on a timer, and a bug in our code there could impact your device.


Hi! I write firmware professionally and this looks pretty amazing. I've already signed up to play with the tool and submitted a demo request :)

That being said, the thing I'm most interested in here is how to integrate Memfault with my codebase, and that's the only thing I can't figure out! Your docs pages are quite pretty, but don't include the interesting bits! Clicking through the demo doesn't really help.

Any chance you'd consider publishing that to the site?


Thanks for the kind words, and looking forward to the demo! You are right, our documentation leaves a lot to be desired. We are working on it!

You can find some of the more interesting bits at https://github.com/memfault/memfault-firmware-sdk, which is our public facing firmware SDK. This gives a rough idea of the steps necessary to implement the coredump features of Memfault.


Great, thanks!

(Before digging into the SDK to see if this exists), is there any chance you'd support some form of "custom transport"? In the system I'm working on the micro is only connected to the network via a single board computer through which I'd need to shim the Memfault connection.


Yes, we can accommodate a custom transport. In a way, this is just a general application of what we do for the BLE transport: break up data structures into fixed sized packets, send them over the link, reassemble on the other side.
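
As a rough illustration of that approach (a sketch under assumed parameters, not our SDK's wire format): the 20-byte packet size, the header layout, and the function names below are all made up for the example.

```python
import struct

PACKET_SIZE = 20                 # assumed link payload size
HEADER = struct.Struct("<HHB")   # packet index, packet count, valid payload bytes
PAYLOAD_SIZE = PACKET_SIZE - HEADER.size


def packetize(blob):
    """Split a blob into fixed-size packets, zero-padding the last one."""
    count = max(1, -(-len(blob) // PAYLOAD_SIZE))  # ceiling division
    packets = []
    for index in range(count):
        payload = blob[index * PAYLOAD_SIZE:(index + 1) * PAYLOAD_SIZE]
        header = HEADER.pack(index, count, len(payload))
        packets.append(header + payload.ljust(PAYLOAD_SIZE, b"\x00"))
    return packets


def reassemble(packets):
    """Rebuild the blob from packets in any order; None if any are missing."""
    parts = {}
    count = None
    for packet in packets:
        index, count, length = HEADER.unpack_from(packet)
        parts[index] = packet[HEADER.size:HEADER.size + length]
    if count is None or len(parts) != count:
        return None
    return b"".join(parts[i] for i in range(count))


blob = bytes(range(40))
packets = packetize(blob)
rebuilt = reassemble(list(reversed(packets)))
```

Because each packet carries its index and the total count, the receiving side can tolerate reordering and can tell when something was dropped. A real link would also want a checksum and retransmission.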


Congratulations! We rebuilt our diagnostics and firmware updating tools multiple times over the years for Lockitron, it was a massive pain each time.

How do you handle log caching and retrieval for offline devices (e.g. Bluetooth)?


We have SDKs for iOS, Android, and other gateway devices to push logs up to the cloud over Bluetooth.


This looks great, and like it solves a big, real problem people have. Congrats on doing this!

How do you deal with security? IoT devices are infamous and having one with a debugger open to the world terrifies me.


Currently, Memfault is one-way so it is not quite like having JTAG access to your device from the cloud.

But it still needs to be secure, and we typically encrypt all data going from the device to the cloud (some devices, sadly, do not have the ability to do the encryption).

Edit: removed double negative.


How can it be one-way if you can push new releases to the device?


Rather than "pushing" releases and data to each device, the devices query for the URL to the latest firmware (if any).


This still isn't one-way. What protocol parsers are you implementing in firmware to do this?


It's up to the firmware or customer architecture to decide that. Many companies in the industry use an S3 bucket to publish firmware binaries to their devices, and these binaries are read by hubs, mobile applications, connected linux boxes, and yes, sometimes firmware devices themselves. Memfault provides a couple of layers on top of S3, allowing the customer to group devices into cohorts and do staged roll-outs.
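
A minimal sketch of what a pull-based check with cohorts and staged roll-outs could look like on the server side. This is illustrative only: the `Release` type, the function names, the cohort names, and the example.com URLs are assumptions, not our API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Release:
    version: str
    url: str
    rollout_percent: int  # 0..100, share of the cohort that should see it


def rollout_bucket(device_id):
    """Map a device ID to a stable bucket in 0..99."""
    digest = hashlib.sha256(device_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def latest_release_url(device_id, current_version, cohort, releases_by_cohort):
    """Answer a device's 'is there new firmware for me?' query."""
    release = releases_by_cohort.get(cohort)
    if release is None or release.version == current_version:
        return None  # unknown cohort, or already up to date
    if rollout_bucket(device_id) >= release.rollout_percent:
        return None  # device is outside the current rollout stage
    return release.url


releases = {
    "beta": Release("1.2.0", "https://example.com/fw/1.2.0.bin", 100),
    "stable": Release("1.2.0", "https://example.com/fw/1.2.0.bin", 0),
}

url_beta = latest_release_url("device-001", "1.1.0", "beta", releases)
url_stable = latest_release_url("device-001", "1.1.0", "stable", releases)
```

Hashing the device ID keeps each device in the same bucket across polls, so raising `rollout_percent` only ever adds devices to the rollout.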


Add experimentation, canarying, rollbacks and monitoring (any metrics relevant) and you have a winner.


Hah! Thanks for the suggestions. We're determined to do all of those over time.


It is an area I am somewhat working on and very experienced in, and would be happy to provide anecdotes and ideas any time.

myhnusername@gmail.com


Edit: emailing you now.

That would be great! Send me an email: francois at memfault. I'd be thrilled to chat (or grab a coffee if you're in the bay).


What data warehouse or other analytics system are you storing the data in?


We took a page from Heap's book and ultimately store the data in Postgres databases.


That's probably a good idea to get to a usable product. You may want to investigate a proper data warehouse if your workload primarily consists of large scans and aggregations, such as if you offer a user-facing dashboard which can generate arbitrary queries.

Does your data have a fixed structure, or can customers send essentially whatever they want and you have to deal with it by e.g. storing a JSON blob in each event?


Do you have recommendations for data warehousing? Our data does have a fixed structure at the moment.


BigQuery and Snowflake are the two managed services I'd recommend today if you'd like good performance and cost-effective storage. They both separate compute from storage so that your cold data isn't sitting on expensive SSDs like your Postgres instances are probably using.

They're both also significantly faster than Postgres at large scans and aggregations.

Snowflake is the most interesting to me because they offer a semi-structured data type called VARIANT which efficiently encodes semi-structured data in a column-wise format while losing only a tiny bit of performance compared to a fixed schema. This could let your customers send semi-structured or variable size data (like arrays or maps with arbitrary keys) and still keep your dashboards fast.

If you'd like to chat more, I just requested to connect with you on LinkedIn.


BigQuery is /terrific/, for in-house analytics. It would very likely not be appropriate for backing a SaaS, at $5 per terabyte scanned.

I would suggest the OP is just fine with Postgres for a while. They can shard it when needed.

Then eventually they can either get more sophisticated with Postgres sharding, or move to something like TiDB, clickhouse, or another event store.


Ha, I've been waiting to see this pop up on HN after seeing all the blog posts on /r/embedded.

Best of luck with the launch!


Thanks! Hope you've enjoyed reading Interrupt, let me know if there's a topic you'd like us to write about.


The bootloader and linker script pieces were both quite good, I thought.

Would love one on bootloader/firmware updates.


One of those "obvious in retrospect" ideas. The IoT startup I worked at probably would have used this.


I did browse the site, but couldn't find exactly how telemetry/faults get passed from the microcontroller to their backend.

For the last products I have touched, this would probably be the toughest part: abstracting/reimplementing whatever mechanism the device is already using to communicate with something that may have internet access (USB, UART, Bluetooth, LoRa) and tying in that end (mobile/desktop/connected device).


The connectivity story is indeed one of the major complications.

Here's a high level overview of how we deal with it: let's say you have a device connected via UART to a Linux box with WiFi.

1. When an error occurs, the Memfault library collects all the needed information and saves a packed error_report_t in a circular buffer in non-volatile storage (say, flash).

2. When connectivity is available, your code calls our SDK and says "hey, can you give us N bytes of data to send". If Memfault has data in the circular buffer, it returns a chunk of N bytes. Otherwise it returns false.

3. You use your transport to send the N-byte packet to the Linux system.

4. Your code on the Linux box calls the Memfault Gateway SDK to tell it you've received N bytes of Memfault data.

5. The Memfault Gateway SDK recombines packets into error_report_t and HTTP POST-s them to our backend.

Does that make sense? Happy to talk about it in more detail.
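
The five steps above can be sketched roughly as follows. This is a simplified model, not the actual SDK: `ReportStore`, `Gateway`, and the 4-byte length prefix are assumptions for illustration, a plain byte buffer stands in for the flash circular buffer, and a list stands in for the HTTP POST.

```python
import struct


class ReportStore:
    """Device side: packed reports waiting to be sent (stand-in for flash)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = bytearray()

    def save(self, report):
        framed = struct.pack("<I", len(report)) + report  # length-prefix framing
        if len(self.buffer) + len(framed) > self.capacity:
            return False  # full; a real circular buffer would evict old data
        self.buffer += framed
        return True

    def get_chunk(self, n):
        """Return up to n bytes to send, or False if nothing is pending."""
        if not self.buffer:
            return False
        chunk = bytes(self.buffer[:n])
        del self.buffer[:n]
        return chunk


class Gateway:
    """Linux side: recombines chunks into whole reports."""

    def __init__(self):
        self.stream = bytearray()
        self.uploaded = []  # stand-in for HTTP POST to the backend

    def receive(self, chunk):
        self.stream += chunk
        while len(self.stream) >= 4:
            (length,) = struct.unpack_from("<I", self.stream)
            if len(self.stream) < 4 + length:
                break  # report not complete yet
            self.uploaded.append(bytes(self.stream[4:4 + length]))
            del self.stream[:4 + length]


store = ReportStore(capacity=256)
store.save(b"hardfault pc=0x08001234")
store.save(b"assert ble.c:42")

gateway = Gateway()
while True:
    chunk = store.get_chunk(7)   # N = 7 bytes per UART transfer
    if chunk is False:
        break
    gateway.receive(chunk)       # your transport goes here
```

The chunk size is chosen by your code, so the same loop works whether the link carries 7 bytes at a time or 4 kB.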


Makes perfect sense. I would assume you also have to re-implement the routines writing the registers/memory to the non-volatile device, as you can't rely on peripheral registers being consistent. Ideally the host should have independent access to the said volatile memory, but that's getting close to implementing a debugger on host which uses JTAG/SWD to inspect the state of MCU after a crash.


Any plans to support x86 CPUs?


No plans for x86 at this time. What application do you have in mind, and what OS would it be running? In our experience even embedded linux projects run on ARM these days.


Probably not your main target audience but I'm thinking of old industrial automation systems that run Windows for the UI together with a real-time software PLC (Siemens or something similar). Hard to find good tools for debugging them.


It would be interesting to chat with someone working on that set up to figure out what an implementation could look like.

I wonder if solutions for windows desktop would work in that case?


See https://backtrace.io (disclaimer: previous employee).


I'm curious as well why you'd be using x86 processors in embedded devices. Care to share?


(Also not the parent) I've seen this in robotics; a lot of code in that space has been tested for a long time on x86 computers [^1], and only more recently been looked at on ARM.

I think this is also common in industrial embedded systems, since I periodically see ads to buy hardware for them [^2]. I'm not entirely sure why :).

[^1]: http://wiki.ros.org/Robots/TurtleBot/Robot%20Setup
[^2]: https://www.logicsupply.com/computers/nuc/


Not the parent, but they can be nice for com express modules. Upgrade the SOM board without having to do an I/O board spin.


> In the software world, Crashlytics, Sentry, and other error monitoring systems have been offering similar solutions for years. Memfault is the first such solution for firmware.

Just curious, why can't Sentry be used in firmware? (I don't do firmware dev)


There are a few concepts that do not map neatly:

1. For firmware, each user is on their own hardware. Rather than a session you need to track a device and the state thereof. Devices exist for a longer period of time than sessions do, and you need to have a concept of "device history".

2. Sentry assumes backtraces can be generated on the client side, which is impractical for firmware.

3. Our focus on embedded allows us to run some more specific analysis. For example we automatically detect if your MPU (memory protection unit) is misconfigured on an ARM chip.
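
To give a flavor of the kind of check point 3 refers to, here is a small sketch of one class of MPU misconfiguration on ARMv7-M: a region whose base address is not aligned to its size. It is an illustration, not our analyzer; the function name is hypothetical, and the inputs are the raw MPU_RBAR and MPU_RASR register values for a single region.

```python
def check_mpu_region(rbar, rasr):
    """Return a list of problems found in one ARMv7-M MPU region."""
    problems = []
    if not rasr & 0x1:              # MPU_RASR.ENABLE is bit 0
        return problems             # disabled regions are not checked
    size_field = (rasr >> 1) & 0x1F  # MPU_RASR.SIZE is bits [5:1]
    if size_field < 4:
        problems.append("SIZE field below 4: regions must be at least 32 bytes")
        return problems
    region_bytes = 1 << (size_field + 1)  # region size is 2^(SIZE+1) bytes
    base = rbar & 0xFFFFFFE0              # MPU_RBAR.ADDR is bits [31:5]
    if base % region_bytes:
        problems.append(
            f"base 0x{base:08X} is not aligned to region size {region_bytes} bytes"
        )
    return problems


# 1 kB region (SIZE = 9) on a 1 kB boundary: no problems.
aligned = check_mpu_region(0x20000000, (9 << 1) | 1)

# Same size, but the base is only 256-byte aligned.
misaligned = check_mpu_region(0x20000100, (9 << 1) | 1)
```

With a register dump in hand, checks like this can run on the server after the fact, with no extra code on the device.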


I built a translator service between my devices and Sentry to take the memory dump and a known symbol file to rebuild the context serverside and send it back to Sentry. I only used this during development and early testing, but it was super useful.

At one point I thought about building this out more, so I'm glad to see someone is taking this on more seriously.


We are taking it very seriously ;)

Happy to hear that you thought to build this translation service even for early development! It's usually an afterthought at the companies we've talked to, and an expensive one too.


> 2. Sentry assumes backtraces can be generated on the client side, which is impractical for firmware

If you can produce a minidump you can send it to Sentry and have it stackwalked on the server. Breakpad/Crashpad can produce such dumps for you. We do not have coredumps yet but that could be added if there is demand.


Yeah, minidump and breakpad/crashpad can be great when running native code on an OS!

One challenge with embedded devices (e.g. ARM Cortex-Ms) is that it is not always safe or viable to try to grab the thread contexts at the time of crash. These MCUs have no or very limited support for memory management (an MPU at best), the OS/application state usually lives in the same RAM area, and the stack dealing with exception handling is usually quite small. For these reasons, it's usually desirable to grab a very simple memory dump and offload as much of the processing as possible to the server. The "core dump" format that is used for Linux is saved as an ELF with a few special PT_NOTE sections to convey thread information. This would be hard to generate on most embedded devices. The coredump we collect for embedded devices is closer to a mini-dump. It is basically just a raw memory dump and the current register state on the device. On the server side, we recover the thread contexts and backtraces by using the debug symbols, the RAM capture, and analyzers for the RTOS that was used (e.g. FreeRTOS, Zephyr, ThreadX, Mbed OS).
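
As a sketch of how simple such a capture can be (this is a hypothetical layout, not our actual coredump format): a fixed header holding the register state, followed by one raw RAM region. The magic value, field order, and function names are assumptions for the example.

```python
import struct

MAGIC = b"DUMP"  # hypothetical marker, not a real format identifier
REGISTER_NAMES = tuple(f"r{i}" for i in range(13)) + ("sp", "lr", "pc", "psr")
HEADER = struct.Struct("<4s17III")  # magic, 17 registers, RAM start, RAM length


def pack_coredump(registers, ram_start, ram):
    """Device side: registers plus a raw RAM copy, no processing."""
    values = [registers[name] for name in REGISTER_NAMES]
    return HEADER.pack(MAGIC, *values, ram_start, len(ram)) + ram


def parse_coredump(blob):
    """Server side: recover the registers and the RAM region."""
    fields = HEADER.unpack_from(blob)
    magic, values, ram_start, ram_len = fields[0], fields[1:18], fields[18], fields[19]
    if magic != MAGIC:
        raise ValueError("not a coredump")
    ram = blob[HEADER.size:HEADER.size + ram_len]
    if len(ram) != ram_len:
        raise ValueError("truncated coredump")
    return dict(zip(REGISTER_NAMES, values)), ram_start, ram


registers = {name: index for index, name in enumerate(REGISTER_NAMES)}
registers["pc"] = 0x08001234
blob = pack_coredump(registers, 0x20000000, b"\xAA" * 32)
parsed_registers, parsed_start, parsed_ram = parse_coredump(blob)
```

Everything that needs symbols or RTOS knowledge (thread lists, unwinding, variable decoding) happens after `parse_coredump`, off the device.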


We have customers using us for firmware errors :)


Yes, I was excited to meet some of them! It is certainly possible to use Sentry with firmware, but I'd venture to say that Memfault is an easier integration and a better experience.


Makes sense. Thanks!


Do you have any plans to support embedded Linux IoT devices? Or are you strictly focusing on tiny devices running without an OS or with an RTOS?


We plan to support embedded Linux in the future.


Any plans on adding Qualcomm (Bluetooth) chips?


Do you mean the CSR family? We don't currently support the XAP architecture, but it is technically doable and we would do it with the right partner. Get in touch if you want to chat!


Are you hiring?


We've just about hired everybody we need for the time being, but we're open to meeting folks who would bring a lot to the team. Send me a note! francois - at - memfault



