Using AI to scan datasheets and help generate driver code in Cyanobyte

Nick Felker
10 min read · Jun 21, 2024


The opinions stated here are my own, not those of my company

Using LLMs to read electronic datasheets

Several years ago I worked on a project called Cyanobyte. Its goal is to define a standardized, machine-readable format for I2C peripherals when interfacing with microcontrollers, so that integration code can be automatically generated for any software platform.

To demo this project, I put together a handful of peripherals (Todo link). Writing out each file in the YAML format took some time, since I had to read through hundred-page PDFs to find the relevant information. This manual process is inherently prone to mistakes.

With the prevalence of large language models today, I was curious about the ways they could be used as a companion to electrical engineers in their integration work. As Cyanobyte already handles the code generation step in a deterministic way, I turned my attention toward ways to generate these peripheral files.

PDF OCR

My first approach was to do something akin to the way I process government minutes, using OCR on the PDF and sending that context into the language model.

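import * as fs from 'fs'
import pdf from 'pdf-parse' // Assumption: numpages/info/metadata match the pdf-parse result

console.log('Begin processing')
const pdfLocal = fs.readFileSync('dl/bst-bmp280-ds001.pdf')
const pdfInfo = await pdf(pdfLocal)
console.log(pdfInfo.numpages, pdfInfo.info, pdfInfo.metadata)
// Accumulates the merged peripheral definition across all page batches
const cyanobyteDefinition: any = {}
// Step through the PDF in batches of 15 pages
for (let i = 0; i < pdfInfo.numpages; i += 15) {
  console.log('Pages starting at', i)
  // convertPdfToText is the OCR helper (not shown); i is the first page of the batch
  const ocrText = await convertPdfToText(pdfLocal, i)
  fs.writeFileSync(`reports/bmp280-ocr-${i}.md`, ocrText)
}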

As the OCR API only accepts 15 pages max, I paginate the PDF before processing.

const geminiPrompt = `You are an AI optimized for electrical engineering and embedded software development.
Scan the following OCR PDF output to identify important values and return them in JSON. If you do not find
the value, do not make something up. Just ignore the field.

Here are the fields:
title: string
description: string # Keep this simple and concise
i2c:
  addressType: '7-bit' | '10-bit'
  address: string
  addressMask: string
registers: # Top Level. A map. Only represents single physical registers.
  - [RegisterName]:
      title: string
      description: string
      address: string
      length: number # Length of the register in bits
      signed: boolean # Whether the integer response should be signed when read
      readWrite: 'R' | 'R/W' | 'W' | 'n'

fields: # Top-Level. A map. Only appropriate if a single physical register contains smaller components.
  - [FieldName]: # A subcomponent of a larger physical register
      title: string
      description: string
      register: string # The name of the Register defined above
      readWrite: 'R' | 'R/W' | 'W' | 'n'
      bitStart: integer
      bitEnd: integer
      type: 'enum' | 'number'
      enum: # If an enum, this should be an array of titles and values
        - title: string
          value: string
`;

// ...

const geminiRender = await prompt(`${geminiPrompt}\n\n${ocrText}`)
const output = geminiRender.candidates![0].content.parts
.map(p => p.text)
.join(' ')
.replace('```json', '')
.replace('```', '')

My LLM prompt includes the OCR response but also the JSON schema representing the peripheral spec. I removed a few parts and rewrote it to better fit in the prompt.

From here, I am able to run JSON.parse on the output and get back the fields. However, it’s not really that easy. Most of the time it returns JSON within Markdown ticks. Maybe 10% of the time the output is malformed. So I need to catch this and throw it out.
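As a rough sketch of that catch-and-discard step (not the code from this project), a small helper can pull the JSON out of an optional Markdown fence and return null when parsing fails. The name parseModelJson is hypothetical, and the helper assumes the response contains at most one fenced block.

```typescript
// Three backticks, built at runtime so this snippet can itself sit inside a Markdown fence.
const FENCE = '`'.repeat(3)

/**
 * Hypothetical helper: extracts the JSON payload from a model response.
 * Returns null instead of throwing when the payload is malformed.
 */
function parseModelJson(raw: string): unknown {
  // Prefer the contents of the first fenced block, with or without a "json" tag.
  const fenced = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE).exec(raw)
  const candidate = (fenced ? fenced[1] : raw).trim()
  try {
    return JSON.parse(candidate)
  } catch {
    // Malformed output: signal the caller to skip this batch of pages.
    return null
  }
}

const wrapped = FENCE + 'json\n{"title": "BMP280"}\n' + FENCE
console.log(parseModelJson(wrapped))      // the parsed object
console.log(parseModelJson('{"title": ')) // null
```

Returning null rather than throwing lets the page loop decide whether to skip the batch or retry the prompt.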

When I run this code, it reads the first fifteen pages and generates some parts like the title and description. And then the next fifteen pages it also does that. And so on. This means you could end up with multiple fields all called title, and it’s hard to say in an automated way which one is correct.

To get around this, I put together a broader system that handles this ambiguity by keeping the first value and storing each later candidate under the same field name with a suffix containing the page offset (for example, title_ex15). This means that the engineer will need to do a pass-through at the end and clean up the resulting file, which is good practice either way.

function upsertGeminiField(geminiJson: any, cyanobyteDefinition: any, i: number, field: string) {
  const value = geminiJson[field]
  // Nothing to merge if the model left this field out for these pages
  if (value === undefined) return
  if (!cyanobyteDefinition[field]) {
    // First value seen: store it under the plain field name
    cyanobyteDefinition[field] = value
  } else {
    // Later candidates are kept alongside it, suffixed with the page offset
    cyanobyteDefinition[`${field}_ex${i}`] = value
  }
}

function upsertGeminiSubfield(geminiJson: any, cyanobyteDefinition: any, i: number, field: string, subfield: string) {
  // Optional chaining avoids a TypeError when the whole parent object is missing
  const value = geminiJson[field]?.[subfield]
  if (value === undefined) return
  if (!cyanobyteDefinition[field]) {
    cyanobyteDefinition[field] = {}
  }
  if (!cyanobyteDefinition[field][subfield]) {
    cyanobyteDefinition[field][subfield] = value
  } else {
    cyanobyteDefinition[field][`${subfield}_ex${i}`] = value
  }
}

function upsertGeminiArray(geminiJson: any, cyanobyteDefinition: any, field: string) {
  const values = geminiJson[field]
  // Skip missing or non-array values rather than throwing on the spread below
  if (!Array.isArray(values)) return
  if (!cyanobyteDefinition[field]) {
    cyanobyteDefinition[field] = []
  }
  cyanobyteDefinition[field].push(...values)
}

function upsertGeminiMap(geminiJson: any, cyanobyteDefinition: any, i: number, field: string) {
  const entries = geminiJson[field]
  if (!entries || typeof entries !== 'object') return
  if (!cyanobyteDefinition[field]) {
    cyanobyteDefinition[field] = {}
  }
  Object.entries(entries).forEach(([key, value]) => {
    if (cyanobyteDefinition[field][key]) {
      // Key already defined by an earlier batch: keep both for manual review
      cyanobyteDefinition[field][`${key}_ex${i}`] = value
    } else {
      cyanobyteDefinition[field][key] = value
    }
  })
}

try {
  const geminiJson = JSON.parse(output)
  upsertGeminiField(geminiJson, cyanobyteDefinition, i, 'title')
  upsertGeminiField(geminiJson, cyanobyteDefinition, i, 'description')
  upsertGeminiSubfield(geminiJson, cyanobyteDefinition, i, 'i2c', 'address')
  upsertGeminiSubfield(geminiJson, cyanobyteDefinition, i, 'i2c', 'addressMask')
  upsertGeminiSubfield(geminiJson, cyanobyteDefinition, i, 'i2c', 'addressType')
  upsertGeminiMap(geminiJson, cyanobyteDefinition, i, 'registers')
  upsertGeminiMap(geminiJson, cyanobyteDefinition, i, 'fields')
} catch (e) {
  console.error('Could not parse output at', i)
}

Results

With all of that code written out, I just need to generate this large object and write it to a YAML file to finish it:

const yamlOut = YAML.stringify(cyanobyteDefinition)

fs.writeFileSync('reports/bmp280.yaml', yamlOut)

I used the pressure/temperature sensor BMP280 as the input, which is known for being relatively straightforward as an I2C sensor but also having a lot of nuances.

When I ran it, this is what it returned from the first fifteen pages:

{
"title": "BMP280",
"description": "Digital Pressure Sensor",
"i2c": {
"addressType": "7-bit",
"address": "0x77",
"addressMask": "0x76"
},
"registers": {
"D0": {
"title": "ID register",
"description": "Version number",
"address": "0xD0",
"length": 1,
"signed": false,
"readWrite": "R"
},
"E0": {
"title": "Reset register",
"description": "Reset to factory settings",
"address": "0xE0",
"length": 1,
"signed": false,
"readWrite": "W"
},
"F3": {
"title": "Status register",
"description": "Status indicators",
"address": "0xF3",
"length": 1,
"signed": false,
"readWrite": "R"
},
"F4": {
"title": "Ctrl meas register",
"description": "Measurement configuration",
"address": "0xF4",
"length": 1,
"signed": false,
"readWrite": "RW"
},
"F5": {
"title": "Ctrl humid register",
"description": "Humidity configuration",
"address": "0xF5",
"length": 1,
"signed": false,
"readWrite": "RW"
},
"F7": {
"title": "Pressure data (MSB)",
"description": "Pressure data",
"address": "0xF7",
"length": 1,
"signed": false,
"readWrite": "R"
},
"F8": {
"title": "Pressure data (LSB)",
"description": "Pressure data",
"address": "0xF8",
"length": 1,
"signed": false,
"readWrite": "R"
},
"F9": {
"title": "Pressure data (XLSB)",
"description": "Pressure data",
"address": "0xF9",
"length": 1,
"signed": false,
"readWrite": "R"
},
"FA": {
"title": "Temperature data (MSB)",
"description": "Temperature data",
"address": "0xFA",
"length": 1,
"signed": false,
"readWrite": "R"
},
"FB": {
"title": "Temperature data (LSB)",
"description": "Temperature data",
"address": "0xFB",
"length": 1,
"signed": false,
"readWrite": "R"
},
"FC": {
"title": "Temperature data (XLSB)",
"description": "Temperature data",
"address": "0xFC",
"length": 1,
"signed": false,
"readWrite": "R"
}
},
"fields": {
"oss": {
"title": "Oversampling mode",
"description": "Controls the pressure and temperature oversampling",
"register": "F4",
"readWrite": "RW",
"bitStart": 0,
"bitEnd": 4,
"type": "enum",
"enum": [
{
"title": "Ultra low power",
"value": "000"
},
{
"title": "Low power",
"value": "001"
},
{
"title": "Standard resolution",
"value": "010"
},
{
"title": "High resolution",
"value": "011"
},
{
"title": "Ultra high resolution",
"value": "100"
}
]
},
"iir": {
"title": "IIR filter configuration",
"description": "Controls the IIR filter",
"register": "F5",
"readWrite": "RW",
"bitStart": 0,
"bitEnd": 2,
"type": "enum",
"enum": [
{
"title": "Filter off",
"value": "000"
},
{
"title": "Filter coeficient 1",
"value": "001"
},
{
"title": "Filter coeficient 2",
"value": "010"
},
{
"title": "Filter coeficient 3",
"value": "011"
},
{
"title": "Filter coeficient 4",
"value": "100"
},
{
"title": "Filter coeficient 5",
"value": "101"
},
{
"title": "Filter coeficient 6",
"value": "110"
},
{
"title": "Filter coeficient 7",
"value": "111"
}
]
},
"spi3": {
"title": "SPI 3-/4-wire selection",
"description": "Sets the SPI mode",
"register": "F5",
"readWrite": "RW",
"bitStart": 3,
"bitEnd": 3,
"type": "enum",
"enum": [
{
"title": "3-wire mode",
"value": "0"
},
{
"title": "4-wire mode",
"value": "1"
}
]
}
}
}

You can see that it does a reasonable job in general. It sticks to the JSON schema. It manages to format everything elegantly. It looks good to the untrained eye. But is it accurate?

Well… mostly. Take register 0xF5, which is the config register (the model labeled it “Ctrl humid register”). Bits 2:0 aren’t meant for the filter; the filter should be bits 4:2 on 0xF5. The enum is correctly applied, but the titles of each coefficient are not described well, and you’d still need to read the datasheet to figure out which ones to use.
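To make the cost of an off-by-two bit range concrete, here is a minimal sketch of pulling a field out of a register byte. The function name extractField and the sample byte are made up for illustration; only the bit positions come from the discussion above.

```typescript
/**
 * Hypothetical helper: returns bits highBit..lowBit (inclusive) of a register value,
 * where bit 0 is the least significant bit.
 */
function extractField(registerValue: number, lowBit: number, highBit: number): number {
  const width = highBit - lowBit + 1
  const mask = (1 << width) - 1
  return (registerValue >> lowBit) & mask
}

// Made-up raw byte standing in for a read of 0xF5; not taken from real hardware.
const config = 0b00010100
console.log(extractField(config, 2, 4)) // 5, reading bits 4:2
console.log(extractField(config, 0, 2)) // 4, reading bits 2:0
```

The same byte decodes to two different values depending on the range, and nothing fails at runtime, so this kind of error in a generated file is easy to miss.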

As I went through each page of generated JSON, I found more examples like that. Things look good but are slightly off. Part of the issue is that datasheets use a lot of graphs and tables to communicate information and I’m removing all that through the OCR. Without that additional semantic data, hallucinations are easy.

When I merge everything together, I end up with a file almost 1000 lines long. A lot of it will need to be reviewed and deleted or merged together.

i2c:
  address: "0x77"
  addressMask: "0x76"
  addressType: "7-bit"
  address_ex15: "111011X1"
  addressMask_ex15: ""
  addressType_ex15: "7-bit"
  address_ex30: "0x77"
  addressMask_ex30: null
  addressType_ex30: "7-bit"
  address_ex45: "0x76"
  addressMask_ex45: "0xfe"
  addressType_ex45: "7-bit"

Here you see that a number of different outputs may be produced, and sometimes they aren’t quite right. So an embedded engineer will still need to remove most of these fields and pick the one that is the most correct.
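One way to make that review pass less tedious is to group every candidate under its base key so the options sit side by side. This is only a sketch: collectCandidates is a hypothetical helper, and it assumes the _exN suffix convention from the merge functions above.

```typescript
/** Hypothetical helper: maps each base key to every candidate value found for it. */
function collectCandidates(merged: Record<string, unknown>): Map<string, unknown[]> {
  const candidates = new Map<string, unknown[]>()
  for (const [key, value] of Object.entries(merged)) {
    // "address_ex15" -> "address"; keys without the suffix are their own base.
    const base = key.replace(/_ex\d+$/, '')
    const list = candidates.get(base) ?? []
    list.push(value)
    candidates.set(base, list)
  }
  return candidates
}

// A subset of the merged i2c block shown above.
const i2c = {
  address: '0x77',
  address_ex15: '111011X1',
  address_ex30: '0x77',
  address_ex45: '0x76',
}
console.log(collectCandidates(i2c).get('address'))
// [ '0x77', '111011X1', '0x77', '0x76' ]
```

The engineer still has to pick the right value, but seeing that two batches agreed on 0x77 is a useful hint about where to look first in the datasheet.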

Gemini Pro

That was just my initial experiment. It took me a while to put together this blog post and in the meantime Google released a new version of Gemini with a much larger context window. Instead of splitting up my PDF and removing the semantics, I could just upload the PDF right into my prompt and try getting an answer from it.

So what happens when I try?

It returns the following response.

{
"cyanobyte": "0.0.1",
"info": {
"title": "BMP280",
"package": "bosch_sensortec",
"description": "Digital Pressure Sensor",
"contact": {
"name": "Bosch Sensortec GmbH",
"url": "[www.bosch-sensortec.com](https://www.bosch-sensortec.com)",
"email": "contact@bosch-sensortec.com"
},
"copyright": {
"name": "Bosch Sensortec GmbH",
"date": "2021"
},
"license": {
"name": "Unknown"
},
"version": "1.26"
},
"i2c": {
"addressType": "7-bit",
"address": [
118,
119
],
"addressMask": 1
},
"spi": {
"clockPolarity": [
"low",
"high"
],
"clockPhase": [
"leading",
"trailing"
],
"frequency": 10000000,
"format": "register"
},
"registers": {
"id": {
"title": "id",
"description": "The \u201cid\u201d register contains the chip identification number chip_id[7:0], which is 0x58. This number can be read as soon as the device finished the power-on-reset.",
"address": 208,
"length": 8,
"signed": false,
"readWrite": "R",
"example": {
"valid": [
88
]
}
},
"reset": {
"title": "reset",
"description": "The \u201creset\u201d register contains the soft reset word reset[7:0]. If the value 0xB6 is written to the register, the device is reset using the complete power-on-reset procedure. Writing other values than 0xB6 has no effect. The readout value is always 0x00.",
"address": 224,
"length": 8,
"signed": false,
"readWrite": "W"
},
"status": {
"title": "status",
"description": "The \u201cstatus\u201d register contains two bits which indicate the status of the device.",
"address": 243,
"length": 8,
"signed": false,
"readWrite": "R"
},
"ctrl_meas": {
"title": "ctrl_meas",
"description": "The \u201cctrl_meas\u201d register sets the data acquisition options of the device.",
"address": 244,
"length": 8,
"signed": false,
"readWrite": "R/W"
},
"config": {
"title": "config",
"description": "The \u201cconfig\u201d register sets the rate, filter and interface options of the device. Writes to the \u201cconfig\u201d register in normal mode may be ignored. In sleep mode writes are not ignored.",
"address": 245,
"length": 8,
"signed": false,
"readWrite": "R/W"
},
"press_msb": {
"title": "press_msb",
"description": "Contains the MSB part up[19:12] of the raw pressure measurement output data.",
"address": 247,
"length": 8,
"signed": false,
"readWrite": "R"
},
"press_lsb": {
"title": "press_lsb",
"description": "Contains the LSB part up[11:4] of the raw pressure measurement output data.",
"address": 248,
"length": 8,
"signed": false,
"readWrite": "R"
},
"press_xlsb": {
"title": "press_xlsb",
"description": "Contains the XLSB part up[3:0] of the raw pressure measurement output data. Contents depend on temperature resolution, see table 5.",
"address": 249,
"length": 8,
"signed": false,
"readWrite": "R"
},
"temp_msb": {
"title": "temp_msb",
"description": "Contains the MSB part ut[19:12] of the raw temperature measurement output data.",
"address": 250,
"length": 8,
"signed": false,
"readWrite": "R"
},
"temp_lsb": {
"title": "temp_lsb",
"description": "Contains the LSB part ut[11:4] of the raw temperature measurement output data.",
"address": 251,
"length": 8,
"signed": false,
"readWrite": "R"
},
"temp_xlsb": {
"title": "temp_xlsb",
"description": "Contains the XLSB part ut[3:0] of the raw temperature measurement output data. Contents depend on pressure resolution, see Table 4.",
"address": 252,
"length": 8,
"signed": false,
"readWrite": "R"
}
},
"fields": {
"measuring": {
"title": "measuring",
"description": "Automatically set to \u20181\u2019 whenever a conversion is running and back to \u20180\u2019 when the results have been transferred to the data registers.",
"register": "status",
"readWrite": "R",
"bitStart": 3,
"bitEnd": 3,
"type": "enum",
"enum": [
{
"title": "Measuring",
"value": 1
},
{
"title": "Not Measuring",
"value": 0
}
]
},
"im_update": {
"title": "im_update",
"description": "Automatically set to \u20181\u2019 when the NVM data are being copied to image registers and back to \u20180\u2019 when the copying is done. The data are copied at power-on-reset and before every conversion.",
"register": "status",
"readWrite": "R",
"bitStart": 0,
"bitEnd": 0,
"type": "enum",
"enum": [
{
"title": "Update in Progress",
"value": 1
},
{
"title": "Update Complete",
"value": 0
}
]
},
"osrs_t": {
"title": "osrs_t",
"description": "Controls oversampling of temperature data. See chapter 3.3.2 for details.",
"register": "ctrl_meas",
"readWrite": "R/W",
"bitStart": 5,
"bitEnd": 7,
"type": "enum",
"enum": [
{
"title": "Skipped (output set to 0x80000)",
"value": 0
},
{
"title": "oversampling \u00d71",
"value": 1
},
{
"title": "oversampling \u00d72",
"value": 2
},
{
"title": "oversampling \u00d74",
"value": 3
},
{
"title": "oversampling \u00d78",
"value": 4
},
{
"title": "oversampling \u00d716",
"value": 5
}
]
},
"osrs_p": {
"title": "osrs_p",
"description": "Controls oversampling of pressure data. See chapter 3.3

This looks really good. But I didn’t truncate this response to fit within the blog post; Gemini cut off the response itself. It’s hard to tell ahead of time how long the output is going to be. Using the API, one could probably change the configuration to allow longer outputs, in which case the JSON response might actually be complete and accurate.
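Whatever the configuration, it helps to detect a cut-off response before trying to use it. The sketch below is a simple heuristic, not part of the project: looksTruncated is a hypothetical function that reports whether JSON text ends inside a string or with unclosed braces or brackets.

```typescript
/** Hypothetical heuristic: true when JSON text stops mid-string or with open braces/brackets. */
function looksTruncated(jsonText: string): boolean {
  let depth = 0
  let inString = false
  let escaped = false
  for (let i = 0; i < jsonText.length; i++) {
    const ch = jsonText[i]
    if (inString) {
      // Inside a string only escapes and the closing quote matter.
      if (escaped) escaped = false
      else if (ch === '\\') escaped = true
      else if (ch === '"') inString = false
    } else if (ch === '"') {
      inString = true
    } else if (ch === '{' || ch === '[') {
      depth++
    } else if (ch === '}' || ch === ']') {
      depth--
    }
  }
  return inString || depth > 0
}

console.log(looksTruncated('{"description": "See chapter 3.3'))   // true
console.log(looksTruncated('{"description": "See chapter 3.3"}')) // false
```

A check like this only says the text is incomplete. It cannot recover the missing fields, so the response would still need to be regenerated or requested in smaller pieces.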

Conclusion

I don’t think AI is going to be taking my job. In fact, I imagine embedded engineering will keep growing as AI, particularly running locally, becomes more capable. In my opinion, a lot of embedded engineering code is not high-quality, which is why I think Cyanobyte can be useful.

The work an embedded engineer does is much broader than code. They need to manage hardware and firmware and be very clear on documentation, since mistakes can cause things to literally blow up.

This Cyanobyte Scanner project is not quite ready for developers today. But I do believe it’s a good example of using AI as an assistant to help out on projects which have a great deal of complexity. It can help make sense of that complexity. Is it better to do a bunch of greenfield research or to fact check a pre-generated copy?

I don’t quite know. If I used this tool a lot, particularly in my normal job, I’d have stronger opinions one way or another. But until then I’ll continue to experiment with new ways to improve the embedded software landscape.


Nick Felker

Social Media Expert -- Rowan University 2017 -- IoT & Assistant @ Google