Fix/bitcoin block null check #9003

Open
Schnema1 wants to merge 2 commits into ElementsProject:master from Schnema1:fix/bitcoin-block-null-check

Schnema1 commented Mar 29, 2026

Problem described in #9002. Found during investigation of #8973.

Commit 40dd780 switched from pull_bitcoin_tx to pull_bitcoin_tx_only
but did not add a null check, causing a segfault (FATAL SIGNAL 11) if
pull_bitcoin_tx_only returns NULL due to a malformed block response
from bitcoind.

Add null check consistent with existing patterns in the codebase and
with pull_bitcoin_tx itself.
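
The pattern described above can be sketched in isolation. This is a minimal, self-contained illustration and not the actual Core Lightning code: `toy_tx`, `pull_toy_tx` and `parse_toy_block` are hypothetical stand-ins for `pull_bitcoin_tx_only` and its caller, and the "transaction" format (a one-byte length prefix followed by a payload) is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy transaction: a length-prefixed byte string. */
struct toy_tx {
	size_t len;
	uint8_t data[255];
};

/* Pull one toy tx from the cursor. Returns NULL (leaving the cursor
 * untouched) if the input is truncated or allocation fails. */
struct toy_tx *pull_toy_tx(const uint8_t **cursor, size_t *max)
{
	struct toy_tx *tx;
	size_t len;

	assert(cursor && *cursor && max);
	if (*max < 1)
		return NULL;
	len = (*cursor)[0];
	if (*max - 1 < len)
		return NULL;

	tx = malloc(sizeof(*tx));
	if (!tx)
		return NULL;
	tx->len = len;
	memcpy(tx->data, *cursor + 1, len);
	*cursor += 1 + len;
	*max -= 1 + len;
	return tx;
}

/* Parse num_txs toy txs. Returns the number parsed, or -1 on malformed
 * input. Without the NULL check, tx->len would dereference NULL. */
int parse_toy_block(const uint8_t *buf, size_t len, size_t num_txs,
		    size_t *total_payload)
{
	*total_payload = 0;
	for (size_t i = 0; i < num_txs; i++) {
		struct toy_tx *tx = pull_toy_tx(&buf, &len);

		if (!tx)
			return -1; /* propagate the failure to the caller */
		*total_payload += tx->len;
		free(tx);
	}
	return (int)num_txs;
}
```

The caller still has to decide what a failed parse means; the check only guarantees that the failure is reported instead of dereferenced.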

This commit adds the corrections.

Please review, as an AI found this fix.

This is my first ever commit; please excuse any errors and feel free to edit.

Checklist

Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:

  • The changelog has been updated in the relevant commit(s) according to the guidelines.
  • Tests have been added or modified to reflect the changes.
  • Documentation has been reviewed and updated as needed.
  • Related issues have been listed and linked, including any that this PR closes.
  • Important All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade

Fixes: 40dd780 bitcoin_block_from_hex: avoid creating PSBT wrappers for finalized block txs

cdecker force-pushed the fix/bitcoin-block-null-check branch from a8528ac to 38c5e56 on March 30, 2026 08:50

cdecker (Member) commented Mar 30, 2026

Thanks @Schnema1, that's great work hunting down the issue.

I'm afraid however the resolution is a bit more complicated, and could involve upgrading the libwally library, to not fail when decoding the block. Your current code returns NULL to the caller:

	if (!blk)
		bitcoin_plugin_error(call->bitcoind, buf, resulttok,
				     "getrawblockbyheight",
				     "bad block");

This in turn calls bitcoin_plugin_error which reports the error and then exits (loudly). Do you remember which block was to blame, was it always the same, and did the block sync continue past the block in question with the fix applied?
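
To illustrate the point about the caller: a NULL check in the parser only changes how the process stops if the caller treats a failed decode as fatal. The sketch below is hypothetical and self-contained. `decode_block`, `on_block_response` and `fatal_error` are invented names standing in for the decoder, the `getrawblockbyheight` response handler and `bitcoin_plugin_error`, and it assumes the fatal path ends in `abort()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoder: accepts only non-empty, even-length hex strings.
 * Returns false where a real decoder would return NULL. */
bool decode_block(const char *hex)
{
	size_t n;

	assert(hex != NULL);
	n = strlen(hex);
	if (n == 0 || n % 2 != 0)
		return false;
	return strspn(hex, "0123456789abcdefABCDEF") == n;
}

/* Hypothetical fatal handler: report, then abort (an assumption). */
void fatal_error(const char *method, const char *reason)
{
	fprintf(stderr, "%s: %s\n", method, reason);
	abort(); /* raises SIGABRT, which is signal 6 on Linux */
}

/* Hypothetical response handler: a bad block is reported and the
 * process stops; nothing here skips or retries the block. */
void on_block_response(const char *hex)
{
	if (!decode_block(hex))
		fatal_error("getrawblockbyheight", "bad block");
	printf("block accepted (%zu bytes)\n", strlen(hex) / 2);
}
```

Under these assumptions, a block that cannot be decoded still stops the node on every attempt, only with a clear error message instead of a segfault.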

Schnema1 (Author) commented

Ok, you are the pro regarding the coding and its next steps. I am not 100% sure it was always the same block causing the error, but I am pretty sure it was.

I am sorry, but most logs are deleted as my directory was full of crash logs.

Here are some hints I found during my research:

2026-03-24T10:52:57.281Z INFO    lightningd: Scanning for missed UTXOs from block 908329
2026-03-24T10:52:57.296Z DEBUG   lightningd: Adding block 921132: 000000000000000000003ec346b60ca6ca907db22374c9a2bc8a46a1a9007871
2026-03-24T10:52:57.397Z DEBUG   lightningd: Adding block 921133: 0000000000000000000111fb88a8c401214fc214177760c59761f324d76567b1

2026-03-24T10:08:35.929Z INFO    lightningd: Scanning for missed UTXOs from block 908329

After the AI found the SIGNAL 11 error, I added the suggested fix and recompiled; then the SIGNAL 6 error appeared. That apparently accelerated the troubleshooting.

After suggesting some smart grep commands we arrived here:
We have the blockhash. It's consistently failing on the same block:
0000000000000000000151b3a6e293f443602e1ad770b3578feeffd1d6eb8fe9

After digging more:

AI:
Very interesting. Height 908329 — notice this is exactly the block CLN logs as its UTXO scan start point:
2026-03-25T10:42:57.026Z INFO lightningd: Scanning for missed UTXOs from block 908329
So CLN always has to process this block on startup, which is why it crashes every time. This is not random at all.
The block has 260 transactions and the raw hex is ~3.6MB. The ReadCompactSize(): size too large error from bitcoind at verbosity 1/2 suggests one transaction in this block has an extremely large witness — almost certainly a large inscription.
Let's find the problematic transaction:
`# Decode the raw block and look at transaction sizes
bitcoin-cli getblock 0000000000000000000151b3a6e293f443602e1ad770b3578feeffd1d6eb8fe9 0 > /tmp/rawblock.hex

# Check the block stats which use a different code path
bitcoin-cli getblockstats 908329`

This then returned:
`# Decode the raw block and look at transaction sizes
bitcoin-cli getblock 0000000000000000000151b3a6e293f443602e1ad770b3578feeffd1d6eb8fe9 0 > /tmp/rawblock.hex

# Check the block stats which use a different code path
bitcoin-cli getblockstats 908329`

`bitcoin-cli getblockchaininfo | grep -E "pruned|pruneheight|prune"
"pruned": false,

bitcoin-cli getblockchaininfo | grep -E "blocks|headers|pruneheight"
"blocks": 942190,
"headers": 942190,

bitcoin-cli getblockfrompeer 0000000000000000000151b3a6e293f443602e1ad770b3578feeffd1d6eb8fe9 0
error code: -1
error message:
Block already downloaded`

So the block is supposedly downloaded but not on disk — that's a corrupted or incomplete bitcoind block database. This is the real root cause of everything.

Nailed down error:
tail -100 ~/.bitcoin/debug.log | grep -E "ERROR|error|corrupt"
2026-03-25T19:09:12Z [error] ReadBlock: Deserialize or I/O error - ReadCompactSize(): size too large: iostream error at FlatFilePos(nFile=5066, nPos=63528873)
2026-03-25T19:09:28Z [error] ReadBlock: Deserialize or I/O error - ReadCompactSize(): size too large: iostream error at FlatFilePos(nFile=5066, nPos=63528873)

AI: There it is! The block data at file 5066, position 63528873 is corrupted on disk. This is the root cause of everything — not a CLN bug at all (well, the missing null check is still a real bug, but it's not your actual problem).
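
For context on the `ReadCompactSize(): size too large` message: Bitcoin's serialization prefixes variable-length fields with a CompactSize integer, and a decoder that sanity-checks the value will reject corrupted bytes that happen to decode to a huge length. The following is a minimal sketch of such a decoder in C, not Bitcoin Core's implementation. `read_compact_size` and the `CS_*` names are hypothetical, and the limit `0x02000000` is assumed to correspond to Bitcoin Core's `MAX_SIZE`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sanity limit (32 MiB); labeled an assumption in the text above. */
#define CS_MAX_SIZE 0x02000000ULL

enum cs_status { CS_OK, CS_TRUNCATED, CS_NON_CANONICAL, CS_TOO_LARGE };

/* Decode one CompactSize from buf. On CS_OK, *out holds the value and
 * *used the number of bytes consumed; otherwise both are left untouched. */
enum cs_status read_compact_size(const uint8_t *buf, size_t len,
				 uint64_t *out, size_t *used)
{
	size_t width;
	uint64_t v = 0, min;

	assert(buf && out && used);
	if (len < 1)
		return CS_TRUNCATED;

	if (buf[0] < 253) {
		/* Single-byte form: the marker byte is the value itself. */
		*out = buf[0];
		*used = 1;
		return CS_OK;
	}
	if (buf[0] == 253) {
		width = 2;
		min = 253;
	} else if (buf[0] == 254) {
		width = 4;
		min = 0x10000ULL;
	} else {
		width = 8;
		min = 0x100000000ULL;
	}

	if (len - 1 < width)
		return CS_TRUNCATED;
	/* Little-endian payload follows the marker byte. */
	for (size_t i = 0; i < width; i++)
		v |= (uint64_t)buf[1 + i] << (8 * i);

	if (v < min)
		return CS_NON_CANONICAL; /* should have used a shorter form */
	if (v > CS_MAX_SIZE)
		return CS_TOO_LARGE; /* the "size too large" case */
	*out = v;
	*used = 1 + width;
	return CS_OK;
}
```

In this sketch, any length field whose bytes have been overwritten with garbage starting at `0xfe` or `0xff` is very likely to land in the `CS_TOO_LARGE` branch, which fits an on-disk corruption better than an unusually large transaction.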

Interestingly enough, I did not touch the bitcoin chain. The blocks directory is a symlink to the SATA drive, and that drive was not touched during OS cloning.

If the fix really causes trouble elsewhere, that is not good. But at least it got us to SIGNAL 6, which led to the solution. Take this with a grain of salt, as it is really beyond my knowledge. I could follow the AI's suggestions and learned a lot during this process. Let me know if you need more information.
