Create datasets that bring together RNA 3D structures & chemical mapping
usage: create_pdb_sequences_csv.py [-h] --targets_file TARGETS_FILE --fasta_file FASTA_FILE [--cif_dir CIF_DIR] [--rdat_2A3_file RDAT_2A3_FILE] [--rdat_DMS_file RDAT_DMS_FILE] [--start_idx START_IDX] [--end_idx END_IDX]
[--dataset_name DATASET_NAME] [-o OUTDIR] [--align_map_to_target_sequence] [--use_target_order] [--max_len_target MAX_LEN_TARGET]
Prepare .csv files integrating PDB 3D output and chemical mapping data
options:
-h, --help show this help message and exit
--targets_file TARGETS_FILE
CSV file with columns including "target_id" and "sequence". Default is `test_sequences.csv`.
--fasta_file FASTA_FILE
FASTA file with sequences that were chemically mapped
--cif_dir CIF_DIR Directory to search for previously downloaded cif.gz files
--rdat_2A3_file RDAT_2A3_FILE
File with 2A3 data in RDAT format
--rdat_DMS_file RDAT_DMS_FILE
File with DMS data in RDAT format
--start_idx START_IDX
Start index (1,2,...) of test_sequences to work on, for parallelization. Default: 0 (do all sequences).
--end_idx END_IDX End index (1,2,...) of test_sequences to work on, for parallelization. Default: 0 (do all sequences).
--dataset_name, --name DATASET_NAME
full dataset_name, tag for csvs
-o, --outdir OUTDIR Where to save output CSVs (Default ./)
--align_map_to_target_sequence
legacy: align chem map CSV output to target residues. Default false (align chem map CSV to fasta used for chemical mapping).
--use_target_order legacy: use order of targets in --target_file, not in fasta_file
--max_len_target MAX_LEN_TARGET
maximum length of sequence to output to wide xyz, 2A3, and DMS csv files (only relevant with use_target_order)
Go into the directory example/
cd example
and type the command in README_RUN:
python3 ../create_pdb_sequences_csv.py \
--targets_file train_sequences.v2.csv \
--fasta_file train2_pdb130_RNA_sequences.fasta \
--cif_dir PDB_RNA \
--rdat_2A3_file RDAT/train2_pdb130_RTB002_2A3.rdat \
--rdat_DMS_file RDAT/train2_pdb130_RTB000_DMS.rdat \
--dataset_name train2_pdb130 \
--start_idx 1001 --end_idx 1005
This provides the list of target sequences (--sequence_file, here train_sequences.v2.csv from the Stanford RNA 3D Folding Kaggle competition, the location of any pre-downloaded CIF files from the PDB (here PDB_RNA), chemical mapping data (2A3 and DMS) in RDAT format, and a dataset_name to label the output csvs (--dataset_name).
For illustration purposes, the optional --start_idx and --end_idx are set to be 1001 and 1005, and so the script will look at only 5 sequences. Don't provide these flags if you want to run through all the sequences that are mapped in the RDAT/fasta files.
Output files (examples are in example_output/):
train2_pdb130.labels.1001_1005.csv- 'narrow' file withx,y,zof C1' coordinates, and DMS and 2A3 (and errors), one row for each residue in each target sequence for which chemical mapping data was found in the RDAT files. Similar in format tolabelsfor Stanford RNA 3D Folding Kaggle competitiontrain2_pdb130.2A3.1001_1005.csv,train2_pdb130.DMS.1001_1005.csv- 'wide' files with 2A3/DMS and errors. Similar in format totrain_data.csv` in another Kaggle competition, the Stanford Ribonanza challenge.train2_pdb130.xyz.1001_1005.csv- 'wide' files with C1'x,y,zand residue numbers in the PDB.train2_pdb130.sequences.1001_1005.csv- same information and format as input sequences file, but just for the targets for which chemical mapping data were found in the RDAT files.
and there's also:
color_data.pml- script that can be run in Pymol withrun color_data.pmlto look through each PDB file, color by the 2A3 and DMS data, and output tomap_data/folder. You'll need to install Ribovis to run this.map_data- Files like7TAX_M.2A3.txtthat containresnum reactivityfor coloring PDB files.
The output to terminal gives information on how the target sequences are aligned into the sequences that were subjected to chemical mapping, which include constant flanking regions and barcode hairpins:
1001 7Z4I_A
Found resolved sequence in fasta
target 0 GGGAACGACUCGAGUAGAGUCGAAAAGUAGCACGUACGUGUGCUACAAGCACAUUGGAUA
0 ------------------------------------------------------------
query 0 ------------------------------------------------------------
target 60 ----CGCAUAAAGAUGAGACGCGUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAG
60 ----||||||||||||||||||||||||||||||||||||||||||||||||||||||||
query 60 GGGACGCAUAAAGAUGAGACGCGUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAG
target 120 UCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUU--ACUGUUUUACCUUUCGAG
120 ||||||||||||||||||||||||||||||||||||||||--------------------
query 120 UCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUUU------------------
target 180 GUAAAGCAGUCAAAAGAAACAACAACAACAAC 212
180 -------------------------------- 212
query 180 -------------------------------- 212
Created: map_data/7Z4I_A.2A3.txt
Created: map_data/7Z4I_A.DMS.txt
Completed 1 sequences
1002 1ATV_A
Found entire target in fasta
target 0 GGGAACGACUCGAGUAGAGUCGAAAAGAACAGGCCGUGAGGCUUGUUCAAGACAUCGUGU
0 ------------------------------------------------------------
query 0 ------------------------------------------------------------
target 60 GAGCGAUGUCAACUCAACACGUGAGUGUUGAGAACAUGAUCGGUGACGAUCGUGAACGUA
60 ------------------------------------------------------------
query 60 ------------------------------------------------------------
target 120 CUUUACGAAGUGCGAAAUAGGGACCAGAAGGUCCCGACUUAGAGGUACGUGAGUACUUCU
120 -------------------|||||||||||||||||------------------------
query 120 -------------------GGGACCAGAAGGUCCCG------------------------
target 180 AAGUGAAAAGAAACAACAACAACAAC 206
180 -------------------------- 206
query 180 -------------------------- 206
...
In addition, for this particular example, there were mismatches in many sequences that were mapped by DMS/2A3 and what was in the PDB, e.g. for some sequences only the residues that were resolved in the PDB structure were included for chemical mapping. The script automatically recognizes both of these possibilities. And if it fails, as a backup, the scripts uses Biopython's PairwiseAligner, to guess the correspondences of target sequence to PDB and chemical mapping constructs.
Tips
- Only targets for which sequences are the RDAT/fasta files are output and, by default, they are in order of appearance in that file.
- The sequences in the RDAT file have to match the ones in the FASTA file.
- RDAT is a format used in Stanford RNA Mapping Database. Basic scripts for manipulation are available at RDATkit.
- If you don't have the .cif.gz files downloaded, the script will download them as needed into a folder
PDB/. But it can be faster to pre-download them -- check out the PDB's batch download service. For the example above, all the RNA-containing entries in the PDB as of May 2025 are at this Google Drive link. - If parallelizing on a cluster, have
--start_idxand--end_idxset to pull out chunks for each job. Then you can concatenate all the chunks of each .csv file with, e.g.,csvstack.