Skip to content

DasLab/pdb_map

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 

Repository files navigation

pdb_map

Create datasets that bring together RNA 3D structures & chemical mapping

Basic command

usage: create_pdb_sequences_csv.py [-h] --targets_file TARGETS_FILE --fasta_file FASTA_FILE [--cif_dir CIF_DIR] [--rdat_2A3_file RDAT_2A3_FILE] [--rdat_DMS_file RDAT_DMS_FILE] [--start_idx START_IDX] [--end_idx END_IDX]
                                   [--dataset_name DATASET_NAME] [-o OUTDIR] [--align_map_to_target_sequence] [--use_target_order] [--max_len_target MAX_LEN_TARGET]

Prepare .csv files integrating PDB 3D output and chemical mapping data

options:
  -h, --help            show this help message and exit
  --targets_file TARGETS_FILE
                        CSV file with columns including "target_id" and "sequence". Default is `test_sequences.csv`.
  --fasta_file FASTA_FILE
                        FASTA file with sequences that were chemically mapped
  --cif_dir CIF_DIR     Directory to search for previously downloaded cif.gz files
  --rdat_2A3_file RDAT_2A3_FILE
                        File with 2A3 data in RDAT format
  --rdat_DMS_file RDAT_DMS_FILE
                        File with DMS data in RDAT format
  --start_idx START_IDX
                        Start index (1,2,...) of test_sequences to work on, for parallelization. Default: 0 (do all sequences).
  --end_idx END_IDX     End index (1,2,...) of test_sequences to work on, for parallelization. Default: 0 (do all sequences).
  --dataset_name, --name DATASET_NAME
                        full dataset_name, tag for csvs
  -o, --outdir OUTDIR   Where to save output CSVs (Default ./)
  --align_map_to_target_sequence
                        legacy: align chem map CSV output to target residues. Default false (align chem map CSV to fasta used for chemical mapping).
  --use_target_order    legacy: use order of targets in --target_file, not in fasta_file
  --max_len_target MAX_LEN_TARGET
                        maximum length of sequence to output to wide xyz, 2A3, and DMS csv files (only relevant with use_target_order)

Example

Go into the directory example/

cd example

and type the command in README_RUN:

 python3 ../create_pdb_sequences_csv.py \
 	 --targets_file train_sequences.v2.csv \
    --fasta_file train2_pdb130_RNA_sequences.fasta \
    --cif_dir PDB_RNA \
	 --rdat_2A3_file RDAT/train2_pdb130_RTB002_2A3.rdat \
	 --rdat_DMS_file RDAT/train2_pdb130_RTB000_DMS.rdat \
	 --dataset_name train2_pdb130 \
	 --start_idx 1001 --end_idx 1005 

This provides the list of target sequences (--sequence_file, here train_sequences.v2.csv from the Stanford RNA 3D Folding Kaggle competition, the location of any pre-downloaded CIF files from the PDB (here PDB_RNA), chemical mapping data (2A3 and DMS) in RDAT format, and a dataset_name to label the output csvs (--dataset_name).

For illustration purposes, the optional --start_idx and --end_idx are set to be 1001 and 1005, and so the script will look at only 5 sequences. Don't provide these flags if you want to run through all the sequences that are mapped in the RDAT/fasta files.

Output files (examples are in example_output/):

  • train2_pdb130.labels.1001_1005.csv - 'narrow' file with x,y,z of C1' coordinates, and DMS and 2A3 (and errors), one row for each residue in each target sequence for which chemical mapping data was found in the RDAT files. Similar in format to labels for Stanford RNA 3D Folding Kaggle competition
  • train2_pdb130.2A3.1001_1005.csv,train2_pdb130.DMS.1001_1005.csv- 'wide' files with 2A3/DMS and errors. Similar in format totrain_data.csv` in another Kaggle competition, the Stanford Ribonanza challenge.
  • train2_pdb130.xyz.1001_1005.csv - 'wide' files with C1' x,y,z and residue numbers in the PDB.
  • train2_pdb130.sequences.1001_1005.csv - same information and format as input sequences file, but just for the targets for which chemical mapping data were found in the RDAT files.

and there's also:

  • color_data.pml - script that can be run in Pymol with run color_data.pml to look through each PDB file, color by the 2A3 and DMS data, and output to map_data/ folder. You'll need to install Ribovis to run this.
  • map_data - Files like 7TAX_M.2A3.txt that contain resnum reactivity for coloring PDB files.

The output to terminal gives information on how the target sequences are aligned into the sequences that were subjected to chemical mapping, which include constant flanking regions and barcode hairpins:

 1001 7Z4I_A
Found resolved sequence in fasta
target            0 GGGAACGACUCGAGUAGAGUCGAAAAGUAGCACGUACGUGUGCUACAAGCACAUUGGAUA
                  0 ------------------------------------------------------------
query             0 ------------------------------------------------------------

target           60 ----CGCAUAAAGAUGAGACGCGUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAG
                 60 ----||||||||||||||||||||||||||||||||||||||||||||||||||||||||
query            60 GGGACGCAUAAAGAUGAGACGCGUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAG

target          120 UCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUU--ACUGUUUUACCUUUCGAG
                120 ||||||||||||||||||||||||||||||||||||||||--------------------
query           120 UCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUUU------------------

target          180 GUAAAGCAGUCAAAAGAAACAACAACAACAAC 212
                180 -------------------------------- 212
query           180 -------------------------------- 212

Created: map_data/7Z4I_A.2A3.txt
Created: map_data/7Z4I_A.DMS.txt
Completed 1 sequences

 1002 1ATV_A
Found entire target in fasta
target            0 GGGAACGACUCGAGUAGAGUCGAAAAGAACAGGCCGUGAGGCUUGUUCAAGACAUCGUGU
                  0 ------------------------------------------------------------
query             0 ------------------------------------------------------------

target           60 GAGCGAUGUCAACUCAACACGUGAGUGUUGAGAACAUGAUCGGUGACGAUCGUGAACGUA
                 60 ------------------------------------------------------------
query            60 ------------------------------------------------------------

target          120 CUUUACGAAGUGCGAAAUAGGGACCAGAAGGUCCCGACUUAGAGGUACGUGAGUACUUCU
                120 -------------------|||||||||||||||||------------------------
query           120 -------------------GGGACCAGAAGGUCCCG------------------------

target          180 AAGUGAAAAGAAACAACAACAACAAC 206
                180 -------------------------- 206
query           180 -------------------------- 206

...

In addition, for this particular example, there were mismatches in many sequences that were mapped by DMS/2A3 and what was in the PDB, e.g. for some sequences only the residues that were resolved in the PDB structure were included for chemical mapping. The script automatically recognizes both of these possibilities. And if it fails, as a backup, the scripts uses Biopython's PairwiseAligner, to guess the correspondences of target sequence to PDB and chemical mapping constructs.

Tips

  • Only targets for which sequences are the RDAT/fasta files are output and, by default, they are in order of appearance in that file.
  • The sequences in the RDAT file have to match the ones in the FASTA file.
  • RDAT is a format used in Stanford RNA Mapping Database. Basic scripts for manipulation are available at RDATkit.
  • If you don't have the .cif.gz files downloaded, the script will download them as needed into a folder PDB/. But it can be faster to pre-download them -- check out the PDB's batch download service. For the example above, all the RNA-containing entries in the PDB as of May 2025 are at this Google Drive link.
  • If parallelizing on a cluster, have --start_idx and --end_idx set to pull out chunks for each job. Then you can concatenate all the chunks of each .csv file with, e.g., csvstack.

About

Create datasets that bring together RNA 3D structures & chemical mapping

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors