Add Pathway Commons v14 data analysis report and plots #5

Open
chronicgiardia wants to merge 2 commits into PathwayCommons:master from chronicgiardia:master
@chronicgiardia

Summary

Exploratory data analysis of the Pathway Commons v14 dataset (pc-hgnc.txt.gz), including:

  • REPORT.md — full analysis report covering interaction types, data sources, network connectivity (degree distribution), and data quality notes
  • interaction_types.png — bar chart of top 10 interaction types
  • data_sources.png — bar chart of top 10 data sources
  • degree_distribution.png — histogram and log-log scatter of gene degree distribution

Key Findings

  • 2,484,221 clean interaction rows across 13 interaction types
  • 20 primary data sources (BioGRID and CTD dominate)
  • 40,684 unique genes/entities with scale-free degree distribution
  • Top gene hubs: RORA (7,078 connections), NOG (6,087)
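One way the hub counts above could be reproduced is sketched below in Python. It assumes, without confirmation from the report, that the file is tab-separated with `PARTICIPANT_A`, `INTERACTION_TYPE`, and `PARTICIPANT_B` as the first three columns, and that a "connection" means a distinct neighbour rather than a raw row count. The function name `degree_counts` and the sample rows are hypothetical and are not taken from REPORT.md.

```python
from collections import Counter


def degree_counts(lines):
    """Count distinct neighbours per participant in tab-separated SIF-like rows.

    Assumption: columns 1-3 are PARTICIPANT_A, INTERACTION_TYPE, PARTICIPANT_B.
    Rows with fewer than three fields and the header row are skipped.
    """
    neighbours = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "PARTICIPANT_A":
            continue
        a, b = fields[0], fields[2]
        # Treat the network as undirected: each row links both endpoints.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return Counter({node: len(adj) for node, adj in neighbours.items()})


# Made-up sample rows for illustration only (not real dataset content).
sample_rows = [
    "PARTICIPANT_A\tINTERACTION_TYPE\tPARTICIPANT_B",
    "RORA\tinteracts-with\tNOG",
    "RORA\tcontrols-expression-of\tNOG",
    "chebi:78510\tchemical-affects\tRORA",
    "",
]
degrees = degree_counts(sample_rows)
top_hub = degrees.most_common(1)
```

On a real file, the same function could be fed an open text handle, for example `gzip.open("data.gz", "rt")`, assuming the local filename from the README. If the report counted rows instead of distinct neighbours, the totals would differ.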

Co-Authored-By: Oz <oz-agent@warp.dev>

- REPORT.md: full EDA report covering interaction types, data sources,
  network connectivity, and data quality notes
- interaction_types.png: top 10 interaction types bar chart
- data_sources.png: top 10 data sources bar chart
- degree_distribution.png: degree distribution histogram and log-log plot

Co-Authored-By: Oz <oz-agent@warp.dev>
Copilot AI review requested due to automatic review settings April 8, 2026 05:53

Copilot AI left a comment

Pull request overview

Adds an exploratory analysis report for the Pathway Commons v14 Extended SIF dataset and includes the generated plots referenced by the report. This complements the repo’s purpose (working with PC v14 SIF data) by documenting dataset composition, source contributions, and graph connectivity characteristics.

Changes:

  • Add REPORT.md describing interaction type distribution, data sources, degree distribution, and data quality notes for PC v14.
  • Add plots for interaction types, data sources, and degree distribution.
  • Reference these plots from the report for a self-contained writeup.

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| REPORT.md | New analysis report with dataset summary, connectivity stats, and embedded plot references. |
| interaction_types.png | Plot used by the report to visualize top interaction types. |
| data_sources.png | Plot used by the report to visualize top data sources. |
| degree_distribution.png | Plot used by the report to visualize the degree distribution. |

# Pathway Commons v14 — Data Analysis Report

## Dataset Overview
- **Source:** Pathway Commons v14 (`pc-hgnc.txt.gz` from `download.baderlab.org`)
Copilot AI Apr 8, 2026

The dataset reference here (pc-hgnc.txt.gz from download.baderlab.org) doesn’t match the repo’s documented download + naming convention (README uses data.gz downloaded from pathwaycommons.org/archives/.../PC14.All.hgnc.txt.gz). To avoid confusion for users trying to reproduce the analysis, please align this source line with README (same URL/filename) or explicitly state that this file is equivalent and how it was obtained/renamed.

Suggested change
- - **Source:** Pathway Commons v14 (`pc-hgnc.txt.gz` from `download.baderlab.org`)
+ - **Source:** Pathway Commons v14 (`PC14.All.hgnc.txt.gz` from `pathwaycommons.org/archives/...`, saved locally as `data.gz` per README)

Comment on lines +99 to +100
## Generated Artifacts
- `data.gz` — raw Pathway Commons v14 data file
Copilot AI Apr 8, 2026

data.gz is listed under “Generated Artifacts”, but it’s an input dataset (downloaded) rather than something produced by the analysis, and it isn’t included in this PR. Consider renaming this section to distinguish inputs vs. outputs, or move data.gz to “Dataset Overview” / “Inputs” and keep this section limited to committed/generated outputs (the report + plots).

Suggested change
- ## Generated Artifacts
- - `data.gz` — raw Pathway Commons v14 data file
+ ## Generated Outputs

| 17 | **NOG** | **6,087** | **Gene** |
| 18 | chebi:78510 | 6,004 | Small molecule |
| 19 | chebi:23414 | 5,986 | Small molecule |
| 20 | CHEBI:45713 | 5,808 | Small molecule |
Copilot AI Apr 8, 2026

The ChEBI identifier casing is inconsistent (CHEBI:45713 vs chebi:... above). If these are meant to be the same identifier namespace, consider normalizing the casing in this table (or note why this one differs) so readers don’t interpret it as a different ID format.

Suggested change
- | 20 | CHEBI:45713 | 5,808 | Small molecule |
+ | 20 | chebi:45713 | 5,808 | Small molecule |
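If the mixed casing comes from the input data and not from a typo in the table, normalizing the prefix before counting would keep `CHEBI:45713` and `chebi:45713` from being treated as separate entities. The sketch below is one possible approach and is not taken from the PR. The helper name `normalize_chebi` is hypothetical, and lowercase is chosen only because it matches the other rows in the table.

```python
def normalize_chebi(identifier, prefix="chebi:"):
    """Return the identifier with its ChEBI prefix lowercased.

    Identifiers that do not start with the prefix (in any casing),
    such as gene symbols, are returned unchanged.
    """
    if identifier.lower().startswith(prefix):
        return prefix + identifier[len(prefix):]
    return identifier


# Identifiers as they appear in the table rows quoted above.
raw_ids = ["NOG", "chebi:78510", "chebi:23414", "CHEBI:45713"]
normalized_ids = [normalize_chebi(i) for i in raw_ids]
```

Applying such a step before the degree computation would also merge any connections that were split across the two casings, so the reported counts could change.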
