What an automated rating CAN tell you
An automated RocSite Discovery rating tells you, deterministically and reproducibly, whether a submission satisfies the methodological criteria encoded in our published protocol at the version stamped on the rating.
Concretely, an automated rating can answer:
- Does the submission meet our pre-registration discipline? The engine checks the 8 OSF rules: filed before analysis, hypothesis stated explicitly, cohort defined, outcome variables defined, analysis plan pre-specified, falsification criteria stated, registration immutable, deviations flagged. Each rule returns pass or fail (see the sketch after this list).
- Do the falsification gates pass? When a gate is implementable for a submission's domain (e.g., care-process leakage detection on a clinical AI claim), the engine reports pass / fail / not applicable for that gate.
- Is the cited prior literature consistent with the new claim? The engine cross-references claimed effect sizes against meta-analyses and primary literature, where it has access to a comparable corpus.
- Does the submission survive the same protocol applied to RocSite's own work? We hold our own published findings to the same gates we hold yours to. A rating reports your submission against that consistent bar.
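As a rough illustration of the shape of those answers, a minimal sketch of a rating record follows. The field and rule names are hypothetical, not our actual schema; the point is that the outcome is stamped with a protocol version and reported per rule and per gate.

```python
# Hypothetical rating record, sketched in Python; the field and rule
# names are illustrative, not the real RocSite Discovery schema.
rating = {
    "protocol_version": "1.4.2",       # the version stamped on the rating
    "preregistration": {               # pass or fail for each of the 8 OSF rules
        "filed_before_analysis": "pass",
        "hypothesis_stated": "pass",
        "cohort_defined": "pass",
        "outcomes_defined": "pass",
        "analysis_plan_prespecified": "pass",
        "falsification_criteria": "fail",
        "immutability": "pass",
        "deviations_flagged": "pass",
    },
    "falsification_gates": {           # pass / fail / not applicable per gate
        "care_process_leakage": "pass",
        "effect_size_vs_literature": "not_applicable",
    },
}
```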
What an automated rating CANNOT tell you
Equally important: the engine produces methodological signal, not scientific truth.
- Whether the underlying science is correct. A submission can pass every gate and still be wrong about the world. Strong methodology is necessary, not sufficient.
- Whether the conclusions generalize. The engine evaluates internal validity against the protocol; external validity (does this hold in a different population, a different hospital, a different decade?) is a separate question we do not answer.
- Whether the data was collected ethically. We do not assess IRB approval, consent procedures, data-use agreements, or patient-protection regimes. That is the role of IRBs, regulators, and your institution.
- Whether the work is novel. The engine does not search the global literature for prior art. A submission can pass our protocol and still duplicate published work.
- Whether peer reviewers will agree. Peer reviewers bring domain judgment, taste, and disciplinary context that our protocol does not encode.
- Whether the work is fraudulent. We do not detect data fabrication, plagiarism, or image manipulation. Our gates are methodological, not forensic.
Known limitations of the current engine
To be specific: this is what we can and cannot do today, named.
- Domain coverage. The engine's gate library is most mature for clinical AI on ICU outcome data (MIMIC-IV, eICU-CRD). Gates for financial models, legal AI, and applied research are in active development and are presently audit-by-engagement only; they do not run automatically on submissions in those domains.
- Care-process leakage. The gate is implemented for clinical AI claims; the analogous gate for non-clinical domains is conceptual, not yet automated.
- Adversarial debate. The Advocate / Adversary / Arbiter debate layer is implemented for findings the engine itself produces. Submissions to the public registry currently get rule-by-rule feedback without the full debate trace; the debate layer is on the roadmap for paid tiers.
- Subgroup fairness. Where a submission's cohort definition lacks the demographic stratification needed to evaluate subgroup effects, the engine flags this as "scope: not assessable" rather than passing or failing the gate (the sketch after this list names the full set of gate outcomes). We do not invent stratification.
- Pre-registration timestamp. We accept the OSF or comparable timestamp at face value. We do not independently audit the registry's chain of custody. Disputes about an OSF registry entry's timestamp or chain of custody are handled by OSF directly. We can refer you to the appropriate OSF support channel.
- PDF parsing. OCR'd or image-only PDFs may produce degraded ratings because the engine cannot extract structured methodology from rendered images at the same fidelity as text PDFs. We flag this when detected.
- Language. The engine reads English. Non-English submissions are translated programmatically before review; ratings on translated submissions carry an explicit translation flag and reduced confidence.
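To pin down the vocabulary used above, the sketch below names the four outcomes a gate can report. The enum is illustrative; the engine's internal representation may differ.

```python
from enum import Enum

class GateOutcome(Enum):
    # Illustrative only; the engine's internal representation may differ.
    PASS = "pass"                      # the gate ran and the submission met it
    FAIL = "fail"                      # the gate ran and the submission did not meet it
    NOT_APPLICABLE = "not_applicable"  # the gate does not apply to this domain
    NOT_ASSESSABLE = "not_assessable"  # the gate applies, but the submission lacks
                                       # the data needed to evaluate it (the
                                       # subgroup-fairness case above)
```

The distinction matters: "not applicable" says nothing about the submission, while "not assessable" points at data the submission could add.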
Why automated review is valuable despite these limitations
Naming the limits makes the value sharper, not softer.
- Consistency. The same input produces the same rating, every time. There is no "the reviewer was tired today."
- Reproducibility. A rating record names its protocol version. Anyone with the protocol can, in principle, re-derive the rating from the submission (a minimal sketch follows this list). Ratings that cannot be reproduced are bugs we want to know about.
- No reviewer bias. The engine has no prior relationship with the submitter, no career stake in the submitter's institution, no conflict from prior peer review of the same author's work.
- No institutional conflict. A rating can be unfavorable to a paper from a major center without political consequence, because no human author of the rating exists.
- Accessibility. A pre-registration check is free. The protocol is public. The methodology is published. Anyone can audit the auditor.
- Speed and scale. Automated review runs in days, not months. We can process more submissions than any panel of human reviewers, applying the same standard to each.
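A minimal sketch of the consistency and reproducibility properties promised above, assuming a hypothetical evaluate() entry point. Any deterministic function of the submission text and the protocol version behaves this way; the real engine's interface differs, but the property is the same.

```python
import hashlib

def evaluate(submission_text: str, protocol_version: str) -> dict:
    # Stand-in for the real engine: the only property this sketch
    # demonstrates is determinism over (submission, protocol version).
    digest = hashlib.sha256(
        (protocol_version + "\n" + submission_text).encode("utf-8")
    ).hexdigest()
    return {"protocol_version": protocol_version, "rating_digest": digest}

r1 = evaluate("...submission text...", "1.4.2")
r2 = evaluate("...submission text...", "1.4.2")
assert r1 == r2  # same input + same protocol version => same rating
```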
How to use a rating well
- Read the rule-by-rule breakdown, not just the headline number. The breakdown is where the actionable information lives (see the sketch after this list).
- Treat "Exploratory" as a flag, not a verdict. An Exploratory rating means at least one gate is unmet or untestable, often because the underlying study design is single-dataset or pre-replication. This is normal at early stages of a research program.
- If a gate flagged a problem, fix the problem. Then resubmit. The engine is faster than peer review; the iteration loop is the point.
- Read this page again before you cite a rating in a high-stakes context. Citing a methodology audit as if it were peer review or regulatory clearance is a category error we cannot prevent on your behalf.
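Putting the first three points into practice, here is a minimal sketch of the iteration loop, reusing the hypothetical breakdown format from the sketches above; the gate names are invented for illustration.

```python
# Hypothetical per-gate breakdown; gate names are invented for illustration.
breakdown = [
    {"gate": "preregistration.falsification_criteria", "status": "fail"},
    {"gate": "falsification.care_process_leakage", "status": "pass"},
    {"gate": "fairness.subgroup_effects", "status": "not_assessable"},
]

# Work from the breakdown, not the headline: fix what failed and resubmit;
# "not assessable" calls for more data (here, demographic stratification),
# not a different analysis.
to_fix = [g["gate"] for g in breakdown if g["status"] == "fail"]
to_scope = [g["gate"] for g in breakdown if g["status"] == "not_assessable"]
print("fix and resubmit:", to_fix)
print("extend cohort data for:", to_scope)
```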
How to flag a defect in this protocol
If you believe the engine evaluated your submission against a faulty gate, the right place to file the bug is the right of reply process described in our Terms of Service. Methodology critiques of RocSite Discovery itself are welcomed and published. A live audit service that cannot accept criticism of its own audit is not a real audit service.