CNKI Paper Detail Extraction
Extract complete metadata from a CNKI paper detail page.
Arguments
$ARGUMENTS is an optional CNKI paper detail URL (one containing kcms2/article/abstract). If it is omitted, the current page is assumed to already be a paper detail page.
Steps
1. Navigate to the paper page (if URL provided)
If $ARGUMENTS contains a URL:
- Use mcp__chrome-devtools__navigate_page with the URL.
- Use mcp__chrome-devtools__wait_for with text ["摘要"] and timeout 15000.
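The check for whether $ARGUMENTS holds a detail URL can be sketched as below. The helper name isCnkiDetailUrl and the sample URLs are illustrative assumptions, not part of the original command; only the kcms2/article/abstract path fragment comes from the text above.

```javascript
// Hypothetical helper: returns true when the argument looks like a
// CNKI paper detail URL (its path contains kcms2/article/abstract).
function isCnkiDetailUrl(arg) {
  if (typeof arg !== 'string' || arg.trim() === '') return false;
  try {
    const url = new URL(arg.trim());
    const isHttp = url.protocol === 'http:' || url.protocol === 'https:';
    return isHttp && url.pathname.includes('kcms2/article/abstract');
  } catch {
    // Not a parseable absolute URL, so treat it as "no URL provided".
    return false;
  }
}
```

When this returns false, step 1 is skipped and the current page is used as-is.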
2. Check for captcha
Use mcp__chrome-devtools__take_snapshot. If the snapshot contains the text "拖动下方拼图完成验证", notify the user:
CNKI is showing a slider captcha. Please complete the puzzle verification manually in the Chrome browser, then tell me to continue.
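A minimal sketch of the captcha check, assuming the snapshot has already been obtained as a plain string. The names CAPTCHA_PROMPT and hasSliderCaptcha are hypothetical; the prompt text itself is the one quoted above.

```javascript
// Prompt text CNKI shows on its slider captcha (taken from this document).
const CAPTCHA_PROMPT = '拖动下方拼图完成验证';

// Hypothetical helper: true when the snapshot text contains the captcha prompt.
function hasSliderCaptcha(snapshotText) {
  return typeof snapshotText === 'string' && snapshotText.includes(CAPTCHA_PROMPT);
}
```

If this returns true, pause and wait for the user before running the extraction in step 3.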
3. Extract paper metadata via JavaScript
Use mcp__chrome-devtools__evaluate_script with this function:
() => {
const brief = document.querySelector('.brief');
if (!brief) return { error: 'Paper detail section (.brief) not found' };
// Title
const title = brief.querySelector('h1')?.innerText?.trim()
?.replace(/\s*附视频\s*$/, '') // remove "附视频" suffix
?.replace(/\s*网络首发\s*$/, ''); // remove "网络首发" suffix
// Authors - first h3.author contains author links with sup tags
const authorH3s = brief.querySelectorAll('h3.author');
const authorSection = authorH3s[0];
const authors = [];
if (authorSection) {
const authorLinks = authorSection.querySelectorAll('a');
authorLinks.forEach(a => {
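const raw = a.innerText?.trim() || '';
// Trailing superscript numbers, possibly comma-separated (e.g. "张三1,2")
const supMatch = raw.match(/(\d+(?:\s*[,，]\s*\d+)*)$/);
const name = (supMatch ? raw.slice(0, supMatch.index) : raw).trim();
const affiliationNum = supMatch ? supMatch[1] : '';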
authors.push({ name, affiliationNum });
});
}
// Affiliations - second h3.author contains org links
const affiliations = [];
if (authorH3s.length > 1) {
const orgLinks = authorH3s[1].querySelectorAll('a');
orgLinks.forEach(a => {
affiliations.push(a.innerText?.trim());
});
}
// Abstract
const abstractEl = document.querySelector('.abstract-text');
const abstract = abstractEl?.innerText?.trim() || '';
// Keywords
const keywordsP = document.querySelector('p.keywords');
const keywords = keywordsP
? Array.from(keywordsP.querySelectorAll('a')).map(a => a.innerText?.replace(/;$/, '').trim())
: [];
// Fund
const fundsP = document.querySelector('p.funds');
const fund = fundsP?.innerText?.trim() || '';
// Classification code
const clcCode = document.querySelector('.clc-code');
const classification = clcCode?.innerText?.trim() || '';
// Journal/source
const docTop = document.querySelector('.doc-top');
const journal = docTop?.querySelector('a')?.innerText?.trim() || '';
// Online first / publication info
const headTime = document.querySelector('.head-time');
const pubInfo = headTime?.innerText?.trim() || '';
// Is online first?
const isOnlineFirst = !!brief.querySelector('.icon-shoufa');
// Article outline/TOC
const catalogList = document.querySelector('.catalog-list, .catalog-listDiv');
const toc = catalogList?.innerText?.trim() || '';
// Citation network counts
const citationTabs = document.querySelectorAll('ul.module-tab.tpl_lieteratures li');
const citationInfo = {};
citationTabs.forEach(li => {
const id = li.getAttribute('data-id');
const text = li.innerText?.trim() || ''; // default to '' so match() below cannot throw
const countMatch = text.match(/(\d+)/);
if (id) {
citationInfo[id] = {
label: text.replace(/\d+/, '').trim(),
count: countMatch ? parseInt(countMatch[1], 10) : 0
};
}
});
return {
title,
authors,
affiliations,
abstract,
keywords,
fund,
classification,
journal,
pubInfo,
isOnlineFirst,
toc,
citationInfo
};
}
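The script returns authors with an affiliationNum and affiliations as separate lists, so they still have to be joined before display. One possible way to do that is sketched below. The function name resolveAffiliations is hypothetical, and the sketch assumes affiliation strings carry a leading "N." prefix such as "1.北京大学".

```javascript
// Hypothetical helper: attach affiliation names to each author by number.
function resolveAffiliations(authors, affiliations) {
  // Build a lookup from the leading number: "1.北京大学" gives "1" => "北京大学".
  const byNum = new Map();
  for (const raw of affiliations) {
    const m = /^(\d+)\s*[.．、]\s*(.+)$/.exec(raw || '');
    if (m) byNum.set(m[1], m[2].trim());
  }
  return authors.map(({ name, affiliationNum }) => {
    // An author may reference several affiliations, e.g. "1,2".
    const nums = String(affiliationNum || '')
      .split(/[,，]/)
      .map(s => s.trim())
      .filter(Boolean);
    // Unknown numbers are dropped rather than guessed.
    return { name, affiliations: nums.map(n => byNum.get(n)).filter(Boolean) };
  });
}
```

Authors without a number, or with a number that has no matching entry, get an empty affiliations array.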
4. Format and present the output
## {title} {isOnlineFirst ? "[网络首发]" : ""}
**Authors:**
{For each author: "- {name} ({affiliation whose number matches affiliationNum})"}
**Affiliations:**
{For each affiliation: "- {affiliation}"}
**Journal:** {journal}
**Publication Info:** {pubInfo}
**Abstract:**
{abstract}
**Keywords:** {keywords joined by ", "}
**Fund:** {fund}
**Classification:** {classification}
**Citation Network:**
{For each citation type: "- {label}: {count}"}
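The template above could be rendered with a function along these lines. This is a sketch under stated assumptions: formatPaper is a hypothetical name, the input is the object returned in step 3, and each author's affiliation is looked up by matching the "N." prefix of the affiliation strings (so the number stays in the output).

```javascript
// Hypothetical formatter for the object returned by the extraction script.
function formatPaper(p) {
  // Find the affiliation string that starts with "<num>." (e.g. "1.北京大学").
  const affiliationFor = num =>
    (p.affiliations || []).find(a => a.startsWith(`${num}.`)) || '';

  const lines = [
    `## ${p.title || ''}${p.isOnlineFirst ? ' [网络首发]' : ''}`,
    '**Authors:**',
    ...(p.authors || []).map(a => {
      const aff = a.affiliationNum ? affiliationFor(a.affiliationNum) : '';
      return aff ? `- ${a.name} (${aff})` : `- ${a.name}`;
    }),
    '**Affiliations:**',
    ...(p.affiliations || []).map(a => `- ${a}`),
    `**Journal:** ${p.journal || ''}`,
    `**Publication Info:** ${p.pubInfo || ''}`,
    '**Abstract:**',
    p.abstract || '',
    `**Keywords:** ${(p.keywords || []).join(', ')}`,
    `**Fund:** ${p.fund || ''}`,
    `**Classification:** ${p.classification || ''}`,
    '**Citation Network:**',
    ...Object.values(p.citationInfo || {}).map(c => `- ${c.label}: ${c.count}`),
  ];
  return lines.join('\n');
}
```

Empty fields are rendered as blank values rather than omitted, which keeps the layout stable across papers.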
5. Fallback: snapshot-based parsing
If JS extraction fails, use mcp__chrome-devtools__take_snapshot and parse the accessibility tree:
- Title: heading level 1 element
- Authors: link elements whose URLs contain kcms2/author/detail
- Affiliations: link elements whose URLs contain kcms2/organ/detail
- Abstract: StaticText following "摘要:"
- Keywords: link elements whose URLs contain kcms2/keyword/detail
- Fund: link elements following "基金资助:"
- Classification: StaticText following "分类号:"
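The URL-based rules in this list can be sketched as a small classifier. It assumes the link elements have already been pulled out of the snapshot into { text, url } objects; that shape, and the names classifyLink and groupLinks, are hypothetical. Only the three URL fragments come from the list above.

```javascript
// URL fragments from the fallback rules, mapped to the field they identify.
const LINK_KINDS = [
  ['kcms2/author/detail', 'authors'],
  ['kcms2/organ/detail', 'affiliations'],
  ['kcms2/keyword/detail', 'keywords'],
];

// Hypothetical helper: returns the field name for a link URL, or null.
function classifyLink(url) {
  const hit = LINK_KINDS.find(([fragment]) => String(url || '').includes(fragment));
  return hit ? hit[1] : null;
}

// Hypothetical helper: group link texts by field; unrelated links are ignored.
function groupLinks(links) {
  const groups = { authors: [], affiliations: [], keywords: [] };
  for (const { text, url } of links) {
    const kind = classifyLink(url);
    if (kind) groups[kind].push(text);
  }
  return groups;
}
```

The text-based fields (abstract, fund, classification) still need to be located by their preceding labels in the snapshot.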
Verified DOM Selectors
| Data | Selector | Notes |
|---|---|---|
| Paper section | .brief | Main paper info container |
| Title | .brief h1 | May contain icons, clean text needed |
| Authors | .brief h3.author:first-of-type a | Text has superscript numbers (e.g., "张三1") |
| Affiliations | .brief h3.author:nth-of-type(2) a | Text starts with "N." (e.g., "1.北京大学") |
| Abstract | .abstract-text | Full abstract text |
| Keywords | p.keywords a | Semicolon-separated keyword links |
| Fund | p.funds | Fund information text |
| Classification | .clc-code | CLC classification codes |
| Journal | .doc-top a | Source journal link |
| Online first | .brief .icon-shoufa | Present if paper is online first |
| Citation tabs | ul.module-tab.tpl_lieteratures li | data-id attr identifies type |