Data Construction Skill
Build supervision datasets from markdown books or long markdown documents.
Books are knowledge sources only. The final dataset must teach reusable domain knowledge and how to apply it. Do not generate book-comprehension questions, citation-led questions, or document-structure questions.
Default behavior is full coverage, not sampling.
Dataset objective
Compile book knowledge into three complementary supervision forms:
- concept_qa: teach atomic reusable knowledge such as definitions, categories, rules, mechanisms, purposes, and constraints.
- process_qa: teach concise, grounded reasoning patterns such as condition checking, rule application, causal explanation, comparison, exception handling, and step ordering.
- case_application: teach knowledge transfer into realistic but source-grounded scenarios where the model must analyze a situation and apply the book's knowledge.
Use all three forms when supported. Do not force all three forms for every chunk.
Inputs
Use either of these inputs:
- A directory of markdown files.
- One or more precomputed *.chunks.jsonl files.
If chunk files already exist, reuse them instead of re-splitting the source books.
Required completion rule
A task is complete only when every chunk has exactly one of the following outcomes:
- at least one supervision record written to a batch JSONL file and a matching kept record written to chunk_status.jsonl, or
- a recorded skipped decision in chunk_status.jsonl with a non-empty skip_reason
Do not stop after producing a small sample unless the user explicitly asks for a sample.
Do not report the task as completed, finished, done, or ready until all of the following are true:
- check_coverage.py reports unprocessed_chunks = 0
- sample_without_status_preview is empty
- sample_status_mismatch_preview is empty
Partial progress may be reported only as progress, never as completion.
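The completion gate above can be expressed as a small check. This is a minimal sketch, assuming coverage.json is a flat JSON object whose keys match the three names listed above and whose preview fields are lists; the actual layout written by check_coverage.py is not specified here. The helpers is_run_complete and load_report are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

PREVIEW_KEYS = ("sample_without_status_preview", "sample_status_mismatch_preview")


def is_run_complete(report: dict) -> bool:
    """Return True only when all three completion conditions hold.

    Assumption: `report` is the parsed coverage report with flat keys.
    A missing key is treated as "not complete" rather than as empty.
    """
    if report.get("unprocessed_chunks") != 0:
        return False
    for key in PREVIEW_KEYS:
        if key not in report or len(report[key]) > 0:
            return False
    return True


def load_report(path: str = "work/coverage.json") -> dict:
    """Read the coverage report from disk (the default path is an assumption)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Treating a missing key as incomplete is a deliberate choice in this sketch: it keeps a truncated or malformed report from being mistaken for a finished run.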
Output files
Use a resumable work layout like this:
work/
  manifest.jsonl
  chunks/
    book_a.chunks.jsonl
  supervision_batches/
    batch_001.jsonl
    batch_002.jsonl
  chunk_status.jsonl
  supervision_merged.jsonl
  validation.json
  coverage.json
chunk_status.jsonl is required for full runs.
Workflow
1. Prepare the corpus
If the user provides markdown files, run:
scripts/build_manifest.py <input_dir> --output work/manifest.jsonl
scripts/split_markdown_book.py <input_md> --output work/chunks/<name>.chunks.jsonl --source-root <input_dir>
If chunk files already exist, skip this step.
2. Process chunks in batches
Process chunk files sequentially in small batches.
Recommended batch size: 20 to 50 chunks.
Use:
scripts/next_unprocessed_chunks.py work/chunks/*.chunks.jsonl --status work/chunk_status.jsonl --limit 25 --output work/next_batch.jsonl
For each chunk in the batch, do the following in order:
- Decide whether the chunk contains reusable knowledge.
- If not, write a skipped record to chunk_status.jsonl.
- If yes, identify all distinct reusable knowledge propositions in the chunk.
- Identify proposition relations such as prerequisite, condition-result, cause-effect, contrast, sequence, category-membership, and exception-override.
- Decide which sample types the chunk can support: concept_qa, process_qa, case_application.
- Generate all strong, non-duplicative supervision records supported by the chunk.
- Write exactly one kept status record for that chunk with per-type counts.
After finishing one batch, continue with the next unprocessed batch until no chunks remain.
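The batch loop depends on knowing which chunks still lack a status record. The following is a minimal sketch of that selection in plain Python, assuming each chunk record and each status record carries a chunk_id field. It is not the implementation of scripts/next_unprocessed_chunks.py, whose internals are not described here; select_unprocessed is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def select_unprocessed(
    chunks: Iterable[Dict],
    statuses: Iterable[Dict],
    limit: int = 25,
) -> List[Dict]:
    """Return up to `limit` chunks that have no status record yet.

    Assumption: both record kinds expose a "chunk_id" key.
    Chunk order is preserved so batches stay sequential.
    """
    done = {record["chunk_id"] for record in statuses}
    batch: List[Dict] = []
    for chunk in chunks:
        if len(batch) >= limit:
            break
        if chunk["chunk_id"] in done:
            continue
        batch.append(chunk)
    return batch
```

An empty return value corresponds to the stopping condition of the loop: no chunks remain.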
3. Status and sample synchronization rule
After every processed batch:
- write or append supervision records for the batch
- write or append status records for the exact same processed chunk ids
- only then merge supervision files
- only then run validation and coverage auditing
Never leave supervision records without status records.
Never mark a chunk as kept unless at least one supervision record was actually written for that chunk.
Never leave a processed chunk without a status record.
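The three "never" rules above can be checked before anything is written to disk. This is a minimal sketch under stated assumptions: samples and statuses are lists of dicts shaped like the schemas in this document, and check_batch_sync and write_batch are hypothetical helpers, not part of the provided scripts.

```python
import json
from collections import Counter
from typing import Dict, List


def check_batch_sync(samples: List[Dict], statuses: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the batch is in sync."""
    problems = []
    sample_counts = Counter(sample["chunk_id"] for sample in samples)
    status_by_chunk = {}
    for status in statuses:
        if status["chunk_id"] in status_by_chunk:
            problems.append(f"duplicate status record: {status['chunk_id']}")
        status_by_chunk[status["chunk_id"]] = status
    # Rule: never leave supervision records without a (kept) status record.
    for chunk_id in sample_counts:
        status = status_by_chunk.get(chunk_id)
        if status is None:
            problems.append(f"samples without status: {chunk_id}")
        elif status["status"] != "kept":
            problems.append(f"samples for non-kept chunk: {chunk_id}")
    # Rule: never mark a chunk as kept without at least one written sample.
    for chunk_id, status in status_by_chunk.items():
        if status["status"] == "kept" and sample_counts[chunk_id] == 0:
            problems.append(f"kept without samples: {chunk_id}")
    return problems


def write_batch(samples, statuses, batch_path, status_path):
    """Append samples first, then statuses, only if the batch is consistent."""
    problems = check_batch_sync(samples, statuses)
    if problems:
        raise ValueError("; ".join(problems))
    with open(batch_path, "a", encoding="utf-8") as handle:
        for sample in samples:
            handle.write(json.dumps(sample, ensure_ascii=False) + "\n")
    with open(status_path, "a", encoding="utf-8") as handle:
        for status in statuses:
            handle.write(json.dumps(status, ensure_ascii=False) + "\n")
```

Checking before writing keeps a failed batch from leaving half-written files that a later coverage audit would flag as mismatches.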
Triage: decide whether a chunk is knowledge-bearing
Keep the chunk if it contains reusable knowledge such as:
- definitions
- functions of entities
- steps in a process
- mechanisms
- comparisons
- causes
- purposes
- rules and constraints
- categories
- enumerations
- conditions, exceptions, or consequences
- operational distinctions that teach reusable knowledge
Skip the chunk if it is mainly:
- acknowledgements
- table of contents
- author lists
- navigation structure
- headings without body
- broken OCR fragments
- page furniture
- index-like lists with no teachable proposition
- pedagogy-only text such as exercises, study prompts, or self-check instructions without reusable knowledge
- pure transition text with no substantive proposition
- generic filler that does not teach a reusable fact, rule, distinction, mechanism, or application pattern
If a chunk contains no teachable knowledge, generate 0 samples and record a skip reason.
Allowed skip reasons
Use only these values for skip_reason:
- navigation
- non_knowledge
- pedagogy
- heading_only
- low_information
- noisy
- broken_ocr
- duplicate_scope
- index_like
Do not invent new skip labels.
Extract knowledge propositions
Before drafting samples, identify the knowledge taught by the chunk.
A knowledge proposition is a distinct reusable statement the model should learn.
Typical proposition types:
- definition
- function
- mechanism
- process_step
- comparison
- cause
- purpose
- rule
- constraint
- category
- enumeration
- condition
- exception
- consequence
If the chunk does not support clear propositions, skip it.
Extract proposition relations
For each chunk kept for supervision, identify any proposition relations that are explicitly supported or can be derived in one grounded step from the chunk:
- prerequisite
- condition_result
- cause_effect
- exception_override
- contrast
- sequence
- category_member
- part_whole
- decision_rule
These relations determine whether the chunk can support process or case supervision. Do not fabricate relations not supported by the source.
Canonicalize knowledge
Transform source statements into concept-level knowledge.
Remove:
- book-relative wording
- section references
- citation framing
- passage language
- chapter-led prompts
- instructional scaffolding such as “in this lesson” or “the following section explains”
- assessment framing such as “students should understand”
The samples must ask about the concept or application itself, not the document.
Route each proposition into the right sample type
Emit concept_qa when the chunk supports atomic knowledge such as:
- definitions
- categories
- purposes
- functions
- rules stated directly
- independent consequences
- concise mechanism descriptions
Emit process_qa when the chunk supports concise grounded reasoning such as:
- applying a rule to stated conditions
- checking a decision path
- tracing a cause-effect link
- resolving a comparison
- ordering process steps
- handling exceptions or constraints
- explaining why one outcome follows and another does not
Emit case_application when the chunk supports scenario reframing such as:
- a realistic situation can be described using only source-grounded concepts
- the answer requires applying one or more source rules or mechanisms
- the case can be solved without introducing external domain facts
Do not force process or case samples from chunks that only support atomic knowledge.
Exhaustive proposition coverage rule
Do not impose a fixed upper limit on sample count per chunk.
The goal is to exhaust the chunk’s reusable knowledge propositions and supported reasoning patterns.
If a chunk teaches five distinct reusable propositions, generate supervision for all five.
If a chunk teaches ten distinct reusable propositions, generate supervision for all ten.
Do not stop early just because the chunk already has “enough” items.
However, exhaustiveness means exhausting distinct knowledge and reasoning patterns, not generating paraphrase variants.
Generate all distinct, supportable, reusable propositions and applications in the chunk, but do not ask multiple questions that test the same proposition with only wording changes.
Prefer proposition coverage over superficial sample count.
What exhaustiveness means
Exhaustiveness includes:
- each distinct definition
- each independent function of an entity
- each rule or constraint
- each exception or condition that materially changes the concept
- each non-overlapping item in a meaningful category or enumeration when the items are teachable
- each comparison where the contrasted sides matter
- each process step only when the step is conceptually meaningful and reusable
- each grounded reasoning path where relation structure materially changes how the knowledge is applied
Exhaustiveness does not include:
- repeating the same fact in multiple phrasings
- turning every sentence into a separate item when several sentences express one proposition
- generating trivial heading-restatement questions
- fragmenting one clean proposition into many low-value samples
- wrapping a simple definition in fake multi-step reasoning
Cross-chunk support rule
Default to generating supervision from the current chunk alone.
If adjacent chunks belong to the same concept and one chunk alone is insufficient for a clean conceptual or process sample, generate a sample anchored to the primary chunk and optionally record supporting chunk ids in metadata.
Do not merge distant chunks or broad chapter themes into one item.
Prefer zero samples over weak samples
Skip the chunk instead of generating supervision when:
- the content is mainly structural or pedagogical
- the content is too generic to teach reusable knowledge
- the only possible questions would merely restate the heading
- all candidate items would be low-distinction paraphrases of the same proposition
- the chunk contains text but no clear, supportable conceptual takeaway
- a case would require too much invented context beyond the chunk
- the only possible reasoning is fake reasoning that merely restates the answer
Exhaustive coverage does not justify weak samples.
How to write grounded reasoning
Reasoning in this dataset is external supervision, not hidden chain-of-thought.
Use short, explicit, domain-grounded reasoning steps that teach a reusable decision pattern. Keep them concise and factual.
Good reasoning characteristics:
- each step is justified by source knowledge
- steps identify the relevant condition, rule, comparison, exception, or causal link
- steps are brief and structured
- the final answer follows naturally from the steps
Bad reasoning characteristics:
- filler such as “first read the question” or “according to the passage”
- meta commentary such as “this question asks about”
- answer restatement disguised as steps
- invented facts not supported by the chunk
- long free-form essays
Sample style
Good concept_qa
Use when teaching reusable knowledge directly.
Question should stand alone.
Answer should:
- answer directly in the first clause
- be self-contained
- teach reusable knowledge
- use clean instructional language
- paraphrase the source unless exact wording is essential
Good process_qa
Use when teaching how to reason with the knowledge.
Question should require applying a source-supported rule, condition, comparison, sequence, or exception.
Reasoning should:
- be 2 to 6 short steps
- name the relevant condition, rule, relation, or exception
- show the minimal grounded path from premises to conclusion
Answer should be brief and directly resolve the question.
Good case_application
Use when the chunk supports scenario transfer without hallucination.
Case should:
- be realistic but generic
- use only source-grounded entities, conditions, rules, and mechanisms
- avoid unnecessary narrative decoration
Analysis should:
- identify which source knowledge applies
- compare the case facts against the relevant rule, mechanism, or exception
- reach the answer in a compact grounded path
Answer should resolve the case directly.
Hard rules
Do not generate source-anchored questions
Avoid phrases such as:
- according to the excerpt
- according to the passage
- according to the text
- according to the book
- according to the framework
- according to the model
- based on the above content
- based on the source chunk
- 根据本节 (according to this section)
- 根据本文 (according to this text)
- 根据这段内容 (according to this passage)
Questions must stand alone.
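A simple substring scan can catch the phrases listed above before a question is written. This is a minimal sketch: the phrase list is copied from this section, the matching is case-insensitive for the English entries, and find_source_anchors is a hypothetical helper. It only detects these literal phrases and does not replace a manual read for other forms of source anchoring.

```python
from typing import List

# Phrases copied from the list above; extend only with phrases the rules name.
BANNED_PHRASES = (
    "according to the excerpt",
    "according to the passage",
    "according to the text",
    "according to the book",
    "according to the framework",
    "according to the model",
    "based on the above content",
    "based on the source chunk",
    "根据本节",
    "根据本文",
    "根据这段内容",
)


def find_source_anchors(text: str) -> List[str]:
    """Return every banned phrase found in `text`, in list order."""
    lowered = text.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]
```

A non-empty result means the question should be rewritten to ask about the concept itself.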
Do not generate citation-led questions
Avoid questions framed around:
- section numbers
- chapter numbers
- book titles
- figure numbers
- statute citations
Ask about the concept instead.
Do not generate meta answers or meta reasoning
Avoid answers or reasoning such as:
- the answer should summarize
- based on the source chunk
- this section mainly discusses
- the passage explains that
- first identify what the question is asking
- this problem tests whether
Answers and reasoning must provide knowledge, not instructions about answering.
Do not create supervision from non-knowledge content
Never generate samples from:
- acknowledgements
- table of contents
- title-only sections
- advisor lists
- navigation headings
- index-only lists
- pedagogy-only scaffolding
- review questions copied from the source without conceptual rewriting
Do not fabricate hidden reasoning
Never invent:
- external facts not supported by the chunk
- latent domain assumptions not present in the source
- extra diagnostic steps added only to sound smart
- cases that require outside knowledge to solve
If the chunk does not support a clean grounded reasoning path, emit concept_qa only.
Question type schema
Use only these question_type values:
- definition
- function
- mechanism
- process
- comparison
- cause
- purpose
- rule
- constraint
- category
- enumeration
- condition
- exception
- consequence
Use singular labels exactly as written above.
Sample schema
Write one JSON object per line.
Required fields for all sample types:
{
  "sample_type": "concept_qa",
  "source_file": "...",
  "chunk_id": "..."
}
concept_qa
{
  "sample_type": "concept_qa",
  "question": "...",
  "answer": "...",
  "source_file": "...",
  "chunk_id": "...",
  "question_type": "definition",
  "metadata": {
    "knowledge_point": "...",
    "supporting_chunk_ids": []
  }
}
process_qa
{
  "sample_type": "process_qa",
  "question": "...",
  "reasoning": [
    "...",
    "..."
  ],
  "answer": "...",
  "source_file": "...",
  "chunk_id": "...",
  "question_type": "rule",
  "metadata": {
    "knowledge_points": ["..."],
    "reasoning_pattern": "rule_application",
    "supporting_chunk_ids": []
  }
}
case_application
{
  "sample_type": "case_application",
  "case": "...",
  "question": "...",
  "analysis": [
    "...",
    "..."
  ],
  "answer": "...",
  "source_file": "...",
  "chunk_id": "...",
  "question_type": "condition",
  "metadata": {
    "knowledge_points": ["..."],
    "task_form": "case_analysis",
    "supporting_chunk_ids": []
  }
}
Use metadata.knowledge_point or metadata.knowledge_points when it helps identify the canonical concept being taught.
Use metadata.supporting_chunk_ids only when adjacent chunks are genuinely needed.
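The schemas above can be checked record by record. This is a minimal sketch, not the behavior of scripts/validate_qa_jsonl.py, which is not described here. It assumes that question_type is expected on every sample (as in the three examples), that metadata is optional, and that the 2 to 6 step guideline for process_qa reasoning is enforced as a hard check; validate_sample is a hypothetical helper.

```python
from typing import Dict, List

QUESTION_TYPES = {
    "definition", "function", "mechanism", "process", "comparison",
    "cause", "purpose", "rule", "constraint", "category",
    "enumeration", "condition", "exception", "consequence",
}
COMMON_FIELDS = ("sample_type", "source_file", "chunk_id")
TYPE_FIELDS = {
    "concept_qa": ("question", "answer"),
    "process_qa": ("question", "reasoning", "answer"),
    "case_application": ("case", "question", "analysis", "answer"),
}


def validate_sample(sample: Dict) -> List[str]:
    """Return a list of schema problems; an empty list means the sample passes."""
    sample_type = sample.get("sample_type")
    if sample_type not in TYPE_FIELDS:
        return [f"unknown sample_type: {sample_type!r}"]
    problems = []
    for field in COMMON_FIELDS + TYPE_FIELDS[sample_type]:
        if not sample.get(field):
            problems.append(f"missing or empty field: {field}")
    # Assumption: every sample carries one of the allowed question_type labels.
    if sample.get("question_type") not in QUESTION_TYPES:
        problems.append(f"invalid question_type: {sample.get('question_type')!r}")
    if sample_type == "process_qa":
        steps = sample.get("reasoning") or []
        if not 2 <= len(steps) <= 6:
            problems.append("reasoning should have 2 to 6 steps")
    return problems
```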
Chunk status schema
Write one JSON object per line to chunk_status.jsonl.
For kept chunks:
{
  "chunk_id": "...",
  "source_file": "...",
  "status": "kept",
  "skip_reason": "",
  "concept_count": 2,
  "process_count": 1,
  "case_count": 1,
  "total_sample_count": 4
}
For skipped chunks:
{
  "chunk_id": "...",
  "source_file": "...",
  "status": "skipped",
  "skip_reason": "navigation",
  "concept_count": 0,
  "process_count": 0,
  "case_count": 0,
  "total_sample_count": 0
}
There must be exactly one final status record per processed chunk.
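Deriving the status record from the samples actually written for a chunk keeps the per-type counts honest. This is a minimal sketch, assuming total_sample_count is the sum of the three per-type counts (as in the kept example above) and that skip_reason must come from the allowed list; make_status_record is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List

SKIP_REASONS = {
    "navigation", "non_knowledge", "pedagogy", "heading_only",
    "low_information", "noisy", "broken_ocr", "duplicate_scope", "index_like",
}
COUNT_FIELDS = {
    "concept_qa": "concept_count",
    "process_qa": "process_count",
    "case_application": "case_count",
}


def make_status_record(
    chunk_id: str,
    source_file: str,
    samples: List[Dict],
    skip_reason: str = "",
) -> Dict:
    """Build one status record: kept if samples exist, otherwise skipped."""
    if samples and skip_reason:
        raise ValueError("a chunk is either kept or skipped, not both")
    if not samples and skip_reason not in SKIP_REASONS:
        raise ValueError(f"skip_reason not in allowed list: {skip_reason!r}")
    counts = Counter(sample["sample_type"] for sample in samples)
    record = {
        "chunk_id": chunk_id,
        "source_file": source_file,
        "status": "kept" if samples else "skipped",
        "skip_reason": skip_reason,
    }
    for sample_type, field in COUNT_FIELDS.items():
        record[field] = counts[sample_type]
    # Assumption: the total is the sum of the three per-type counts.
    record["total_sample_count"] = sum(record[f] for f in COUNT_FIELDS.values())
    return record
```

Because the status is computed from the sample list, a chunk cannot be marked kept with zero samples, and a skipped chunk cannot carry an invented skip label.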
Validation and coverage audit
After each completed batch or after merging batches, run:
scripts/validate_qa_jsonl.py work/supervision_merged.jsonl --report work/validation.json
scripts/check_coverage.py work/chunks/*.chunks.jsonl --status work/chunk_status.jsonl --qa work/supervision_merged.jsonl --report work/coverage.json
If coverage reports any of the following, the run is not complete:
- unprocessed_chunks > 0
- non-empty sample_without_status_preview
- non-empty sample_status_mismatch_preview
If validation passes but coverage is incomplete, continue processing remaining chunks.
Operating principle
The final dataset should read like a general domain-supervision corpus, not a book comprehension exercise.
The objective is exhaustive coverage of reusable knowledge propositions across the corpus, plus grounded reasoning patterns and case application whenever the source supports them, with zero tolerance for structural leakage, fake reasoning, status inconsistency, or paraphrase-only duplication.