// This node runs once for the single batch item.
// It constructs a dynamic prompt for the LLM to analyze all files in the batch.
const item = $input.item;
const fileBatch = item.json.fileBatch;

// 1. Create a string representation of the files for the prompt.
// We give the LLM only what it needs for analysis and identification.
const filesForPrompt = fileBatch.map(f => ({
  filePath: f.originalItem.filePath, // Use the real, full file path
  fileExtension: f.originalItem.fileExtension,
  fileContent: f.originalItem.fileContent
}));
const filesString = JSON.stringify(filesForPrompt, null, 2);

// 2. Define the revised prompt for BATCH analysis.
const batchPrompt = `
## ROLE & EXPERTISE
You are a specialized RAG systems architect with deep expertise in document preprocessing, chunking methodologies, and vector retrieval optimization. Your analysis directly impacts retrieval quality and computational efficiency in production RAG pipelines.

## ANALYSIS CONTEXT
You have been provided with a JSON array of multiple file objects. Your task is to analyze each file object within this array.
- **Pipeline Stage**: Pre-processing for vector embedding and retrieval
- **Target Use Case**: Multi-repository codebase analysis and knowledge extraction

## BATCH OF FILE OBJECTS TO ANALYZE
\`\`\`json
${filesString}
\`\`\`

## CHUNKING STRATEGY OPTIONS

### 'code'
- **Use For**: Programming languages (.py, .js, .java, .cpp, .go, .rs, etc.)
- **Method**: AST-aware recursive splitting respecting function/class boundaries
- **Optimal For**: Preserving semantic code blocks, maintaining context

### 'recursive'
- **Use For**: Structured markup (.md, .html, .xml, .rst, .tex)
- **Method**: Hierarchical splitting on headers, sections, structural elements
- **Optimal For**: Documents with clear logical structure

### 'semantic'
- **Use For**: Natural language content (.txt, documentation, prose)
- **Method**: Sentence/paragraph boundary-aware splitting
- **Optimal For**: Maintaining contextual meaning and readability

### 'do_not_chunk'
- **Use For**: Binary files, small configs, media, or files where chunking destroys utility
- **Method**: Process as a single unit or skip entirely
- **Optimal For**: Preserving file integrity

## ANALYSIS FRAMEWORK
For EACH file object in the provided array, you must perform the following analysis:
1. **Content Structure Assessment**: Analyze the provided content's syntax, formatting, and logical organization.
2. **Semantic Density Evaluation**: Determine information distribution patterns in the actual content.
3. **Context Dependency Analysis**: Identify cross-reference and dependency patterns within the content.
4. **Retrieval Optimization**: Consider how chunks from this specific content will perform in vector similarity search.
5. **Content Summarization**: Create a brief, one-sentence summary of the file's primary purpose or content.

## SIZE & OVERLAP GUIDELINES
- **Code**: 800-1500 chars (preserve function scope), 150-300 overlap
- **Structured Text**: 1000-2000 chars (complete sections), 200-400 overlap
- **Prose/Documentation**: 1200-2500 chars (complete thoughts), 300-500 overlap
- **Configuration**: Assess whether chunking adds value vs. whole-file processing

## OUTPUT REQUIREMENTS
Your entire response MUST be a single, valid JSON object. This object must contain a single key: "analysisResults". The value of "analysisResults" must be a JSON array.
Each element in this array is an object corresponding to a file from the input batch. For each file, the object must have this exact structure:
\`\`\`json
{
  "filePath": "[The full path of the file, copied exactly from the input]",
  "contentSummary": "[A one-sentence summary of the file's purpose based on its content]",
  "chunkingStrategy": "[code|recursive|semantic|do_not_chunk]",
  "reasoning": "[Concise technical justification based on the actual content analysis above]",
  "recommendedChunkSize": [integer|null],
  "recommendedChunkOverlap": [integer|null]
}
\`\`\`

## CRITICAL INSTRUCTIONS
- Analyze EACH file object in the BATCH OF FILE OBJECTS provided above.
- Your output array "analysisResults" MUST contain exactly one object for each file object in the input array.
- The "filePath" in your output objects MUST EXACTLY MATCH the "filePath" from the corresponding input object.
- Base your decision on the content structure, not just the file extension.
- Prioritize retrieval quality over processing speed.
- Consider the file's role in the broader repository context.
- Your entire response must be ONLY the JSON object that strictly adheres to the schema, starting with \`{\` and ending with \`}\`. Do not include markdown, comments, or any other text outside the JSON structure.
`;

// 3. Add the generated prompt to the JSON data.
item.json.batchPrompt = batchPrompt;

// Return the modified item.
return item;