SHAREPOINT · INTEGRATION
SharePoint and OneDrive: document RAG source for fiduciary and law firms
SharePoint and OneDrive form the document store of most Swiss firms. REST API and Graph for AI indexing, CSOM as legacy.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What are SharePoint and OneDrive?
SharePoint Online and OneDrive for Business are the document stores of the Microsoft 365 platform. OneDrive is the personal store per user (typically 1 TB per user in business plans), SharePoint is the shared team and project store with the concept of sites, libraries, and lists. As of May 2026, in Swiss fiduciary and law firms an estimated 70 to 85 percent of digital client documents sit in SharePoint or OneDrive.
For integrations, three API generations exist in parallel. First: the SharePoint REST API (endpoints under /_api/web/), available since SharePoint 2013 and still used in many existing integrations. Second: the Microsoft Graph API (see Graph topic), which is the recommended path for new integrations and covers files, lists, and sites uniformly. Third: the CSOM (Client-Side Object Model) with a C# or JavaScript library, used by many existing tools. CSOM has been marked as legacy since 2023; new implementations should use Graph.
The most important Graph endpoints for documents are /me/drive/items (OneDrive files), /sites/{site-id}/drive/items (SharePoint library files), /sites/{site-id}/lists/{list-id}/items (SharePoint lists). File bytes flow through the /content sub-endpoint (e.g. /items/{id}/content). For large files there is a resumable upload mechanism with upload sessions.
As of May 2026, the trend is toward modern pages and modern lists. Classic SharePoint sites (with master pages, web parts, site workflows) are rarely created from scratch anymore; new sites use modern pages with reactive web parts and a slimmer API.
Why it matters for Swiss fiduciary
Fiduciary and law firms work document-centrically. Contracts, year-end statements, client correspondence, tax returns, meeting minutes, articles of association, share registers, wage statements, expense receipts. All of that usually sits in SharePoint (shared client sites) or OneDrive (personal notes, drafts). Volumes are substantial: a 10-person firm typically accumulates 50,000 to 500,000 documents in 5 years.
The AI layer over SharePoint/OneDrive is almost always a RAG pipeline. Documents are read via the Graph API, run through an OCR layer (if needed), split into chunks, embedded, and indexed in a vector DB (Qdrant). When a case handler asks a question, the AI searches the vector DB for the most relevant passages and answers with source citations and a direct link to the SharePoint document.
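The retrieval step described above can be sketched in a few lines. This is a minimal in-memory illustration, not the Qdrant implementation: it ranks chunks by cosine similarity and returns each hit with its source link. The field names `vector`, `text`, and `source_url` are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunks: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k most similar chunks, each with text and a source link.

    The chunk fields used here (vector, text, source_url) are hypothetical.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [{"text": c["text"], "source_url": c["source_url"]} for c in ranked[:k]]
```

In production the vector DB performs this ranking server-side; the sketch only shows why every chunk needs to carry a link back to its SharePoint document.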
Three use cases have the highest ROI. First: contract indexing. All client contracts become semantically searchable. Question: "Which clients have a notice period under 30 days?" The AI searches 5,000 contracts and delivers a list with sources.
Second: precedent search for law firms. A new client question is matched against similar prior cases. The AI finds 3 similar cases from the precedent library and returns the related arguments and outcomes.
Third: onboarding knowledge. A new case handler can ask the AI questions an experienced colleague would answer. The AI answers from the internal knowledge base with clear source citations.
How it works
The pipeline has four stations: discovery (which documents exist?), ingestion (fetch file bytes), processing (OCR, chunking, embedding), index (vector DB and metadata).
Discovery runs through Graph calls on SharePoint sites:
```bash
# Get all tenant sites
curl -X GET "https://graph.microsoft.com/v1.0/sites?search=*" \
  -H "Authorization: Bearer $ACCESS_TOKEN"

# Per site, get all drives (document libraries)
curl -X GET "https://graph.microsoft.com/v1.0/sites/{site-id}/drives" \
  -H "Authorization: Bearer $ACCESS_TOKEN"

# Per drive, all files via delta query (incremental)
curl -X GET "https://graph.microsoft.com/v1.0/sites/{site-id}/drives/{drive-id}/root/delta" \
  -H "Authorization: Bearer $ACCESS_TOKEN"
```
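The /delta endpoint is decisive here. Instead of pulling the full set on every run, you fetch only the changes since the last deltaLink. The first run pages through the full inventory; later runs return only changed items, so a 100,000-document tenant can be synchronised incrementally in under 60 seconds when few documents have changed.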
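A minimal sketch of the delta loop, assuming a caller-supplied `fetch_json` function (hypothetical) that performs the authenticated GET and returns the parsed JSON body. Graph pages delta results through `@odata.nextLink` and ends the round with an `@odata.deltaLink`, which is stored for the next run. Deleted items carry a `deleted` facet.

```python
from typing import Callable, Dict, List, Optional, Tuple


def collect_delta(
    start_url: str, fetch_json: Callable[[str], Dict]
) -> Tuple[List[Dict], Optional[str]]:
    """Follow @odata.nextLink pages until Graph returns an @odata.deltaLink.

    fetch_json is a hypothetical helper: it takes a URL and returns the
    parsed JSON response. Returns all items plus the deltaLink to persist.
    """
    items: List[Dict] = []
    delta_link: Optional[str] = None
    url: Optional[str] = start_url
    while url:
        page = fetch_json(url)
        items.extend(page.get("value", []))
        delta_link = page.get("@odata.deltaLink", delta_link)
        url = page.get("@odata.nextLink")  # absent on the last page
    return items, delta_link


def split_changes(items: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Separate items to (re)index from IDs to remove from the index."""
    upserts = [item for item in items if "deleted" not in item]
    deletions = [item["id"] for item in items if "deleted" in item]
    return upserts, deletions
```

On the next run, pass the stored deltaLink as `start_url`; removing deleted items from the vector index is as important as adding new ones.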
Ingestion fetches file bytes:
```bash
curl -X GET "https://graph.microsoft.com/v1.0/sites/{site-id}/drives/{drive-id}/items/{item-id}/content" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -o contract.pdf
```
Processing: PDFs go through an OCR layer (Tesseract, Azure Document Intelligence, or Mistral OCR), Office files (docx, xlsx) through dedicated parsers. Text is sliced into 500- to 1,000-token chunks, each with 100-token overlap. Embeddings are computed with text-embedding-3-small or Cohere embed-multilingual.
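The chunking step can be sketched as a sliding window. This example assumes the text has already been tokenised into a list; a real pipeline would use the tokenizer of the embedding model, and the default of 800 tokens is simply one value inside the 500 to 1,000 range named above.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 800, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a clause that straddles a chunk boundary retrievable from at least one chunk.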
Index: chunks and embeddings land in Qdrant. Metadata (site, library, file path, last-modified, client) is stored as payload. On RAG queries, retrieval filters by client or confidentiality tier.
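As an illustration, the payload and the retrieval filter might look like the following. The filter dictionary follows the JSON shape of Qdrant's filter conditions (`must`, `key`, `match`); the payload field names and the tier values are assumptions for this example. The `matches` function is only a simplified local stand-in for what Qdrant evaluates server-side, included so the behaviour can be checked without a running instance.

```python
from typing import Any, Dict, List


def build_payload(site: str, library: str, file_path: str,
                  last_modified: str, client: str, tier: str) -> Dict[str, Any]:
    """Metadata stored next to each chunk (field names are hypothetical)."""
    return {
        "site": site,
        "library": library,
        "file_path": file_path,
        "last_modified": last_modified,
        "client": client,
        "confidentiality": tier,
    }


def build_filter(client: str, allowed_tiers: List[str]) -> Dict[str, Any]:
    """Filter in Qdrant's JSON shape: exact client, tier within allowed set."""
    return {
        "must": [
            {"key": "client", "match": {"value": client}},
            {"key": "confidentiality", "match": {"any": list(allowed_tiers)}},
        ]
    }


def matches(payload: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Simplified local check covering only the two match forms used above."""
    for cond in flt.get("must", []):
        value = payload.get(cond["key"])
        match = cond["match"]
        if "value" in match and value != match["value"]:
            return False
        if "any" in match and value not in match["any"]:
            return False
    return True
```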
SharePoint/OneDrive RAG in 5 steps
- 01 Register the Graph app, request Files.Read.All and Sites.Read.All permissions, obtain admin consent.
- 02 Set up the discovery pipeline: list sites, list drives, fetch all file metadata via /delta incrementally.
- 03 Define the data-class strategy: which documents are indexed? Whitelist via metadata filter (site tag, sensitivity label).
- 04 Build the ingestion pipeline: download files, OCR for PDFs, Office parser, split into 500-1000 token chunks, embed.
- 05 Set up the Qdrant index with client and permission metadata; retrieval filters by user permission before the LLM call.
When to use
The SharePoint/OneDrive integration is worthwhile from around 5,000 indexed documents and at least 5 people regularly searching for content. Below these values, SharePoint's own search is often enough and the effort for a RAG pipeline is disproportionate.
The integration is especially sensible at law firms with a precedent library, at fiduciary offices with large contract collections, at auditors with document-retention folders. For these profiles search quality is the decisive lever.
The integration does not override the SharePoint permission model. The RAG index honours SharePoint permissions through metadata filters: only chunks from documents the querying user has access to are returned. That is technically delicate but indispensable for professional-secrecy mandates.
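A minimal sketch of the permission check, assuming each chunk carries an `allowed_principals` list (a hypothetical payload field holding the user and group IDs read from SharePoint at indexing time). The check fails closed: a chunk without permission metadata is never returned.

```python
from typing import Dict, Iterable, List


def visible_chunks(chunks: List[Dict], user_principals: Iterable[str]) -> List[Dict]:
    """Keep only chunks whose ACL shares at least one ID with the user.

    user_principals is the user's own ID plus the IDs of their groups.
    Chunks with missing or empty ACL metadata are dropped (fail closed).
    """
    principals = set(user_principals)
    return [
        chunk for chunk in chunks
        if principals & set(chunk.get("allowed_principals") or [])
    ]
```

In practice the same condition would be pushed into the vector DB query as a filter, so that unauthorised chunks never leave the index; the function above shows the rule itself.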
When not to use
If the documents live in another system (Google Drive, Dropbox, NAS, local server shares), the SharePoint integration is not the right lever. A different connector or a migration is needed here.
If document volumes are small (under 1,000 documents) and search happens only occasionally, SharePoint's own search is enough. A RAG pipeline pays off only from substantial volume.
For highly sensitive mandates (such as an attorney-client conversation, an advisory note under professional secrecy) a careful data-class strategy is required. Not every document may land in a vector index. We recommend a metadata-driven indexing whitelist: only documents with the label "RAG-OK" are indexed. Classification happens manually or through a separate AI classification pipeline.
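The whitelist rule can be sketched as a gate in front of ingestion. The `labels` field on the document metadata is an assumption for this example; the label name "RAG-OK" comes from the recommendation above. Documents without any label are excluded by default.

```python
from typing import Dict, List


def is_indexable(doc_meta: Dict, required_label: str = "RAG-OK") -> bool:
    """True only if the document explicitly carries the whitelist label."""
    labels = doc_meta.get("labels") or []  # hypothetical metadata field
    return required_label in labels


def select_for_indexing(docs: List[Dict], required_label: str = "RAG-OK") -> List[Dict]:
    """Filter discovered documents down to the whitelisted ones."""
    return [doc for doc in docs if is_indexable(doc, required_label)]
```

Treating the missing label as "do not index" means a classification gap delays indexing rather than exposing a confidential document.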
Trade-offs
STRENGTHS
- Graph API offers unified and incremental sync via /delta
- SharePoint permissions can be respected in the RAG index
- Resumable uploads and sessions for large files
- Microsoft 365 can host data in Switzerland on request, which suits FADP requirements
WEAKNESSES
- Permission sync is delicate; a faulty implementation can breach confidentiality
- CSOM is marked as legacy; tools built on it must be migrated actively
- OCR for large contract sets can be costly
- Modern vs classic sites have different API semantics in detail
FAQ
What does the SharePoint integration cost?
API usage is included in Microsoft 365. The costs come from the AI components: embedding (CHF 15-50 for an initial index of 50,000 documents), OCR (variable, depending on provider), vector DB storage (Qdrant on-prem from CHF 50/month, Qdrant Cloud from USD 50/month), and LLM calls for answering.
How are permissions respected?
Per file, SharePoint permissions are read at indexing time and stored as metadata in the vector index. On a query the retrieval filters by user permissions before chunks go to the LLM. Important: permission changes must be propagated; otherwise the index is stale. We recommend a daily re-sync of permissions.
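The daily re-sync can be reduced to a comparison of two snapshots. This sketch assumes both are available as dictionaries mapping a document ID to its set of permitted principal IDs; how the snapshots are obtained is outside the example.

```python
from typing import Dict, List, Set, Tuple


def stale_permissions(
    indexed: Dict[str, Set[str]], current: Dict[str, Set[str]]
) -> Tuple[List[str], List[str]]:
    """Compare indexed ACLs with the current SharePoint ACLs.

    Returns (changed, removed): documents whose ACL differs and must be
    updated in the index, and documents that no longer exist in the
    current snapshot and should be removed from the index.
    """
    changed = sorted(
        doc_id for doc_id, acl in indexed.items()
        if doc_id in current and set(acl) != set(current[doc_id])
    )
    removed = sorted(doc_id for doc_id in indexed if doc_id not in current)
    return changed, removed
```

Only the documents in `changed` need their stored permission metadata rewritten; embeddings stay untouched because the text has not changed.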
CSOM or Graph?
Graph for new implementations. CSOM is marked as legacy; Microsoft actively recommends switching. Existing tools with CSOM keep running, but new features only come to Graph.
How do I handle large files?
For files over 4 MB use resumable upload sessions via POST /items/{id}/createUploadSession. You upload in chunks of 320 KiB to 60 MiB; each chunk size must be a multiple of 320 KiB. For downloading large files, split the transfer into multiple calls via Range headers. PDF files over 100 MB are rarely sensible in RAG and should be split before ingestion.
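The arithmetic behind both mechanisms can be sketched as follows. The functions only plan sizes and byte ranges; the HTTP calls themselves are left out. The 60 MiB upper bound is taken from the figures above, and the helper names are made up for this example.

```python
from typing import List, Tuple

KIB = 1024
FRAGMENT_UNIT = 320 * KIB          # upload fragments are multiples of 320 KiB
MAX_FRAGMENT = 60 * 1024 * KIB     # upper bound as stated in the text above


def fragment_size(target_bytes: int, max_bytes: int = MAX_FRAGMENT) -> int:
    """Round a desired fragment size down to a multiple of 320 KiB.

    Never returns less than one unit (320 KiB) or more than max_bytes.
    """
    capped = min(target_bytes, max_bytes)
    units = max(1, capped // FRAGMENT_UNIT)
    return units * FRAGMENT_UNIT


def byte_ranges(total_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Inclusive (start, end) pairs covering total_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


def range_header(start: int, end: int) -> str:
    """Value for the HTTP Range request header of one download part."""
    return f"bytes={start}-{end}"
```

For example, `byte_ranges(file_size, fragment_size(10 * 1024 * KIB))` yields the parts for 10 MiB steps, and each pair can be passed to `range_header` for a partial download.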