Using LLMs with your data
Why this matters
Large language models (LLMs) process the text you provide. Depending on the tool you use, that data may:
- be sent to external providers
- be logged or stored
- be used by the provider for monitoring or for model and service improvement
- leave your institutional environment
Choosing the wrong setup can lead to privacy risks, policy violations, or loss of control over your data.
Key question
Can I use this LLM setup with my data?
The answer depends on what data you have and which tool you use.
Step 1: Classify your data
🟢 Low-risk data (generally safe)
- Publicly available text
- Published open-access papers
- Synthetic or generated data
- Non-sensitive notes or drafts
Typically safe to use with any LLM setup.
🟡 Medium-risk data (use caution)
- Internal documents
- Unpublished research
- Draft manuscripts
- Project notes or meeting summaries
Recommended:
- institutional tools (chat/API)
- controlled environments (VM/VRE/HPC)
Avoid:
- pasting into public tools without checking policies
🔴 High-risk data (restricted)
- Personal data (e.g., names, emails, identifiable individuals)
- Sensitive personal data (e.g., health, political views)
- Confidential or proprietary data
- Restricted datasets (e.g., administrative data such as CBS data)
Do NOT use:
- public chat tools
- public APIs
Use instead:
- institutional infrastructure
- secure environments
- HPC or controlled research platforms
Step 2: Understand the tool you are using
Public chat tools
(e.g., ChatGPT-style interfaces)
- Easy to use
- Data may leave your institution
- Limited control
Use only for low-risk data
Public APIs
(e.g., OpenAI-style APIs)
- Programmatic access
- More control than chat tools
- Still external providers
Suitable for low-risk data, and for medium-risk data only if your institution's policy allows it
Institutional tools (chat/API)
- Provided or approved by your university or SURF
- Governed usage and access
- Often preferred for research
Suitable for:
- medium-risk data
- sometimes higher-risk data (depending on setup)
Remote environments (VMs / VREs)
- More control over environment
- Data handling depends on provider
- Can be configured for privacy
Suitable for:
- medium-risk data
- some sensitive data (if properly configured)
HPC / secure environments
- Strict access control
- Data stays within controlled infrastructure
- Designed for sensitive or large-scale workloads
Required for:
- high-risk or restricted data
Rule of thumb
If you would not email the data to an external service, do not share it with a public LLM (including APIs).
Common mistakes
- Copy-pasting sensitive data into public chat tools
- Assuming “research use” automatically allows data sharing
- Ignoring institutional policies
- Using APIs without checking where data is processed
- Forgetting to log or document LLM usage
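A lightweight guard against the first mistake is to scan text for obvious identifiers before it leaves your machine. The sketch below is a minimal illustration in Python, assuming email addresses are the identifier of interest. `EMAIL_PATTERN`, `find_emails`, and `redact_emails` are hypothetical names, and the pattern is deliberately simple. It will not catch names, postal addresses, or indirect identifiers, so it does not replace proper anonymization or an institutional review.

```python
import re

# Deliberately simple pattern (an assumption for illustration): it catches
# common address shapes, not every valid email address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    draft = "Please contact jan@example.org about the interview notes."
    print(find_emails(draft))
    print(redact_emails(draft))
```

A non-empty result from `find_emails` is a signal to stop and reconsider which tool is appropriate. An empty result does not show that the text is free of personal data.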
Good practices
- Check your institution’s AI or data policy
- Prefer institutional tools when available
- Use local or controlled environments for sensitive work
- Remove or anonymize personal data where possible
- Keep track of:
  - model used
  - prompts
  - parameters
- Validate outputs against original data
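The record-keeping practice above can be sketched as a small helper that appends one entry per LLM call to a JSON Lines file. This is only an illustration: the function name `log_llm_call`, the field names, and the file layout are assumptions, not an institutional standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_llm_call(log_path, model, prompt, parameters):
    """Append one record describing an LLM call to a JSON Lines file."""
    record = {
        # UTC timestamp so entries from different machines stay comparable.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "parameters": parameters,
    }
    # Append mode keeps earlier entries; each line is one JSON object.
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A call might look like `log_llm_call("llm_usage.jsonl", "example-model", "Summarize this abstract.", {"temperature": 0.2})`, where the model name and parameter are placeholders. Because the log contains the prompts themselves, store it with the same care as the underlying data.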
When in doubt
If you are unsure:
- Contact your research support or IT services
- Check institutional AI guidance
- Use a more restrictive setup (e.g., institutional or local)
Summary
| Data type | Recommended tools |
|---|---|
| Low-risk | Any (chat, API, local, remote) |
| Medium-risk | Institutional tools, VMs/VREs |
| High-risk | HPC, secure environments only |
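For teams that want to encode the table in a script or a pre-submission check, the mapping can be sketched as a simple lookup. The tool identifiers (`public_chat`, `public_api`, `institutional`, `vm_vre`, `hpc_secure`) and the function `is_recommended` are hypothetical names chosen for this example, and the sets should be adjusted to your institution's policy.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Mirrors the summary table. The tool identifiers are illustrative.
RECOMMENDED_TOOLS = {
    Risk.LOW: {"public_chat", "public_api", "institutional", "vm_vre", "hpc_secure"},
    # Step 1 also lists HPC as a controlled environment for medium-risk data.
    Risk.MEDIUM: {"institutional", "vm_vre", "hpc_secure"},
    Risk.HIGH: {"hpc_secure"},
}


def is_recommended(tool, risk):
    """Return True if the tool appears in the recommended set for this risk level."""
    return tool in RECOMMENDED_TOOLS[risk]


if __name__ == "__main__":
    for risk in Risk:
        print(risk.value, sorted(RECOMMENDED_TOOLS[risk]))
```

A lookup like this only restates the table. It does not classify data for you, and institutional policy takes precedence where the two differ.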