Citing Knowledgebase sources when using a large .csv file as a source instead of many individual .md files

Posted on April 26, 2025

Big Data

Data Analysis

Managed Databases

AI/ML

Databases

Database Dev/Management

By setec

I figured out how to do exactly what I was looking to do in terms of citing sources with .md files stored in Spaces buckets (https://www.digitalocean.com/community/questions/linking-to-source-documents). It’s actually quite easy if you just request the sources/context from the API. Now I have a new challenge: I would like to move to using a .csv file, because I want to add columns of metadata next to the content so that the metadata and content are unified in a single knowledge base.

The problem is that I was relying on the filenames, bucket names and directories to construct the link to the original source. With a single large .csv file, every source ends up as /bucketname/foldername/some_big.csv

Maybe the only solution is to create an individual .csv for each and every .md file. That is an OK solution, but I was curious whether that is the best practice or whether there is a better way, and whether this can be accomplished with a single large .csv.

This may be useful for other projects as well since datasets are often provided as .parquet files which are easy to dump into a large .csv file.

(It would be great if knowledge bases had support for .parquet files directly.)


Bobby

April 28, 2025

Hi there,

Really great to see you pushing the GenAI platform this way. You’re getting into the kind of real-world use cases that can help shape future improvements!

From what I understand, with a single large .csv, it’s tricky to track individual sources properly since everything points back to the same file. Splitting into multiple .csv files (one per logical source) might be the more reliable approach right now, similar to how multiple .md files work. But I’m not 100% sure that’s the only way, so it might be worth checking directly with DigitalOcean Support.
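If you do go the one-file-per-source route, the split itself is easy to script. Here is a minimal sketch using only the Python standard library. It assumes your big .csv has a column that identifies the logical source; the column name `source_id` and the function name `split_csv_by_source` are hypothetical, so adjust them to your data. Uploading the resulting files to your bucket is not shown.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path


def split_csv_by_source(big_csv, out_dir, source_column="source_id"):
    """Write one CSV per distinct value of `source_column`.

    `source_column` is an assumed column name; returns {source value: output path}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group rows by their source value (loads the whole file into memory).
    groups = defaultdict(list)
    with open(big_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[source_column]].append(row)

    written = {}
    for source, rows in groups.items():
        # Make the value safe as a file name: runs of other characters become "_".
        # Note: two different sources could map to the same name; not handled here.
        safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", source).strip("_") or "unnamed"
        path = out_dir / f"{safe_name}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()  # every small file keeps the metadata columns
            writer.writerows(rows)
        written[source] = path
    return written
```

Calling `split_csv_by_source("some_big.csv", "split")` would write files such as `split/<source>.csv`, each keeping the full header row, so the metadata columns stay next to the content while the file path can again be used to build the citation.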

Also, full support for .parquet files or more flexible metadata would definitely be a great improvement.

I’d really encourage you to send this feedback to DigitalOcean Support. You’re raising exactly the kinds of points that could help improve the product and documentation over time.

- Bobby
