Ripgrep and database for NAS

Hi everyone,

I’ve only been using Linux for about a year and would welcome some feedback.

I’m in the process of expanding my network to include a homelab so I can have a database running either on a Linux server or a Synology NAS (~32+ TB). I’m trying to filter data in CSV files that are over 30GB in size so I can quickly query them in the database.

A friend suggested ripgrep to filter the data, but this is where I’ve been stuck researching for weeks.

Can anyone recommend any videos for learning ripgrep?
Any suggestions on software to run on the NAS for the database (MariaDB)?

Thank you for any insight you can offer.

LinuxPup

MariaDB is available in Synology’s Package Manager.

As far as I know, ripgrep is just a much faster grep; IDK about it building databases?
https://awesomeopensource.com/project/BurntSushi/ripgrep

Hello Buffy,
Thank you. I did see MariaDB in the package manager. I just didn’t know if anyone had experience with it and if it worked well with ripgrep.

Ah… thank you. I’ll check out youtube for tutorials on grep.

LinuxPup

Well, I think maybe (depends on your data and what you want to do though), using ripgrep to search for things and then using other tools like awk to extract things from found files might be what you want?

Like, ripgrep will return lines with the matching string/regexp, then you can pipe that output into an awk that will give you just the field(s) you want.
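As a rough illustration of that match-then-extract idea, here is a minimal Python sketch that does in one script what ripgrep piped into awk would do: keep the lines that match a regex, then keep only selected fields. The function name `filter_fields`, the sample rows, and the column positions are made up for illustration, and it assumes no quoted field contains a line break.

```python
import csv
import re
from typing import Iterable, Iterator, List


def filter_fields(lines: Iterable[str], pattern: str, columns: List[int]) -> Iterator[List[str]]:
    """Yield the selected columns of every CSV line that matches pattern."""
    regex = re.compile(pattern)
    needed = max(columns)  # highest column index a row must have
    # Filter lazily so only one line is held in memory at a time.
    matching = (line for line in lines if regex.search(line))
    for row in csv.reader(matching):
        if len(row) > needed:
            yield [row[i] for i in columns]


if __name__ == "__main__":
    # Hypothetical sample data standing in for a large CSV file.
    sample = [
        "id,city,amount\n",
        "1,Denver,10.50\n",
        "2,Boston,7.25\n",
        "3,Denver,3.00\n",
    ]
    for fields in filter_fields(sample, r"Denver", [0, 2]):
        print(fields)
```

Because an open file is also an iterable of lines, the same function could be pointed at a real file with `open("big.csv", newline="")` without loading it all into memory.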

MariaDB is real nice though, just I think you’d need to do something different than just ripgrep for what you want? Like if you wanted to load just some data into it, you can pipe ripgrep output into a Python script that inserts selected fields into a database like MariaDB.
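A minimal sketch of that kind of loader script, with loud assumptions: it uses Python’s built-in `sqlite3` module as a stand-in for MariaDB (a real MariaDB setup would need a separate connector package, and its connection call and placeholder style would differ), and the table name `hits`, the column layout, and the script name are hypothetical. It also assumes the piped lines contain no header row, since the third field is converted to a number.

```python
import csv
import sqlite3
import sys
from typing import Iterable


def load_rows(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Insert fields 0 and 2 of each CSV line into 'hits'; return rows added."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits (id TEXT, amount REAL)")
    before = conn.total_changes
    # Generator: rows are parsed and inserted one at a time, not all at once.
    rows = ((r[0], float(r[2])) for r in csv.reader(lines) if len(r) >= 3)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO hits (id, amount) VALUES (?, ?)", rows)
    return conn.total_changes - before


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Lines arrive on stdin, e.g. from a ripgrep pipe.
        connection = sqlite3.connect(sys.argv[1])
        print(load_rows(connection, sys.stdin), "rows inserted")
        connection.close()
    else:
        print("usage: rg PATTERN file.csv | python load_hits.py DB_PATH")
```

Saved as `load_hits.py` (an arbitrary name), it could be run as `rg 'Denver' big.csv | python load_hits.py hits.db`, so ripgrep does the fast filtering and the script only sees the matching lines.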

Hello Buffy,

Awesome! Thank you…

Have a great day!

I added a bit on my post :smiley_cat:

Thank you. I’ll need to do more reading on Python scripts. I was accustomed to manipulating CSV files with MS Excel and pivot tables. I’m just getting larger files now so I need to expand my toolkit.

I think if you want to do that kind of stuff on a larger scale, then the Python package “pandas” is what you’ll want.

There’s a nice one-stop-install project called Anaconda that is real ez to install and use.

There’s lots of good books and videos about it too. :smiley_cat:
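For a file too large to open in one go, one common pandas approach is to read it in chunks and keep only the rows of interest. This is a minimal sketch under assumptions: the helper name `filter_csv_in_chunks`, the column names, and the sample data are invented for illustration, and it assumes the filtered result (not the whole file) fits in memory.

```python
import io

import pandas as pd


def filter_csv_in_chunks(source, column, value, chunksize=100_000):
    """Return all rows where column == value, reading chunksize rows at a time."""
    matches = []
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        matches.append(chunk[chunk[column] == value])
    return pd.concat(matches, ignore_index=True)


if __name__ == "__main__":
    # Hypothetical sample standing in for a multi-gigabyte CSV file.
    data = "id,city,amount\n1,Denver,10.5\n2,Boston,7.25\n3,Denver,3.0\n"
    result = filter_csv_in_chunks(io.StringIO(data), "city", "Denver", chunksize=2)
    print(result)
```

In practice `source` would be a file path, and the filtered frame could then be summarized with `groupby` or `pivot_table`, which covers much of what Excel pivot tables do.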

Hi Buffy,

Thank you very much for your feedback and insight. I’ll check it out.

You’re very welcome! :smile_cat:

My $0.02 on Big CSV files.

I can’t speak to ripgrep specifically; however, there are other ways to handle this.

First, just don’t use CSV or a traditional RDBMS for non-relational big data. Find an alternative file format that is designed for fast seeks without the need for writing (updating) to the files: Parquet, Avro, ORC, etc.

Here’s a small article describing some common Big Data file formats and their relative use cases.

If you absolutely must have your data in a database, have a look at ClickHouse. It’s insanely fast. It can easily run complex queries over tens of GB of data in seconds. You can use your NAS as the block storage device (iSCSI) for the database, then run the DB server from any node you wish. You will be hard-pressed to find a faster analytics database anywhere.

Anaconda Python, as @Buffy pointed to, has many Big Data modules: pandas, Apache Arrow (PyArrow), PySpark, etc. Also, you can probably just skip the whole Excel step and use Jupyter Notebooks for visualization if desired. That may not fit your use case, but they sure are nice.

Wow, thank you KI7MT! I’ll check out the links you listed.

Have a great evening!