I’ve only been using Linux for about a year and would welcome some feedback.
I’m in the process of expanding my network to include a homelab so I can have a database running either on a Linux server or a Synology NAS (~32+ TB). I’m trying to filter data in CSV files that are over 30GB in size so I can quickly query them in the database.
A friend suggested ripgrep to filter the data, but this is where I keep getting stuck; I’ve been researching for weeks.
Can anyone recommend any videos to learn ripgrep?
Any suggestions on software to run on the NAS for the database (MariaDB)?
Hello Buffy,
Thank you. I did see MariaDB in the package manager. Just didn’t know if anyone had experience with it and whether it worked well with ripgrep.
Ah… thank you. I’ll check out YouTube for tutorials on grep.
Well, I think (depending on your data and what you want to do) using ripgrep to search for things and then using other tools like awk to extract fields from the matching files might be what you want?
Like, ripgrep will return the lines matching your string/regexp, and then you can pipe that output into awk to pull out just the field(s) you want.
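A minimal sketch of that pipeline (the file name, columns `id,company,amount`, and the search term are all made up for illustration):

```shell
# Keep only lines mentioning "ACME", then print fields 1 and 3 (id, amount).
# -N suppresses line numbers; -I (--no-filename) suppresses the file name,
# so awk sees clean CSV lines.
rg -N -I 'ACME' big.csv | awk -F',' '{print $1 "," $3}' > acme_subset.csv
```

For 30 GB files this streams line by line, so memory use stays flat regardless of file size.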
MariaDB is really nice, though; I just think you’d need something more than ripgrep alone for what you want? Like, if you wanted to load just some of the data into it, you could pipe ripgrep output into a Python script that inserts the selected fields into a database like MariaDB.
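Something like this sketch, maybe. It reads CSV rows from stdin (e.g. piped from ripgrep) and inserts selected fields into a database. I’m using sqlite3 from the standard library here so the example is self-contained; for MariaDB you’d swap in a connector such as mysql-connector-python and change the `connect()` call. The table and column names (`sales`, `id`, `company`, `amount`) are hypothetical.

```python
import csv
import sqlite3
import sys

def load_rows(lines, conn):
    """Insert CSV lines (id,company,amount) into a 'sales' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, company TEXT, amount REAL)"
    )
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?)",
            (int(row[0]), row[1], float(row[2])),
        )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("sales.db")
    load_rows(sys.stdin, conn)
    conn.close()
```

Then you’d run it like `rg -N -I 'ACME' big.csv | python load.py`.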
Thank you. I’ll need to do more reading on Python scripts. I was accustomed to manipulating CSV files with MS Excel and pivot tables. I’m just getting larger files now, so I need to expand my toolkit.
I can’t speak to ripgrep specifically; however, there are ways.
First, just don’t use CSV or a traditional RDBMS for non-relational big data. Find an alternative file format that is designed for fast seeks without the need to write to (update) the files: Parquet, Avro, ORC, etc.
Here’s a small article describing some common Big Data file formats and their relative use cases.
If you absolutely must have your data in a database, have a look at ClickHouse. It’s insanely fast: it can easily scan (query) tens of GB of data in seconds, even with complex queries. You can use your NAS as the block storage device (iSCSI) for the database, then run the DB server from any node you wish. You will be hard-pressed to find a faster analytics database anywhere.
Anaconda Python, as @Buffy pointed out, has many big-data modules: pandas, Apache Arrow (PyArrow), PySpark, etc. You could probably also skip the whole Excel step and use Jupyter notebooks for visualization if desired; that may not fit your use case, but they sure are nice.
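With pandas you can stream a 30 GB CSV in chunks so it never has to fit in memory, which covers the “pivot table on a file too big for Excel” case. A sketch, assuming the column name `company` and the filter value (both hypothetical):

```python
import pandas as pd

def filter_csv(src, chunksize=100_000):
    """Yield filtered chunks from a CSV source (path or file-like object)."""
    for chunk in pd.read_csv(src, chunksize=chunksize):
        yield chunk[chunk["company"] == "ACME"]

# Example usage: collect the filtered rows, then write them out, e.g.
#   subset = pd.concat(filter_csv("big.csv"))
#   subset.to_parquet("subset.parquet")   # needs pyarrow or fastparquet
```

Only one chunk (here, 100k rows) is in memory at a time, so this scales to files far larger than RAM as long as the filtered result fits.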