I’ve only been using Linux for about a year and would welcome some feedback.
I’m in the process of expanding my network to include a homelab so I can have a database running either on a Linux server or a Synology NAS (~32+ TB). I’m trying to filter data in CSV files that are over 30GB in size so I can quickly query them in the database.
A friend suggested ripgrep to filter the data, but this is where I keep getting stuck; I’ve been researching for weeks.
Can anyone recommend any videos to learn ripgrep?
Any suggestions on software to run on the NAS for the database (MariaDB)?
Hello Buffy,
Thank you. I did see MariaDB in the package manager. Just didn’t know if anyone had experience with it and whether it worked well with ripgrep.
Ah… thank you. I’ll check out YouTube for tutorials on grep.
Well, I think (depending on your data and what you want to do) using ripgrep to search for things and then using other tools like awk to extract fields from the matching files might be what you want?
Like, ripgrep will return the lines matching your string/regexp, and then you can pipe that output into awk to pull out just the field(s) you want.
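A minimal sketch of that pipeline (the file name, columns `id,company,amount`, and the search term are all made up for illustration):

```shell
# Keep only lines mentioning "ACME", then print fields 1 and 3 (id, amount).
# -N suppresses line numbers; -I (--no-filename) suppresses the file name,
# so awk sees clean CSV lines.
rg -N -I 'ACME' big.csv | awk -F',' '{print $1 "," $3}' > acme_subset.csv
```

For 30 GB files this streams line by line, so memory use stays flat regardless of file size.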
MariaDB is really nice, though; I just think you’d need something more than ripgrep alone for what you want? Like, if you wanted to load just some of the data into it, you could pipe ripgrep output into a Python script that inserts the selected fields into a database like MariaDB.
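Something like this sketch, maybe. It reads CSV rows from stdin (e.g. piped from ripgrep) and inserts selected fields into a database. I’m using sqlite3 from the standard library here so the example is self-contained; for MariaDB you’d swap in a connector such as mysql-connector-python and change the `connect()` call. The table and column names (`sales`, `id`, `company`, `amount`) are hypothetical.

```python
import csv
import sqlite3
import sys

def load_rows(lines, conn):
    """Insert CSV lines (id,company,amount) into a 'sales' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, company TEXT, amount REAL)"
    )
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?)",
            (int(row[0]), row[1], float(row[2])),
        )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("sales.db")
    load_rows(sys.stdin, conn)
    conn.close()
```

Then you’d run it like `rg -N -I 'ACME' big.csv | python load.py`.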
Thank you. I’ll need to do more reading on Python scripts. I was accustomed to manipulating CSV files with MS Excel and pivot tables. I’m just getting larger files now, so I need to expand my toolkit.
I can’t speak to ripgrep specifically; however, there are ways.
First, just don’t use CSV or a traditional RDBMS for non-relational big data. Find an alternative file format that is designed for fast seeks without the need to write to (update) the files: Parquet, Avro, ORC, etc.
Here’s a small article describing some common Big Data file formats and their relative use cases.
If you absolutely must have your data in a database, have a look at ClickHouse. It’s insanely fast: it can easily scan (query) tens of GB of data in seconds, even with complex queries. You can use your NAS as the block storage device (iSCSI) for the database, then run the DB server from any node you wish. You will be hard-pressed to find a faster analytics database anywhere.
Anaconda Python, as @Buffy pointed out, has many big-data modules: pandas, Apache Arrow (PyArrow), PySpark, etc. You could probably also skip the whole Excel step and use Jupyter notebooks for visualization if desired; that may not fit your use case, but they sure are nice.
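With pandas you can stream a 30 GB CSV in chunks so it never has to fit in memory, which covers the “pivot table on a file too big for Excel” case. A sketch, assuming the column name `company` and the filter value (both hypothetical):

```python
import pandas as pd

def filter_csv(src, chunksize=100_000):
    """Yield filtered chunks from a CSV source (path or file-like object)."""
    for chunk in pd.read_csv(src, chunksize=chunksize):
        yield chunk[chunk["company"] == "ACME"]

# Example usage: collect the filtered rows, then write them out, e.g.
#   subset = pd.concat(filter_csv("big.csv"))
#   subset.to_parquet("subset.parquet")   # needs pyarrow or fastparquet
```

Only one chunk (here, 100k rows) is in memory at a time, so this scales to files far larger than RAM as long as the filtered result fits.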