Hacker News

xx123122
Show HN: Data Engineering Book – An open source, community-driven guide github.com

Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

The Problem: I found that learning resources for modern data engineering are often fragmented and scattered across hundreds of medium articles or disjointed tutorials. It's hard to piece everything together into a coherent system.

The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.

Key Features:

LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.

Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").

Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

Check it out:

Online: https://datascale-ai.github.io/data_engineering_book/

GitHub: https://github.com/datascale-ai/data_engineering_book


baalimago3 hours ago

> "Data is the new oil, but only if you know how to refine it."

Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil, you need to refine it"?

[0]: https://en.wikipedia.org/wiki/Petroleum

hliyan8 hours ago

I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:

> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure

https://github.com/datascale-ai/data_engineering_book/blob/m...

Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...

Edit: perhaps I judged to early. The RAG sections isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...

xx123122op5 hours ago

Appreciate the honest feedback.

esafak10 hours ago

I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.

xx123122op5 hours ago

That's a great point. I completely agree—'Data Engineering for LLMs' is much more accurate given the content.I'll pass this feedback on to the project lead immediately. Thanks for the suggestion.

dotancohen32 minutes ago

Project lead? The post gives the impression that you are a student open sourcing your learning notes.

osamabinladen9 hours ago

this is great and i bookmarked it so i can read it later. i’m just curious though, was the readme written by chatgpt? i can’t tell if im paranoid thinking everything is written by chatgpt

xx123122op5 hours ago

Yes, you are right. We are a team from China and used GPT to help with the English translation. We didn't realize it came across as 'fake warmth.' We appreciate the feedback and will work on making the tone more neutral and concise.

nimonian6 hours ago

I think it was. It's a wall of information, lots of summary tables, fake warmth, and has that LLM smell to it. I'd be very surprised if this wasn't generated text.

Whether it's GPT or not, it needs rewriting.

joshuaissac15 hours ago

dang15 hours ago

Oh thanks! I've switched the top URL to that now. Submitted URL was https://github.com/datascale-ai/data_engineering_book.

I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!

Edit: they did, and I've moved that post to the toptext.

xx123122op5 hours ago

Huge thanks, dang! I really appreciate you rescuing the post from the filter and switching the URL to the English version.And thanks for pinning the context comment; it helps a lot since the project is quite extensive. We're thrilled it struck a chord.

xx123122op5 hours ago

Thanks for sharing the direct link! Much appreciated.

guillem_lefait13 hours ago

The figures in the different chapters are in english (it's not the case for the image in README_en.md).

xx123122op10 hours ago

Thanks for the heads-up! We noticed that discrepancy as well and have just updated the README_en.md with the correct English diagram. It should be displaying correctly now.

alexott6 hours ago

Parquet alone is not for modern data engineering. Delta, Iceberg should be in the list

xx123122op5 hours ago

Thanks for the feedback! I've flagged this for the team member working on that section.We are taking a short break for the Chinese New Year, so updates might be a bit slower than usual.QAQ

Thanks for understanding, and Happy New Year!

[deleted]11 hours agocollapsed

xx123122op10 hours ago

[dead]

[deleted]12 hours agocollapsed

[deleted]12 hours agocollapsed

dvrp14 hours ago

If you are interested in (2026-)internet scale data engineering challenges (e.g. 10-100s of petabyte processing) challenges and pre-training/mid-training/post-training scale challenges, please send me an email to [email protected] !

[deleted]15 hours agocollapsed

rafavargascom15 hours ago

谢谢

How is possible a Chinese publication gets to the top in HN?

xx123122op10 hours ago

Thanks for the support! We believe that code and engineering challenges are universal languages.

We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a Work in Progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!

heliumtera2 hours ago

Just sprinkle a little llm on top and it gets there in no time

rafavargascom15 hours ago

Nevermind.

MUSTANG3038 hours ago

[dead]

hn-front (c) 2024 voximity
source