Understanding And Improving Language Models Through A Data-Centric Lens