In this article I’m going to talk about five different ways that people use SQL in the data industry.
Number one is ad hoc analysis. I was recently talking to a data scientist who told me he takes his laptop to every meeting. A business manager might ask, “What was the net profit last year in this particular region?”, and he would just run a SQL query on his laptop and give a ready answer.
Many times in discussions and brainstorming sessions you need certain numbers on the spot, and knowing SQL becomes really useful there: you can run a query right in the meeting and get an answer. I think this is the most common situation where people write SQL queries, doing ad hoc analysis.
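As a minimal sketch of such an ad hoc query, here is what answering that net-profit question might look like. The `sales` table, its columns, and the values are all hypothetical stand-ins for whatever database the data scientist is actually connected to; an in-memory SQLite database keeps the example self-contained.

```python
import sqlite3

# Hypothetical example: a tiny in-memory sales table stands in for the
# real production database the data scientist would connect to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, net_profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("APAC", 2021, 120.5), ("APAC", 2021, 80.0), ("EMEA", 2021, 95.0)],
)

# The ad hoc question: what was the net profit last year in this region?
row = conn.execute(
    "SELECT SUM(net_profit) FROM sales WHERE region = ? AND year = ?",
    ("APAC", 2021),
).fetchone()
print(row[0])  # 200.5
```

The point is less the specific query than the turnaround time: one SELECT, one answer, while the meeting is still going on.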
The second use case is report generation, and this is something I have seen used widely throughout the industry, including in my own experience at Bloomberg. Business managers often have a requirement like, “Can we have this type of report generated and sent to me over email every day at nine o’clock in the morning, or weekly, say every Monday at nine o’clock?” We would write a stored procedure containing the SQL query to generate that data, then wrap it in a Python script that produces a CSV file from the query results and emails it to the relevant party. Alternatively, we would upload these reports to an internal portal, where the person could log in and download them.
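A bare-bones sketch of such a report job might look like the following. The `orders` table, the file name `daily_report.csv`, and the data are hypothetical; the real job would query the production database and then hand the file to the email or portal-upload step.

```python
import csv
import sqlite3

# Hypothetical sketch of a daily report job: run the SQL query, dump the
# result to a CSV file, then hand it off for delivery (email or portal).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Acme", 150.0), ("Globex", 320.0)])

rows = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY customer"
).fetchall()

with open("daily_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer", "amount"])  # header row
    writer.writerows(rows)

# In a real job you would now email daily_report.csv (e.g. with smtplib
# and email.message.EmailMessage) or copy it to the portal's shared location.
```

The delivery step is deliberately left as a comment: how the file gets to the manager (SMTP server, portal API) is specific to each company’s setup.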
Now, the way these reports are triggered is with Unix crontab. If you are using Unix as your operating system, which is what we used at Bloomberg, you would write a cron job. A cron job is similar to Windows Task Scheduler: it runs a given script at a given time and frequency.
So we can configure it to run the script every Monday at five o’clock; the script is triggered, it generates the report, and we deliver it by one of the two methods, email or upload to the internal portal. People also use more sophisticated tools such as Azure Data Factory to schedule these reports; there are tons of cloud solutions and tools available for this.
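For reference, a crontab entry for “every Monday at five o’clock” might look like this (the script path is hypothetical; the five fields are minute, hour, day of month, month, and day of week, with 1 meaning Monday):

```
0 5 * * 1 /usr/bin/python3 /opt/reports/generate_report.py
```

You would add this line with `crontab -e`, and cron takes care of running the report script on schedule from then on.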
Once the business manager receives this report, the next typical step is to write formulas that derive or compute additional information from the existing data, or to build visualizations. We already looked at the visualization use case in one of our previous videos, where from the CSV file you create various charts and plots in your Excel file; those plots can then go into a presentation or serve any other purpose.
The next use case is exploratory data analysis, also known as EDA, and machine learning. Machine learning is usually done by data scientists, whereas both data analysts and data scientists do EDA. Here I’m showing you a Jupyter notebook with Python code where we pull data from our AtliQ hardware database. Let’s say your business manager has asked you to either build a machine learning model or perform some exploratory data analysis; this is what you would do.
I have connected to our database, so I can load data directly from the fact_sales_monthly table into a pandas DataFrame and start exploring. How many unique fiscal years do we have? Five. We have 389 product codes and 209 customer codes.
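The loading-and-exploring step from the notebook can be sketched roughly as below. The table contents here are made up, and SQLite stands in for the actual database connection; the real notebook would pass a MySQL connection to `pandas.read_sql` instead.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the fact_sales_monthly table; in the actual
# notebook this data comes from the AtliQ hardware database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales_monthly "
    "(fiscal_year INTEGER, product_code TEXT, "
    "customer_code TEXT, sold_quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_sales_monthly VALUES (?, ?, ?, ?)",
    [(2020, "P1", "C1", 10), (2020, "P2", "C2", 0), (2021, "P1", "C1", 7)],
)

# Load the whole table into a pandas DataFrame.
df = pd.read_sql("SELECT * FROM fact_sales_monthly", conn)

# Quick data exploration: unique counts per column.
print(df["fiscal_year"].nunique())   # 2
print(df["product_code"].nunique())  # 2
```

With the data in a DataFrame, the unique-count, null-check, and filtering steps described next are all one-liners.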
Do we have any null values? Checking for null values and cleaning bad data is a routine job for a data scientist. I also want to figure out whether there are any transactions with a sold quantity of zero, because sold quantity shouldn’t be zero, right? Apparently there are some data errors in my database, so a data scientist would filter out this bad data.
Okay, so this is how you filter out bad data in a pandas DataFrame, and then you continue the exploration. A business manager might ask an ad hoc question such as, “Can you get me the total sold quantity for each year and show me a bar chart?” You can write a query to get that data, plot it with the matplotlib library, and draw insights. You can clearly see that from 2018 to 2021 the sold quantity doubled, so the business was growing rapidly, while in 2022 it declined slightly; we haven’t finished 2022 yet, but we are obviously not going to see double growth again. If you don’t want to write plain SQL queries for all of this, data scientists or data analysts can pull the data into a Jupyter notebook and quickly write some Python code.
They can also share this notebook with business users; nowadays business users are learning Python too, and they can read all the outputs. I’m providing this Jupyter notebook as well: check the download files button below this video and you can open it in your own Jupyter environment.
We have covered a couple of other business questions in the notebook: you will see one more line chart in matplotlib, and a pie chart showing the contribution by the various channels, and so on. Beyond EDA, data scientists can also perform machine learning.
So usually they do all of these things in order: first perform exploratory data analysis and clean the bad data, then write the machine learning code. I’m not showing you any machine learning code because it is out of scope here, but there are many videos on machine learning on my YouTube channel. The next use case is SQL inside BI tools such as Power BI and Tableau.
In Power BI or Tableau you can connect to a data source such as a MySQL database and either pull an entire table as is, or write your own SQL query that filters the data so you pull only what you need. After that you can use all the cool features of Power BI or Tableau to build amazing dashboards. The next use case is ETL and data migration. We already saw what ETL is: if you are a data engineer you will be doing ETL, reading from OLTP databases and transforming the data into OLAP databases. For the reading part you might use SQL queries to read only certain information.
You might not have to read all the tables, so you use SQL there, and to write the data you obviously use SQL as well: INSERT, UPDATE, and so on. So ETL is a widely popular use case of SQL among data engineers, and so are data migrations. We see so many migrations happening nowadays, where people move from, say, Oracle to PostgreSQL, or from Oracle (or any relational database) to a NoSQL database. These database migrations are another big wave in the industry, and there too you need SQL.
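The extract-transform-load flow described above can be sketched in a few lines. Both databases, the table names, and the data are hypothetical; in practice the source would be an OLTP system like MySQL or Oracle and the target an analytical warehouse, but the shape of the SQL is the same.

```python
import sqlite3

# Hypothetical ETL sketch: extract from an "OLTP" source with a selective
# SQL query, transform/aggregate, and load into an "OLAP" target.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE transactions (product TEXT, qty INTEGER, price REAL)")
oltp.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [("widget", 2, 5.0), ("widget", 1, 5.0), ("gadget", 3, 2.0)])

olap = sqlite3.connect(":memory:")
olap.execute("CREATE TABLE sales_summary (product TEXT, revenue REAL)")

# Extract + transform: read only the information you need, already
# aggregated, instead of copying whole tables.
rows = oltp.execute(
    "SELECT product, SUM(qty * price) FROM transactions GROUP BY product"
).fetchall()

# Load: plain INSERTs into the analytical store.
olap.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)

widget_revenue = olap.execute(
    "SELECT revenue FROM sales_summary WHERE product = 'widget'"
).fetchone()[0]
print(widget_revenue)  # 15.0
```

Pushing the aggregation into the SELECT is the key point: SQL does the heavy lifting on both the read side and the write side of the pipeline.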
Now we have covered five use cases specific to the data industry. If you are doing software development, you are obviously using SQL for CRUD operations: CRUD means create, read, update, and delete. That is more of a software engineering job, so you can consider it a sixth use case.
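For completeness, a minimal CRUD round-trip looks like this. The `users` table and the values are hypothetical; any application database would follow the same four-statement pattern.

```python
import sqlite3

# Hypothetical CRUD round-trip: the sixth use case, typical of
# application code written by software engineers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))           # Create
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]  # Read
conn.execute("UPDATE users SET name = ? WHERE id = 1", ("Bob",))          # Update
conn.execute("DELETE FROM users WHERE id = 1")                            # Delete

print(name)  # Alice
```

Every web or desktop application backed by a relational database is built out of these four statement types, which is why CRUD is the bread-and-butter SQL usage for software engineers.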