Week #622 & #623

Friday, January 20th, 02023 at 12:21 UTC

Week #622

The Hyperion project took a small turn this week. Our original deadline is a few months away, but we’ve been offered some big benefits if we can manage to bring it forward! We’ve spent this week figuring out what we can cut to make that new, earlier deadline and getting approval from various stakeholders. The short answer is that we’re going for it!

We published a very long article entitled TIL 02022. Over the entire last year, we saved a few headlines each morning. We collected nearly 800 and tried to extract some interesting trends. It is entirely subjective and passed through several filters, but we’ll see whether this is useful and how we can improve our world view.

We have a few hourly scheduled tasks to do some web scraping like it’s 01997! One recently broke, so we spent some time investigating it this week. There are a lot of different ways to extract data from a webpage: XSLT, Regular Expressions, Machine Learning, but we settled for the simple “find string index”. The issue was that this particular webpage used to be published as HTML; now it is an HTML shell and everything is injected via JavaScript. 🤮 Nonetheless, we updated our search strings and extracted what we needed from the JSON instead of the HTML. Things seem to be working fine again, but now we wait for them to update the webpage to confirm (at least the false positives have stopped).
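As a rough illustration of the “find string index” approach, here is a minimal Python sketch. The page content, the marker strings and the helper name extract_between are all made up for the example; the real page and search strings are different, and this is only one way such a scraper might look.

```python
import json


def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Hypothetical HTML shell: the visible content lives in an embedded JSON blob.
page = (
    '<html><body><div id="app"></div>'
    '<script id="data" type="application/json">'
    '{"headline": "Hello", "count": 3}'
    '</script></body></html>'
)

raw = extract_between(page, '<script id="data" type="application/json">', '</script>')
data = json.loads(raw) if raw is not None else None
```

Returning None when a marker is missing makes it easy for the scheduled task to tell "the page layout changed" apart from "nothing new was published".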

Week #623

For a few days, here and there, we took down various festive items around the office. Last year is now done and after a little clean, we’re now facing 02023 head-on.

One of the more monotonous tasks that popped up this week (and probably will for a while) is that our old company credit card expires this month. Several services have emailed us encouraging us to update the info so that their services are not terminated. So we duly log in, enter the new info and forget about it for a few years.

This week we sent out s02e01 of ⪮ Good Morning. We changed the format slightly to focus on four links that you should find interesting. We send it out once a month and publish it online as well. We’re excited for another 12 months and will continue to iterate and improve.

For one of our projects, we had Heroku for hosting and AWS RDS MySQL for the database. Having those two as separate services was adding around 60ms to every database request! It was taking between half and one second to return a batched API call. That’s way too much, so we started to go down the rabbit hole of optimizations. Firstly, we moved the database from AWS RDS into Heroku PostgreSQL. This adds cost, but it is going to be easier to maintain. It also reduced the average request from 600ms to around 20ms! It was a 25-30x increase in speed, which should translate into similar gains in the traffic we can handle.

The next thing we started to do was go through all the API calls on the server-side. Many of them, as written, use Flask and SQLAlchemy in Python to convert the data in the database into models. This is handy, but sometimes it means searching and converting lots more data than needed. We went function-by-function, looked at what we actually used from the model and re-wrote the SQL query to only get the columns we actually used rather than everything. This should decrease memory use and improve response times.

Next week’s optimization is to condense some of these batch API calls. We are calling the same function several times, but with just one different parameter. (We might request user info 4 times, each time for a different user_id.) We want to change the function to accept a single value or a list. Then we can use a single database call to get all the info rather than do it multiple times. This will require changes to the client as well as the backend, but should also get us some big performance benefits.
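To make the last two ideas concrete, here is a small sketch of a function that accepts a single user_id or a list and fetches only the needed columns in one query. It uses Python’s built-in sqlite3 module purely so the example is self-contained; it is not the Flask and SQLAlchemy code described above, and the users table, its columns and the get_users name are assumptions for illustration.

```python
import sqlite3


def get_users(conn, user_ids):
    """Accept one id or a list of ids and return {id: name} from a single query."""
    if isinstance(user_ids, int):
        user_ids = [user_ids]
    ids = list(user_ids)
    if not ids:
        return {}
    # One placeholder per id, so the values are still passed as parameters.
    placeholders = ", ".join("?" for _ in ids)
    # Ask only for the columns the caller uses instead of the whole row.
    query = f"SELECT id, name FROM users WHERE id IN ({placeholders})"
    return {row[0]: row[1] for row in conn.execute(query, ids)}


# Hypothetical table with an extra, larger column that the API never returns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", "long bio"), (2, "Grace", "long bio"), (3, "Linus", "long bio")],
)

one = get_users(conn, 2)         # single value
many = get_users(conn, [1, 3])   # one database call instead of two
```

Keeping the single-value form working means existing client calls do not break while the client is being updated to send lists.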

Finally, next month we will be conducting several parent surveys. Two weeks ago, we set them all up and have already started to collect all the necessary information from our customers. We had one request internally for some small improvements. At the end of each survey, we have several steps to close it down. We export anonymous CSV data and an SPSS file, back up the survey syntax, downgrade the servers and databases, and purge all the identifiable data for participants who had not started. We had hard-coded the database columns which we considered personal information. As we take on more projects, we needed the definition of what is personal information to be dynamic, based on the imported lists. Since we built our own data collection tool, we have full control of how everything works and added more information into the instructions of our CSV importer.
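One way to picture the move from hard-coded to dynamic personal columns is the sketch below. The “pii:” header prefix, the column names and both function names are hypothetical; the notes above do not say how the importer actually marks personal information, so treat this as an assumption-laden illustration only.

```python
PII_PREFIX = "pii:"  # hypothetical marker an imported list could use in its header


def split_header(header):
    """Return (clean column names, set of columns flagged as personal)."""
    names = []
    personal = set()
    for raw in header:
        if raw.startswith(PII_PREFIX):
            name = raw[len(PII_PREFIX):]
            personal.add(name)
        else:
            name = raw
        names.append(name)
    return names, personal


def purge_not_started(rows, personal):
    """Blank the personal columns for participants who never started."""
    purged = []
    for row in rows:
        if row.get("started"):
            purged.append(dict(row))
        else:
            purged.append({k: (None if k in personal else v) for k, v in row.items()})
    return purged


header = ["id", "pii:email", "pii:child_name", "school", "started"]
names, personal = split_header(header)

rows = [
    dict(zip(names, [1, "a@example.com", "Sam", "North", True])),
    dict(zip(names, [2, "b@example.com", "Alex", "South", False])),
]
cleaned = purge_not_started(rows, personal)
```

Because the set of personal columns comes from each imported list, a new project with different fields would not need a code change to be purged correctly.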

Fluxcapacitor

Back in 02010, we wrote about our TweetCC project. At the time, it wasn’t ever 100% clear who owned your tweets and if it was possible to re-publish them as fair-use in print or elsewhere. We created a service and account @tweetCC so anyone could will their tweets into the Creative Commons. This was when Twitter’s API was much more open and developer friendly. The project served its purpose at the time and then dwindled.

In 02021, we started to publish the first time a word was used on this website. We wrote about the process of First Word Instances, and have been publishing them to the @optionalBot twitter account. It was a fun and interesting way to see our writing and publishing evolve. We’ll have to find a new home for this in the near future.

Bric-à-brac

lived in Edinburgh long enough to recognize the top of Usher Hall, Lothian Road, St Mary’s Cathedral and that general Scottish look. This is taken from the castle. https://t.co/36DJj8ICbi pic.twitter.com/GIIxVe5De3
— Brian Suda (@briansuda) January 19, 2023

The photo on the left has been shared around in relation to the pay gap between the people who created ChatGPT and the Kenyan workers cleaning it up. It is an NYTimes article behind a paywall, but we immediately recognized where the photo was taken and even found (nearly) the same spot on Google Street View.

Categories: Briefs, Weeknotes

Tags: performance, planning, scraping

(optional.is)
